Toward General and Robust LLM-enhanced Text-attributed Graph Learning

Zihao Zhang    Xunkai Li    Rong-Hua Li    Bing Zhou    Zhenjun Li    Guoren Wang
Abstract

Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from text and edge sparsity, leading to suboptimal performance.

To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified, comprehensive, and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12% and 17.47% in ideal and sparse settings, respectively. Moreover, the performance gain of UltraTAG-S grows as the data sparsity ratio increases, underscoring its effectiveness and robustness.

Machine Learning, ICML

1 Introduction

Figure 1: Performance of different LLM-enhanced TAG learning methods in sparse scenarios. The horizontal axis represents the sparsity ratio of nodes and edges, while the vertical axis denotes classification accuracy. UltraTAG-S exhibits the best robustness.

In recent years, advancements in large language models (LLMs) (Brown et al., 2020) have driven the evolution of graph machine learning, particularly in Text-Attributed Graphs (TAGs) (He et al., 2024a), which combine nodes, edges, and textual data for applications such as social networks and recommendation systems. While graph neural networks (GNNs) (Li et al., 2024a) excel at capturing structural information, they struggle with textual data, necessitating the integration of GNNs and LLMs for TAG learning (Zhu et al., 2024; Duan et al., 2023). Despite progress, existing TAG learning methods still face several limitations:

Figure 2: Overview of UltraTAG for LLM-Enhanced Text-Attributed Graph Learning, which is composed of three independent modules.

Limitation 1: Lack of a unified LLM-enhanced TAG learning framework. Because the current innovation directions of LLM-enhanced TAG learning are disorganized, we recapitulate them from a new perspective: (1) Preprocessing: Data Augmentation (He et al., 2024a; Chen et al., 2024; Wang et al., 2024; Pan et al., 2024), which leverages LLMs to generate enhanced textual representations, such as soft labels, for text augmentation. (2) Feature Engineering: Improved Text Encoder (Chien et al., 2022; Duan et al., 2023), which uses fine-tuned LLMs/LMs to enhance node feature representation. (3) Training: Joint Training Mechanism (Zhao et al., 2023; Zhu et al., 2024; Wen & Fang, 2023; Huang et al., 2024), which enhances performance by improving the interactive training mechanism between GNNs and LMs. However, the diverse optimization strategies and goals hinder unified objectives, slowing systematic progress in TAG learning.

Solution 1: UltraTAG: A Unified Pipeline toward General and Robust LLM-enhanced TAG Learning. To address Limitation 1, we propose UltraTAG, as detailed in Sec. 3. UltraTAG is composed of three key modules: Data Augmentation, Text Encoder, and Training Mechanism, as illustrated in Figure 2. These modules integrate three key directions of LLM-enhanced TAG learning, translating innovative approaches into specific optimization objectives for UltraTAG. UltraTAG is highly adaptable to real-world scenarios, offering a flexible solution for diverse applications. Building on this, we introduce UltraTAG-X, a versatile extension tailored to specific challenges like data sparsity (UltraTAG-S), high noise (UltraTAG-N), or dynamic graphs (UltraTAG-D). This adaptability ensures UltraTAG meets complex demands, setting a new standard for LLM-enhanced TAG learning and providing a foundation for future innovations.

Limitation 2: Lack of a Robust Method. In real-world TAGs, data sparsity in nodes and edges is a common issue. For example, privacy measures (Li et al., 2024b) on social networks may restrict access to users’ information. Developing robust methods that maintain performance under such sparse conditions is a challenge. Current approaches often depend on complete text attributes, making them incompatible with sparse graphs and leading to suboptimal results. Our notion of robustness here is specific to sparsity, i.e., missing node texts or edges; it does not cover data noise such as erroneous node texts, edges, or labels.

Solution 2: UltraTAG-S: An Instance of UltraTAG for Sparse Scenarios. To address Limitation 2, we propose UltraTAG-S, as detailed in Sec. 4. UltraTAG-S is composed of three key modules: (1) LLM-based Robustness Enhancement, (2) LM-based Resilient Representation Learning, and (3) Graph-Enhanced Robust Classifier, as illustrated in Figure 3. To simulate real-world sparse scenarios, we randomly remove node texts and edges from the graph according to a certain ratio. Module 1 of UltraTAG-S employs edge-based text propagation and LLM-based text enhancement to address text sparsity. Module 2 designs a similarity- and PageRank-based important node selector and an LLM-based edge predictor to handle the edge sparsity challenge. Module 3 incorporates a graph structure learning module to further enhance structural robustness.

Our Contributions: (1) A Unified Framework. We adopt a novel perspective to systematically examine existing methods for TAG learning and introduce UltraTAG, a unified and domain-adaptive paradigm that can be extended to UltraTAG-X. (2) A Robust Method. Building on UltraTAG, we propose UltraTAG-S, a robust TAG learning framework designed specifically for sparse scenarios. (3) SOTA Performance. Our proposed UltraTAG-S achieves SOTA performance and the best robustness in evaluations across 7 datasets spanning four distinct domains, not only in ideal but also in sparse scenarios, exhibiting minimal performance degradation, as shown in Figure 1.

2 Related Works

2.1 Graph Learning for Data Sparsity Scenarios

For graph learning in sparse scenarios, existing research primarily focuses on addressing missing node representations, edge absences, or label deficiencies (Rossi et al., 2022; Guo et al., 2023; Zhang et al., 2022). Most of them employ vector completion techniques based on graph propagation or attention to handle these issues. However, there is still a lack of targeted research on sparse scenarios of TAGs.

2.2 Shallow Embedding Methods for TAG Learning

A common approach to TAG learning is shallow embedding, in which text attributes are converted into shallow features such as skip-gram (Mikolov et al., 2013) or BoW (Harris, 1954) vectors, which serve as inputs for graph-based algorithms such as GCN (Kipf & Welling, 2017). While shallow embeddings are simple and efficient, they fail to capture complex semantics and nuanced relationships, limiting their effectiveness.

2.3 LM/LLM-based Methods for TAG Learning

With the rise of LMs like BERT (Devlin et al., 2019), researchers encode textual information in TAGs by fine-tuning LMs on downstream tasks (Zhao et al., 2023; Duan et al., 2023) or aligning LM and GNN parameter spaces via custom loss functions (Wen & Fang, 2023). The emergence of LLMs like GPT-3 (Brown et al., 2020) has further advanced TAG learning, focusing on: (1) text enhancement (e.g., better node descriptions, labels) (He et al., 2024a; Wang et al., 2024; Pan et al., 2024; He et al., 2024b), and (2) superior text encoding for node representations (Zhu et al., 2024; Huang et al., 2024), collectively boosting performance.

Figure 3: Overview of UltraTAG-S for LLM-Enhanced Text-Attributed Graph Learning in Sparse Scenarios.

3 UltraTAG

In this section, we provide a detailed introduction to the three modules of UltraTAG shown in Figure 2: Data Augmentation, Text Encoder, and Training Mechanism.

3.1 Notations

Given a TAG $\mathcal{G}=\{\mathcal{V},\mathcal{T},\mathcal{A},\mathcal{Y}\}$, where $\mathcal{V}$ is the set containing $N$ nodes and $\mathcal{T}$ is the set of texts; for each $i\in\mathcal{V}$, $t_{i}\in\mathcal{T}$ is the text attribute of node $i$. $\mathcal{A}\in\mathbb{R}^{N\times N}$ is the adjacency matrix and $\mathcal{Y}$ is the set of ground-truth labels.

This study focuses on the TAG node classification task. The dataset is split into training nodes $\mathcal{V}_{tr}$ with training labels $\mathcal{Y}_{tr}$ and testing nodes $\mathcal{V}_{te}$ with testing labels $\mathcal{Y}_{te}$. A model $f_{\theta^{*}}$ is trained on $\mathcal{V}_{tr}$ and tested on $\mathcal{V}_{te}$ to generate predictions. The optimization objective is formalized as:

$$f_{\theta^{*}}=\underset{\theta}{\operatorname{argmax}}\ \mathbb{E}_{n\in\mathcal{V}_{tr}}P_{\theta}(\hat{y}_{n}=y_{n}\mid n), \quad (1)$$

where $y_{n}$ is the ground-truth label and $\hat{y}_{n}$ is the model prediction.

3.2 Data Augmentation

TAGs provide raw node texts rather than precomputed representations, making text preprocessing crucial. To enhance text representation, data augmentation from a natural language perspective is effective. Leveraging the capabilities of LLMs, we input $\mathcal{T}$ and generate augmented texts $\mathcal{T}^{\prime}$ using varied prompts:

$$\mathcal{T}^{\prime}=\{t^{\prime}_{i}\mid t^{\prime}_{i}=\text{LLM}(\text{Prompt},t_{i},\alpha_{\text{LLM}}),\ \forall t_{i}\in\mathcal{T}\}. \quad (2)$$

After generating $\mathcal{T}^{\prime}$, the augmented texts are typically aggregated with those of neighboring nodes to produce the final texts $\mathcal{T}^{*}$ as the textual representation:

$$\mathcal{T}^{*}=\{t^{*}_{i}\mid t^{*}_{i}=\text{Agg}(t^{\prime}_{i},\{t^{\prime}_{j}\mid j\in\mathcal{N}_{i}\}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (3)$$

where Agg is the text aggregator (e.g., selection or concatenation) and $\mathcal{N}_{i}$ is the set of neighbors of node $i$.
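As a concrete illustration of Eqs. (2)–(3), the following minimal Python sketch augments each node's text and then aggregates neighbor texts by concatenation. Here `call_llm` is a hypothetical placeholder standing in for a real LLM API call; it is an assumption for illustration, not the paper's actual model.

```python
def call_llm(prompt: str, text: str) -> str:
    # Hypothetical stand-in for LLM(Prompt, t_i, alpha_LLM) in Eq. (2);
    # a real system would query an actual LLM here.
    return f"{prompt}: {text}"

def augment_texts(texts, prompt="Summarize"):
    # T' = { t'_i = LLM(Prompt, t_i) }
    return [call_llm(prompt, t) for t in texts]

def aggregate(texts_aug, neighbors):
    # t*_i = Agg(t'_i, {t'_j : j in N_i}); Agg here is simple concatenation.
    return [
        " [SEP] ".join([texts_aug[i]] + [texts_aug[j] for j in neighbors[i]])
        for i in range(len(texts_aug))
    ]

texts = ["GNNs pass messages.", "LLMs encode text."]
neighbors = {0: [1], 1: [0]}          # N_i derived from the adjacency matrix
t_aug = augment_texts(texts)          # Eq. (2)
t_star = aggregate(t_aug, neighbors)  # Eq. (3)
```

Swapping `call_llm` for a genuine LLM call and `aggregate` for a selection-based Agg recovers the general pipeline.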

3.3 Text Encoder

The given node texts $\mathcal{T}^{*}$ must be encoded into embeddings to facilitate subsequent model processing, which can be efficiently accomplished using LMs or LLMs.

LMs as Encoder. Text encoding typically employs LMs like BERT (Devlin et al., 2019). Fine-tuning LMs on downstream tasks enhances their task-specific encoding capability. For $t_{i}^{*}\in\mathcal{T}^{*}$, this process can be described as:

$$h_{i}=\text{LM}(t_{i}^{*},\theta_{\text{LM}})\in\mathbb{R}^{d},\ \forall t_{i}^{*}\in\mathcal{T}^{*}, \quad (4)$$
$$\theta^{*}_{\text{LM}}=\arg\min_{\theta,\theta_{\text{LM}}}\sum_{i=1}^{N}\text{CE}(\text{MLP}(h_{i};\theta,\theta_{\text{LM}}),y_{i}), \quad (5)$$

where $h_{i}$ is the output of the LM, $d$ is the dimension of the representation, $\theta$ denotes the parameters of the linear classifier, and CE is the cross-entropy loss function.

Meanwhile, various downstream tasks can be used to fine-tune LMs, such as node classification (He et al., 2024a; Duan et al., 2023) or other tasks (Chien et al., 2022).
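The encode-then-classify structure of Eqs. (4)–(5) can be sketched in miniature. Here `lm_encode` is a deterministic pseudo-embedding standing in for a frozen LM (an assumption for illustration; a real implementation would fine-tune a model such as BERT), and only the linear head $\theta$ is trained with cross-entropy, whereas full fine-tuning would also update $\theta_{\text{LM}}$.

```python
import math
import random

def lm_encode(text, dim=8):
    # Hypothetical stand-in for LM(t_i^*, theta_LM) in Eq. (4):
    # a deterministic pseudo-embedding seeded by the text.
    rng = random.Random(sum(ord(c) for c in text))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_head(samples, classes=2, dim=8, lr=0.1, epochs=100):
    # Minimizes sum_i CE(MLP(h_i; theta), y_i), as in Eq. (5); only the
    # linear head W is updated here for brevity.
    W = [[0.0] * dim for _ in range(classes)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for h, y in samples:
            p = softmax([sum(w * x for w, x in zip(row, h)) for row in W])
            total += -math.log(p[y] + 1e-12)
            for c in range(classes):  # gradient of CE w.r.t. row c of W
                g = p[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    W[c][d] -= lr * g * h[d]
        losses.append(total)
    return W, losses

samples = [(lm_encode("graph neural networks"), 0),
           (lm_encode("large language models"), 1)]
_, losses = train_head(samples)  # training loss decreases over epochs
```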

LLMs as Encoder. Leveraging LLMs' language understanding capabilities, their features from different layers capture varying abstraction levels with versatile representations (Zhu et al., 2024). Specifically, for $t_{i}^{*}\in\mathcal{T}^{*}$, we can get $h_{i}^{1},h_{i}^{2},\ldots,h_{i}^{l}$ from different LLM layers:

$$h_{i}^{1},h_{i}^{2},\ldots,h_{i}^{l}=\text{LLM}(t_{i}^{*},\theta_{\text{LLM}})\in\mathbb{R}^{d},\ \forall t_{i}^{*}\in\mathcal{T}^{*}, \quad (6)$$

where $h_{i}^{j},\ j\in[1,l]$ denotes the output of LLM layer $j$ for node $i$ and $l$ is the number of LLM layers selected.

LLM’s interlayer features with multilevel text representations can also be integrated into GNN modules for message passing, optimizing only GNN parameters:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\sum_{j=1}^{l}\text{CE}(\mathcal{M}(h_{i}^{j},\mathcal{A};\theta),y_{i}), \quad (7)$$

where $\mathcal{M}$ denotes the message-passing module of the GNN, $\mathcal{A}$ is the adjacency matrix, and CE is the cross-entropy loss.
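A toy version of the feature flow in Eqs. (6)–(7): per-layer LLM features (simulated here with fixed vectors, an illustrative assumption) are each passed through one mean-aggregation message-passing step playing the role of $\mathcal{M}$.

```python
def message_pass(H, adj):
    # One mean-aggregation step M(h, A): h_i <- mean over {h_i} ∪ {h_j : A_ij = 1}
    n = len(H)
    return [
        [sum(H[j][d] for j in range(n) if adj[i][j] or j == i) /
         (sum(adj[i]) + 1) for d in range(len(H[i]))]
        for i in range(n)
    ]

# Simulated per-layer LLM features h_i^j for 2 nodes and l = 2 layers (Eq. 6).
layer_feats = {1: [[1.0, 0.0], [0.0, 1.0]],   # layer-1 embeddings
               2: [[2.0, 0.0], [0.0, 2.0]]}   # layer-2 embeddings
adj = [[0, 1], [1, 0]]

# Each layer's features are propagated independently, mirroring the sum
# over j in Eq. (7) before the losses are combined.
propagated = {j: message_pass(H, adj) for j, H in layer_feats.items()}
```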

3.4 Training Mechanism

After obtaining the nodes' textual representations $\mathcal{H}=\{h_{1},h_{2},\ldots,h_{N}\}$ from the LM or LLM, feeding them together with the adjacency matrix $\mathcal{A}$ into a GNN yields the final prediction. In terms of training mechanisms, we can use a simple GNN module or combine the GNN with the LM for joint training.

Simple GNN. A simple GNN produces final predictions through downstream task training and inference:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\text{CE}(\text{GNN}(h_{i},\mathcal{A};\theta),y_{i}), \quad (8)$$
$$\text{Output}=\text{GNN}\left(\mathcal{H}=\{h_{1},h_{2},\ldots,h_{N}\},\mathcal{A};\theta^{*}\right), \quad (9)$$

where $\theta$ represents the parameters of the GNN, CE is the cross-entropy loss function, and $h_{i}\in\mathcal{H}$ is the representation of node $i$.
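To make Eqs. (8)–(9) concrete, the sketch below reduces the "GNN" to one mean-aggregation step followed by a nearest-class-mean readout: the fitted per-class means play the role of $\theta^{*}$. This is a deliberately simplified stand-in for an actual trained GNN, not the paper's architecture.

```python
def mean_agg(H, adj):
    # One mean-aggregation step over each node's closed neighborhood.
    n = len(H)
    return [
        [sum(H[j][d] for j in range(n) if adj[i][j] or j == i) /
         (sum(adj[i]) + 1) for d in range(len(H[i]))]
        for i in range(n)
    ]

def fit(Z, train):
    # "Training" (Eq. 8): theta* = per-class mean of training-node embeddings.
    classes = sorted(set(train.values()))
    return [
        [sum(Z[i][d] for i, y in train.items() if y == c) /
         sum(1 for y in train.values() if y == c) for d in range(len(Z[0]))]
        for c in classes
    ]

def predict(Z, theta):
    # Inference (Eq. 9): assign every node to its nearest class mean.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(theta)), key=lambda c: dist(z, theta[c])) for z in Z]

H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # node embeddings h_i
adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
Z = mean_agg(H, adj)
theta_star = fit(Z, {0: 0, 2: 1})   # labels known only for nodes 0 and 2
preds = predict(Z, theta_star)      # predictions for all nodes
```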

GNN with LM. For combined GNN and LM training, the pseudo-labels $\mathcal{Y}_{\text{G}}$ generated by the GNN guide LM training, the pseudo-labels $\mathcal{Y}_{\text{L}}$ generated by the LM guide GNN training, and the cycle repeats:

$$\mathcal{Y}_{\text{G}}=\text{GNN}(\mathcal{H},\mathcal{A};\theta_{\text{G}}),\quad\mathcal{Y}_{\text{L}}=\text{LM}(\mathcal{T};\theta_{\text{L}}), \quad (10)$$
$$\theta^{*}_{\text{G}}=\arg\min_{\theta_{\text{G}}}\sum_{i=1}^{N}\text{CE}(\text{GNN}(h_{i},\mathcal{A};\theta_{\text{G}}),\,y_{i}\in\mathcal{Y}_{\text{L}}), \quad (11)$$
$$\theta^{*}_{\text{L}}=\arg\min_{\theta_{\text{L}}}\sum_{i=1}^{N}\text{CE}(\text{LM}(t_{i};\theta_{\text{L}}),\,y_{i}\in\mathcal{Y}_{\text{G}}). \quad (12)$$

By iteratively training the LM and GNN in this way, the capabilities of both models can be enhanced simultaneously.
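Structurally, the alternating loop of Eqs. (10)–(12) looks like the following sketch. Here `fit` is a hypothetical trainer that simply memorizes the pseudo-labels it is given, a placeholder for actual GNN/LM optimization; the point is the order in which pseudo-labels are exchanged.

```python
def fit(inputs, labels):
    # Hypothetical trainer: returns a "model" that memorizes input -> label.
    table = dict(zip(inputs, labels))
    return lambda x: table[x]

def pseudo_labels(model, inputs):
    return [model(x) for x in inputs]

nodes = ["v0", "v1", "v2"]
gnn = fit(nodes, [0, 1, 0])            # initial GNN, trained on seed labels

for _ in range(2):                     # the cycle of Eqs. (10)-(12)
    y_g = pseudo_labels(gnn, nodes)    # Y_G: pseudo-labels from the GNN (Eq. 10)
    lm = fit(nodes, y_g)               # LM trained on Y_G (Eq. 12)
    y_l = pseudo_labels(lm, nodes)     # Y_L: pseudo-labels from the LM (Eq. 10)
    gnn = fit(nodes, y_l)              # GNN trained on Y_L (Eq. 11)
```

With memorizing trainers the two models agree after one round; with real models, each round refines both parameter sets.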

4 UltraTAG-S

To address the challenge of data sparsity in real-world TAGs, we propose UltraTAG-S, as shown in Figure 3, which is composed of three modules: LLM-based Robustness Enhancement, LM-based Resilient Representation Learning, and Graph-Enhanced Robust Classifier. These modules incorporate designs tailored to sparse scenarios, such as Text Propagation, Structure Augmentation, and Edge Reconfigurator. Note that in this study, we simulate real-world sparse scenarios by deleting node texts and edges in a certain proportion.

4.1 LLM-based Robustness Enhancement

The Data Augmentation module can be divided into Text Propagation, Text Augmentation, and Structure Augmentation.

Text Propagation. Leveraging the homophily principle in graph theory, we posit that adjacent nodes exhibit textual similarity. Inspired by the message-passing mechanism in GNNs, we propagate textual information from neighboring nodes to reconstruct missing text attributes. Additionally, this propagation method enriches the textual representation of normal nodes, serving as a complementary enhancement.

Specifically, for node $v_{i}\in\mathcal{V}$, $t_{i}\in\mathcal{T}$ is its text and $\mathcal{N}_{i}$ is the set of its neighbors; the propagated texts $\mathcal{T}^{\prime}$ are obtained by:

$$\mathcal{T}^{\prime}=\{t^{\prime}_{i}\mid t^{\prime}_{i}=t_{i}\oplus\{t_{j}\mid j\in\mathcal{N}_{i}\},\ \forall t_{i}\in\mathcal{T}\}, \quad (13)$$

where $\oplus$ denotes concatenation of textual information.
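A minimal sketch of Eq. (13): each node's (possibly missing) text is concatenated with its neighbors' texts, so a node whose text was deleted recovers a surrogate from its neighborhood, while nodes with intact text are enriched.

```python
def propagate_texts(texts, neighbors):
    # Eq. (13): t'_i = t_i ⊕ {t_j : j in N_i}; ⊕ is plain concatenation here.
    out = []
    for i, t in enumerate(texts):
        neigh = " ".join(texts[j] for j in neighbors[i] if texts[j])
        own = t if t else ""            # missing text -> rebuilt from neighbors
        out.append((own + " " + neigh).strip())
    return out

texts = ["node A text", "", "node C text"]   # node 1's text was deleted
neighbors = {0: [1], 1: [0, 2], 2: [1]}
filled = propagate_texts(texts, neighbors)   # node 1 inherits its neighbors' texts
```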

Text Augmentation. Leveraging the advanced language comprehension of LLMs, we use prompt engineering to extract critical textual information and enrich data representations. Specifically, we develop a suite of tailored prompts that guide the LLM to generate diverse and contextually relevant key texts, including summaries, key words, and soft labels, thereby enhancing robustness and informativeness.

Specifically, for the propagated texts $\mathcal{T}^{\prime}$, we obtain the augmented texts $\mathcal{T}^{*}$ by LLM inference with different prompts:

$$\mathcal{T}_{\text{Su}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{Su}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (14)$$
$$\mathcal{T}_{\text{KW}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{KW}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (15)$$
$$\mathcal{Y}_{\text{SL}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{SL}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (16)$$
$$\mathcal{T}^{*}=\text{AGG}(\mathcal{T}^{\prime},\mathcal{T}_{\text{Su}},\mathcal{T}_{\text{KW}},\mathcal{Y}_{\text{SL}}), \quad (17)$$

where AGG denotes the text aggregation module (concatenation or selection); $\mathcal{T}_{\text{Su}}$, $\mathcal{T}_{\text{KW}}$, and $\mathcal{Y}_{\text{SL}}$ are the summaries, keywords, and soft labels generated by LLMs, respectively; and $\mathcal{P}_{\text{Su}}$, $\mathcal{P}_{\text{KW}}$, $\mathcal{P}_{\text{SL}}$ are the corresponding tailored LLM prompts.
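The augmentation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm` is a stand-in for the actual LLM call with a tailored prompt, and the AGG module is realized as simple concatenation.

```python
def augment_texts(texts, llm, prompts):
    """Hypothetical sketch of Text Augmentation (Eqs. 14-17).

    `llm(prompt, text)` stands in for the LLM call; `prompts` maps each
    view (summary / keywords / soft label) to its tailored prompt.
    """
    summaries = [llm(prompts["summary"], t) for t in texts]
    keywords = [llm(prompts["keywords"], t) for t in texts]
    soft_labels = [llm(prompts["soft_label"], t) for t in texts]
    # AGG via concatenation: join each original text with all generated views.
    return [" ".join(parts) for parts in zip(texts, summaries, keywords, soft_labels)]

# Toy stand-in LLM that tags its output with the prompt name.
toy_llm = lambda prompt, text: f"[{prompt}] {text[:20]}"
aug = augment_texts(["Graph neural networks learn on graphs."], toy_llm,
                    {"summary": "Su", "keywords": "KW", "soft_label": "SL"})
```

In the selection variant of AGG, one would keep only a subset of the generated views instead of concatenating all of them.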

Structure Augmentation. To mitigate edge sparsity, we introduce a Structure Augmentation module composed of a Virtual Edge Generator, a Node Selector, and an Edge Reconfigurator. This module leverages LLMs to re-identify edges for selected nodes, thereby optimizing the graph structure.

a. Virtual Edge Generator. To preserve the integrity of the graph structure before node selection, we use the soft labels $\mathcal{Y}_{\text{SL}}$ generated by LLMs and compute the embedding similarity between nodes sharing the same soft label:

h_i = \text{LM}(t_i^{*}, \alpha_{\text{LM}}), \quad h_j = \text{LM}(t_j^{*}, \alpha_{\text{LM}}), \quad (18)
\mathcal{S}_{ij} = \cos(h_i, h_j), \quad \forall\, y_i = y_j \ \text{and}\ y_i, y_j \in \mathcal{Y}_{\text{SL}}. \quad (19)

The adjacency matrix with virtual edges is updated to $\mathcal{A}'$:

\mathcal{A}'_{ij} = \begin{cases} 1, & \text{if } \mathcal{A}_{ij} = 1 \ \text{or}\ \mathcal{S}_{ij} > \tau_{1}, \\ 0, & \text{otherwise}, \end{cases} \quad (20)

where $\mathcal{A}$ denotes the adjacency matrix after the sparsification process, and $\tau_{1}$ denotes the similarity threshold for adding edges.
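A minimal NumPy sketch of Eqs. 18–20, under the assumption that LM embeddings and LLM soft labels are already available as arrays (the function name and toy inputs are illustrative, not from the paper):

```python
import numpy as np

def add_virtual_edges(A, H, soft_labels, tau1=0.8):
    """Sketch of the Virtual Edge Generator: connect node pairs that
    share a soft label and whose LM embeddings are cosine-similar.

    A: (N, N) 0/1 adjacency after sparsification.
    H: (N, d) LM embeddings.  soft_labels: (N,) LLM soft labels.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)   # unit-normalize rows
    S = Hn @ Hn.T                                       # pairwise cosine similarity
    same_label = np.equal.outer(soft_labels, soft_labels)
    A_prime = ((A == 1) | ((S > tau1) & same_label)).astype(int)
    np.fill_diagonal(A_prime, 0)                        # no self-loops
    return A_prime

# Toy example: nodes 0 and 1 share a soft label and point the same way.
A_sparse = np.zeros((3, 3), dtype=int)
H = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])
labels = np.array([0, 0, 1])
A_virtual = add_virtual_edges(A_sparse, H, labels, tau1=0.9)
```

Here node pair (0, 1) gains a virtual edge, while node 2 stays isolated because its soft label differs.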

b. Node Selector. Since re-judging all edges is impractical, we design a node selector to pick the set of important nodes $\mathcal{V}_{c}$. We compute the PageRank score of each node in $\mathcal{V}$ and use it as the importance score:

\text{Score}(v_i) = \text{PageRank}(v_i, \mathcal{A}'), \quad (21)
\mathcal{V}_{c} = \{\, v_i \mid \text{Score}(v_i) > \text{Score}(v_k) \,\}, \quad (22)

where $v_k$ denotes the node with the $k$-th largest importance score computed by the PageRank algorithm on the graph with virtual edges.
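The selector of Eqs. 21–22 can be sketched with a plain power-iteration PageRank; this is an illustrative implementation under standard assumptions (damping 0.85, uniform handling of dangling nodes), not the paper's code:

```python
import numpy as np

def select_top_nodes(A_prime, k, damping=0.85, n_iter=100):
    """Sketch of the Node Selector: PageRank on the virtual-edge
    graph A', keeping the k highest-scoring nodes as V_c."""
    N = A_prime.shape[0]
    deg = A_prime.sum(axis=1)
    # Column-stochastic transition matrix; dangling nodes spread uniformly.
    P = np.where(deg > 0, A_prime.T / np.maximum(deg, 1), 1.0 / N)
    score = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        score = (1 - damping) / N + damping * P @ score
    return set(np.argsort(-score)[:k])

# Toy star graph: the hub (node 0) should be selected first.
A_star_graph = np.array([[0, 1, 1, 1],
                         [1, 0, 0, 0],
                         [1, 0, 0, 0],
                         [1, 0, 0, 0]])
top = select_top_nodes(A_star_graph, k=1)
```

In the paper's setting, $k$ is a fixed fraction of the training nodes (10% in the experiments of Section 5.4).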

c. Edge Reconfigurator. For each edge in the complete graph over $\mathcal{V}_{c}$, we use the LLM to re-determine its existence, yielding a confidence score $\mathcal{C}_{ij}$ for edge $e_{ij}$:

\mathcal{C}_{ij} = \text{LLM}(\mathcal{P}_{\text{edge}}, t_i^{*}, t_j^{*}, \alpha_{\text{LLM}}), \quad \forall v_i, v_j \in \mathcal{V}_{c}. \quad (23)

The updated adjacency matrix $\mathcal{A}^{*}$ can be expressed as:

\mathcal{A}^{*}_{ij} = \begin{cases} \mathcal{A}_{ij}, & \text{if } v_i \notin \mathcal{V}_{c} \ \text{or}\ v_j \notin \mathcal{V}_{c}, \\ 1, & \text{if } \mathcal{C}_{ij} > \tau_{2}, \\ 0, & \text{otherwise}, \end{cases} \quad (24)

where $\tau_{2}$ is the confidence threshold for reconfiguration.
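A hedged sketch of Eqs. 23–24: `llm_conf` is a hypothetical stand-in for the LLM scoring call of Eq. 23 (the real system prompts an LLM with the two node texts), and only pairs inside $\mathcal{V}_c$ are rewritten:

```python
import itertools

def reconfigure_edges(A_prime, selected, texts, llm_conf, tau2=0.5):
    """Sketch of the Edge Reconfigurator: re-judge every pair inside
    V_c with a confidence score; edges with an endpoint outside V_c
    are left untouched."""
    A_star = [row[:] for row in A_prime]              # copy the adjacency
    for i, j in itertools.combinations(sorted(selected), 2):
        A_star[i][j] = A_star[j][i] = int(llm_conf(texts[i], texts[j]) > tau2)
    return A_star

# Toy confidence: "connected" iff the two texts share a word.
conf = lambda a, b: 1.0 if set(a.split()) & set(b.split()) else 0.0
A_star = reconfigure_edges([[0, 0, 1], [0, 0, 0], [1, 0, 0]],
                           selected={0, 1},
                           texts=["graph learning", "graph models", "vision"],
                           llm_conf=conf)
```

Node pair (0, 1) gains an edge from the confidence check, while the edge (0, 2), which has an endpoint outside $\mathcal{V}_c$, is preserved as-is.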

4.2 LM-based Resilient Representation Learning

After augmenting the graph $\mathcal{G}^{*} = \{\mathcal{V}, \mathcal{T}^{*}, \mathcal{A}^{*}, \mathcal{Y}\}$, we fine-tune the language model $\text{LM}_{\theta}$ for node classification:

\hat{y}_i = \text{softmax}(W \cdot \text{LM}(t_i^{*}, \theta) + b), \quad \forall\, t_i^{*} \in \mathcal{T}^{*}, \quad (25)

where $W$ is the weight matrix and $b$ is the bias term.

After fine-tuning with the following negative log-likelihood loss $\mathcal{L}$, the node representations $\mathcal{H}$ are computed by:

h_i = \text{LM}(t_i^{*}, \theta^{*}), \quad \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}, \quad (26)

where $N$ and $K$ are the numbers of training nodes and classes, $y_{i,k}$ is the ground-truth label, and $\hat{y}_{i,k}$ is the model prediction.
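The loss of Eq. 26 can be sketched directly in NumPy, assuming the classification head of Eq. 25 has already produced per-node logits (a generic cross-entropy computation, not tied to any particular LM):

```python
import numpy as np

def nll_loss(logits, labels):
    """Sketch of Eqs. 25-26: a softmax head over LM outputs and the
    negative log-likelihood averaged over training nodes."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Confident, correct predictions give a near-zero loss.
loss = nll_loss(np.array([[10.0, 0.0], [0.0, 10.0]]), np.array([0, 1]))
```

In practice this loss is minimized with a gradient-based optimizer while fine-tuning $\theta$; the fine-tuned $\theta^{*}$ then produces the representations $\mathcal{H}$.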

4.3 Graph-Enhanced Robust Classifier

We employ a dual-GNN framework to tackle edge sparsity: one GNN refines the graph structure via representation similarity, while the other performs node classification.

Specifically, given the node representations $\mathcal{H} \in \mathbb{R}^{N \times d}$ and the structure representation $\mathcal{A}^{*} \in \mathbb{R}^{N \times N}$, we first compute the similarity matrix of the node representations:

\mathbf{S} = \text{Norm}(\mathcal{H}^{(1)} \cdot \mathcal{H}^{(1)\top}), \quad \mathcal{H}^{(1)} = \text{GNN}_{1}(\mathcal{H}, \mathcal{A}^{*}). \quad (27)

Then, we update the adjacency matrix $\mathcal{A}^{*}$ while preserving the LLM's judgments unaltered:

\tilde{\mathcal{A}}^{*}_{ij} = \begin{cases} \mathcal{A}^{*}_{ij} + \mathbf{S}_{ij}, & \text{if } v_i \notin \mathcal{V}_{c} \ \text{or}\ v_j \notin \mathcal{V}_{c}, \\ \mathcal{A}^{*}_{ij}, & \text{otherwise}. \end{cases} \quad (28)

We use the updated matrix as the input of $\text{GNN}_{2}(\cdot)$ and jointly optimize $\text{GNN}_{1}(\cdot)$ and $\text{GNN}_{2}(\cdot)$ with the cross-entropy loss:

\mathcal{H}^{(2)} = \text{GNN}_{2}(\mathcal{H}, \tilde{\mathcal{A}}^{*}), \quad \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}, \quad (29)

where $N$ is the number of nodes and $K$ is the number of classes.
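The structure update of Eqs. 27–28 can be sketched as follows, assuming $\mathcal{H}^{(1)}$ has already been produced by $\text{GNN}_1$ and taking min–max scaling as one plausible choice of $\text{Norm}(\cdot)$ (the paper does not pin this down, so treat it as an assumption):

```python
import numpy as np

def refine_adjacency(H1, A_star, selected):
    """Sketch of Eqs. 27-28: add a normalized similarity term
    S = Norm(H1 H1^T) to entries with an endpoint outside V_c,
    while entries judged by the LLM (both endpoints in V_c) stay fixed."""
    S = H1 @ H1.T
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)   # min-max Norm(.)
    in_c = np.zeros(A_star.shape[0], dtype=bool)
    in_c[list(selected)] = True
    llm_fixed = np.logical_and.outer(in_c, in_c)      # both endpoints in V_c
    return np.where(llm_fixed, A_star, A_star + S)

# Toy example: the (0, 1) entry is LLM-fixed; (1, 2) gains soft weight.
H1 = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
A_star = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
A_tilde = refine_adjacency(H1, A_star, selected={0, 1})
```

The resulting $\tilde{\mathcal{A}}^{*}$ is a weighted adjacency matrix, which $\text{GNN}_2$ consumes directly for classification.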

Table 1: The Comparison of Different LLM-enhanced TAG Learning Methods under UltraTAG. The top four methods use LMs, while the bottom five use LLMs. 'XMC' is eXtreme Multi-label Classification, 'Iteration' is iterative training with LM and GNN using pseudo labels, 'Joint' means joint training with multiple GNNs, 'Combined Loss' means using a custom loss function, 'BoW' is bag of words.

Method Data Augmentation Text Encoder Encoder Supervision Training Mechanism
GLEM DeBERTa Node Classification Iteration
GIANT BERT XMC Only GNN
G2P2 RoBERTa Node Classification Combined Loss
SimTeG e5-large/RoBERTa Node Classification Only GNN
TAPE DeBERTa Node Classification Only GNN
ENGINE LLaMA2-7B / Joint
LLMGNN BoW / Only GNN
GraphAdapter LLaMA2-13B Token Prediction Only GNN
UltraTAG-S BERT Node Classification Joint
Table 2: The Comparison of Different LLM-enhanced TAG Learning Methods for Sparse Scenarios Robustness. ’Input’, ’Node’, ’Edge’, ’Training’ denote the consideration of input robustness, node robustness, edge robustness and training robustness.
Robustness Input Node Edge Training
GLEM
GIANT
G2P2
SimTeG
TAPE
ENGINE
LLMGNN
GraphAdapter
UltraTAG-S

5 Experiments

In this section, we analyze the effectiveness of UltraTAG-S through experimental evaluation. To comprehensively assess our approach, we address the following questions: Q1: What are the differences between existing TAG learning methods under UltraTAG? Q2: How does UltraTAG-S perform as a general and robust TAG learning paradigm in ideal and sparse scenarios? Q3: What factors contribute to the performance and robustness of UltraTAG-S? Q4: What are the training time complexity and hyperparameter settings of UltraTAG-S? Details of the datasets and baselines are in Appendices A and B.

5.1 Paradigm Comparison

In this section, we compare the similarities and differences of current LLM-enhanced TAG learning methods under the UltraTAG framework along four aspects, namely Data Augmentation, Text Encoder, Encoder Supervision, and Training Mechanism, as shown in Table 1. LM-based methods (Zhao et al., 2023; Chien et al., 2022; Wen & Fang, 2023; Duan et al., 2023) utilize distinct language models and fine-tuning tasks, while LLM-based methods (He et al., 2024a; Chen et al., 2024) focus on data augmentation. Moreover, new baselines can be formed by optimizing any single one of UltraTAG's four modules.

We also compare the sparse-scenario robustness of different LLM-enhanced TAG learning methods along four dimensions, namely Input Robustness, Node Robustness, Edge Robustness, and Training Robustness, i.e., whether each method considers the robustness of data and performance along these dimensions. As shown in Table 2, only UltraTAG-S (Ours) accounts for robustness across all four dimensions.

5.2 Performance and Robustness Comparison

Figure 4: Robustness Comparison in Sparse Scenarios. The horizontal coordinate represents the sparse ratio of nodes and edges, and the vertical coordinate represents the accuracy of the node classification task.
Table 3: Experimental results of node classification. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Elo-Photo
MLP 54.94±3.68 61.91±1.67 52.13±7.35 62.46±0.70 51.85±10.78 52.91±1.81 47.45±9.18
GCN 74.91±8.71 69.00±2.83 72.54±6.95 73.27±4.62 63.66±1.11 55.00±4.34 69.49±3.93
GAT 71.70±2.75 70.31±1.01 75.86±1.08 68.72±0.90 64.80±0.22 60.36±0.25 64.22±2.58
GCNII 77.23±0.66 71.91±1.05 73.28±1.67 70.12±1.81 65.07±0.59 62.78±0.49 60.60±1.36
GraphSAGE 81.70±1.00 66.68±0.80 68.41±9.59 75.16±0.33 59.65±5.78 53.59±2.24 70.48±6.03
BERT 79.70±0.32 76.88±0.41 90.95±0.11 71.70±1.09 63.50±0.09 58.78±0.05 70.01±0.08
DeBERTa 73.39±4.54 75.16±1.08 90.81±0.20 68.18±4.10 62.40±0.59 59.92±0.45 70.18±0.18
RoBERTa 80.35±0.48 77.04±1.49 91.13±0.11 72.12±0.70 64.67±0.34 59.23±0.06 70.25±0.34
GLEM 87.07±1.01 76.30±2.45 89.56±1.65 74.83±0.95 65.90±0.36 60.88±0.03 77.74±0.27
SimTeG 88.75±0.42 77.37±0.64 88.31±0.75 76.32±0.53 64.29±0.19 61.60±0.88 79.82±0.21
TAPE 89.07±0.56 77.02±0.71 90.38±0.99 80.17±0.18 65.44±0.35 63.01±0.82 82.26±0.64
ENGINE 86.79±0.58 78.03±0.48 91.43±0.13 81.38±0.38 66.27±0.41 62.57±0.13 83.06±0.22
UltraTAG-S 90.96±0.45 78.68±0.21 92.41±0.30 83.05±0.16 66.69±0.14 63.78±0.30 84.70±0.03
Figure 5: Robustness Comparison among All Datasets at Sparse Ratios of 20%, 50%, and 80%.

We conduct a comprehensive evaluation of UltraTAG-S by comparing it with GNN-only, LM-only, and LLM-GNN methods; the results are shown in Table 3. Since GNN-only methods cannot accept texts as input, for a fair comparison we encode the texts with a unified BERT (Devlin et al., 2019) to obtain unified representations as their input. As Table 3 shows, UltraTAG-S outperforms all existing methods on every dataset, with improvements of up to 2.21%.

To simulate the challenges of sparse scenarios, we randomly delete the texts and edges of nodes at ratios of 20%, 50%, and 80%, without considering data noise or additional constraints. As illustrated in Figure 4, our proposed method, UltraTAG-S, demonstrates the best robustness among current TAG learning baselines. Specifically, it exhibits the smallest decline in classification accuracy under sparse scenarios, reflecting superior adaptability.

Detailed results at the sparsity ratio of 80% are shown in Table 4. Even in this extremely sparse setting, UltraTAG-S achieves SOTA node classification accuracy, with performance gains of up to 17.5%. Details for sparsity ratios of 20% and 50% are given in Appendix E, Tables 9 and 10.

As shown in Figure 5, UltraTAG-S consistently achieves the highest accuracy across all datasets under varying sparsity levels, demonstrating optimal robustness. Moreover, its relative advantage grows as data sparsity increases, highlighting its effectiveness under extreme sparsity.

Table 4: Robustness Comparison in Sparse Scenarios with Ratio of 80%, meaning that 80% of nodes' texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Elo-Photo
MLP 30.41±0.59 27.74±0.59 44.25±1.68 26.64±1.17 62.54±1.02 50.93±0.41 46.21±0.61
GCN 41.96±0.73 30.53±0.68 49.68±1.31 54.24±2.29 63.20±0.31 54.38±1.88 51.27±0.33
GAT 38.86±0.59 30.50±0.27 52.03±0.30 52.85±0.71 63.17±0.86 56.41±0.25 50.35±0.85
GCNII 36.79±0.28 31.07±0.97 51.33±0.37 50.83±1.32 62.27±0.84 57.62±1.10 45.53±0.15
GraphSAGE 37.60±0.67 31.69±0.44 50.59±0.71 53.03±0.71 61.70±0.93 58.72±0.81 52.17±0.65
BERT 37.59±0.08 31.50±0.54 49.95±0.04 28.58±1.24 62.50±0.43 51.40±0.15 49.59±0.04
DeBERTa 29.98±1.09 30.80±0.55 42.34±4.11 21.83±1.06 63.59±0.27 50.24±0.28 47.96±1.47
RoBERTa 28.23±0.00 23.32±4.55 47.24±4.20 20.36±0.40 63.68±0.00 50.32±0.21 49.52±0.12
GLEM 49.01±0.58 36.64±1.46 51.48±0.54 52.41±0.76 61.54±0.56 50.82±1.04 56.25±2.14
SimTeG 45.78±0.22 30.40±0.66 54.95±0.61 50.35±0.72 60.61±0.16 58.08±0.12 55.73±0.84
TAPE 47.08±0.20 29.77±0.28 54.87±0.50 59.83±0.77 61.25±0.59 58.10±0.72 59.76±0.12
ENGINE 42.32±0.66 35.70±0.19 54.74±0.09 49.42±0.45 63.88±0.20 57.54±0.77 57.96±0.13
UltraTAG-S 57.57±1.38 40.08±0.45 61.05±0.49 65.60±0.34 64.78±0.67 59.85±0.01 68.79±0.07

5.3 Ablation Study

In this part, we perform an ablation study on the CiteSeer and PubMed datasets to verify the effectiveness and robustness of each UltraTAG-S module, particularly in sparse scenarios. The results for PubMed and CiteSeer are illustrated in Figure 6. It is evident that each module of UltraTAG-S contributes significantly to model performance and robustness.

Specifically, the Text Augmentation module enhances the model’s ability to generalize by introducing diverse textual variations, leading to a performance improvement of up to 16.89% on the CiteSeer dataset and 55.45% on PubMed. This module is particularly effective in scenarios where textual diversity is limited, as it enriches the input data and reduces overfitting. The Structure Augmentation module further contributes to the model’s robustness by optimizing the graph structure, achieving performance improvements of 3.07% on CiteSeer and 5.90% on PubMed, especially in sparse data scenarios. As for the Structure Learning module, it demonstrates even more substantial gains, with improvements of 32.49% on CiteSeer and 40.09% on PubMed, highlighting its ability to capture complex relationships within the graph. It is evident that the Structure Learning module plays the most significant role in enhancing both the effectiveness and robustness of UltraTAG-S, as it not only improves accuracy but also ensures stable performance across varying data conditions. These results underscore the importance of combining these modules to achieve optimal performance in graph-based learning tasks.

Figure 6: Ablation Study on PubMed and CiteSeer. The x-axis represents the modules in the ablation study, where ’w/o TA’, ’w/o SA’, ’w/o SL’ denote the removal of Text Augmentation module, Structure Augmentation module and Structure Learning module, respectively. The y-axis represents accuracy in different ratios.

5.4 Complexity Analysis and Hyperparameter Setting

The computational complexity of our proposed method is primarily determined by two GNN operations. The first GNN computes the similarity matrix $\mathbf{S} \in \mathbb{R}^{N \times N}$ and updates the adjacency matrix $\mathcal{A}^{*}_{ij}$; this step involves pairwise computations between nodes, leading to a complexity of $O(N^{2})$. The second GNN performs node classification using the updated adjacency matrix $\tilde{\mathcal{A}}^{*}_{ij}$ and node features $\mathcal{H}$. With $m$ layers and $\mathcal{E} = O(N^{2})$ edges in the graph, the complexity of this operation scales as $O(m \cdot N^{2})$. Therefore, the total computational cost per epoch is dominated by these two steps, resulting in an overall complexity of $O(m \cdot N^{2})$.

For detailed parameter settings, we employ five random seeds {42, 43, 44, 45, 46}. $\text{GNN}_{1}(\cdot)$ and $\text{GNN}_{2}(\cdot)$ are jointly optimized with the Adam optimizer using a learning rate of 1e-2, weight decay of 5e-4, dropout of 0.5, and 100 epochs. Each GNN consists of 2 layers, with a similarity threshold of 0.8. The number of important nodes selected by PageRank is 10% of the total training nodes, and the acceptance threshold for LLM-based edge reconfiguration is 0.5. For fine-tuning the LM, the learning rate is set to 5e-5, with 3 epochs, a batch size of 8, and dropout of 0.3. The LLM used is Meta-Llama-3-8B-Instruct. All experiments were conducted on a system equipped with an NVIDIA A100 80GB PCIe GPU and an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, with CUDA version 12.4.

6 Conclusion and Future Work

We first propose UltraTAG, a unified and domain-adaptive pipeline framework for LLM-enhanced TAG learning. To address the challenges existing methods face in real-world sparse scenarios, such as missing node texts or missing edges, we further introduce UltraTAG-S, a TAG learning paradigm tailored for sparse scenarios. UltraTAG-S resolves node-text sparsity through LLM-based text propagation and text augmentation strategies, and edge sparsity through PageRank- and LLM-based graph structure learning, achieving state-of-the-art performance in both ideal and sparse scenarios. In the future, we will further explore the pivotal role of text propagation strategies in TAG representation learning.

References

  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
  • Chen et al. (2020) Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks. In International Conference on Machine Learning, ICML, 2020.
  • Chen et al. (2024) Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H., and Tang, J. Label-free node classification on graphs with large language models (llms), 2024. URL https://arxiv.org/abs/2310.04668.
  • Chien et al. (2022) Chien, E., Chang, W.-C., Hsieh, C.-J., Yu, H.-F., Zhang, J., Milenkovic, O., and Dhillon, I. S. Node feature extraction by self-supervised multi-scale neighborhood prediction, 2022. URL https://arxiv.org/abs/2111.00064.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.
  • Duan et al. (2023) Duan, K., Liu, Q., Chua, T.-S., Yan, S., Ooi, W. T., Xie, Q., and He, J. Simteg: A frustratingly simple approach improves textual graph learning, 2023. URL https://arxiv.org/abs/2308.02565.
  • Giles et al. (1998) Giles, C. L., Bollacker, K. D., and Lawrence, S. Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, 1998.
  • Guo et al. (2023) Guo, D., Chu, Z., and Li, S. Fair attribute completion on graph with missing attributes, 2023. URL https://arxiv.org/abs/2302.12977.
  • Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, NeurIPS, 2017.
  • Harris (1954) Harris, Z. S. Distributional structure. 1954. URL https://api.semanticscholar.org/CorpusID:86680084.
  • He et al. (2021) He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention, 2021. URL https://arxiv.org/abs/2006.03654.
  • He et al. (2024a) He, X., Bresson, X., Laurent, T., Perold, A., LeCun, Y., and Hooi, B. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning, 2024a. URL https://arxiv.org/abs/2305.19523.
  • He et al. (2024b) He, Y., Sui, Y., He, X., and Hooi, B. Unigraph: Learning a unified cross-domain foundation model for text-attributed graphs, 2024b. URL https://arxiv.org/abs/2402.13630.
  • Huang et al. (2024) Huang, X., Han, K., Yang, Y., Bao, D., Tao, Q., Chai, Z., and Zhu, Q. Can gnn be good adapter for llms?, 2024. URL https://arxiv.org/abs/2402.12984.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
  • Li et al. (2024a) Li, X., Liao, M., Wu, Z., Su, D., Zhang, W., Li, R.-H., and Wang, G. Lightdic: A simple yet effective approach for large-scale digraph representation learning. arXiv preprint arXiv:2401.11772, 2024a.
  • Li et al. (2024b) Li, Z., Li, R.-H., Liao, M., Jin, F., and Wang, G. Privacy-preserving graph embedding based on local differential privacy, 2024b. URL https://arxiv.org/abs/2310.11060.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692.
  • Mernyei & Cangea (2020) Mernyei, P. and Cangea, C. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901, 2020.
  • Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality, 2013. URL https://arxiv.org/abs/1310.4546.
  • Ni et al. (2019) Ni, J., Li, J., and McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proc. of EMNLP, 2019.
  • Pan et al. (2024) Pan, B., Zhang, Z., Zhang, Y., Hu, Y., and Zhao, L. Distilling large language models for text-attributed graph learning, 2024. URL https://arxiv.org/abs/2402.12022.
  • Rossi et al. (2022) Rossi, E., Kenlay, H., Gorinova, M. I., Chamberlain, B. P., Dong, X., and Bronstein, M. On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features, 2022. URL https://arxiv.org/abs/2111.12128.
  • Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 2008.
  • Singh & Sachan (2014) Singh, G. and Sachan, M. Multi-layer perceptron (mlp) neural network technique for offline handwritten gurmukhi character recognition. In 2014 IEEE International Conference on Computational Intelligence and Computing Research, pp.  1–5, 2014. doi: 10.1109/ICCIC.2014.7238334.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.
  • Wang et al. (2024) Wang, Y., Zhu, Y., Zhang, W., Zhuang, Y., Li, Y., and Tang, S. Bridging local details and global context in text-attributed graphs, 2024. URL https://arxiv.org/abs/2406.12608.
  • Wen & Fang (2023) Wen, Z. and Fang, Y. Augmenting low-resource text classification with graph-grounded pre-training and prompting. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pp.  506–516. ACM, July 2023. doi: 10.1145/3539618.3591641. URL http://dx.doi.org/10.1145/3539618.3591641.
  • Yan et al. (2023) Yan, H., Li, C., Long, R., Yan, C., Zhao, J., Zhuang, W., Yin, J., Zhang, P., Han, W., Sun, H., et al. A comprehensive study on text-attributed graphs: Benchmarking and rethinking. In Proc. of NeurIPS, 2023.
  • Zhang et al. (2022) Zhang, W., Yin, Z., Sheng, Z., Li, Y., Ouyang, W., Li, X., Tao, Y., Yang, Z., and Cui, B. Graph attention multi-layer perceptron. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pp.  4560–4570. ACM, August 2022. doi: 10.1145/3534678.3539121. URL http://dx.doi.org/10.1145/3534678.3539121.
  • Zhao et al. (2023) Zhao, J., Qu, M., Li, C., Yan, H., Liu, Q., Li, R., Xie, X., and Tang, J. Learning on large-scale text-attributed graphs via variational inference, 2023. URL https://arxiv.org/abs/2210.14709.
  • Zhu et al. (2024) Zhu, Y., Wang, Y., Shi, H., and Tang, S. Efficient tuning and inference for large language models on textual graphs, 2024. URL https://arxiv.org/abs/2401.15569.

Appendix A Datasets

This section provides a detailed introduction to the datasets used in the main content. The statistics of the TAG datasets we use are summarized in Table 5. The details of each dataset are as follows:

Table 5: Statistics of the TAG datasets. The datasets are partitioned in Train-Val-Test-Out mode, where 'Out' denotes data not involved in the training, validation, or test sets. All datasets are evaluated by node classification accuracy.

Dataset #Nodes #Edges #Classes #Split Ratio(%)
Cora 2,708 5,278 7 60-20-20-0
CiteSeer 3,186 4,277 6 60-20-20-0
PubMed 19,717 44,324 3 60-20-20-0
WikiCS 11,701 215,863 10 5-15-50-30
Instagram 11,339 144,010 2 10-10-80-0
Reddit 33,434 198,448 2 10-10-80-0
Ele-Photo 48,362 873,793 12 40-15-45-0
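As a minimal illustration of the Train-Val-Test-Out mode above, the split ratios can be turned into disjoint index sets. The sketch below is ours (the function name and fixed seed are assumptions), not the released code:

```python
import random

def split_nodes(num_nodes, ratios, seed=0):
    """Partition shuffled node indices by (train, val, test, out) percentages.
    'Out' nodes are held out from training, validation, and testing entirely."""
    assert sum(ratios) == 100
    idx = list(range(num_nodes))
    random.Random(seed).shuffle(idx)
    splits, start = {}, 0
    for name, pct in zip(("train", "val", "test", "out"), ratios):
        size = num_nodes * pct // 100  # floor; a few remainder nodes may be left over
        splits[name] = idx[start:start + size]
        start += size
    return splits

# WikiCS uses a 5-15-50-30 split over its 11,701 nodes.
splits = split_nodes(11701, (5, 15, 50, 30))
```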

Cora (Sen et al., 2008) dataset comprises 2,708 scientific publications, which are classified into seven categories: Case-based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. Each publication in this citation network either cites or is cited by at least one other publication, forming a total of 5,278 edges. For our study, we utilize the dataset with raw texts provided by TAPE (He et al., 2024a), available in the Cora Dataset repository.

CiteSeer (Giles et al., 1998) dataset contains 3,186 scientific publications, categorized into six classes: Agents, Machine Learning, Information Retrieval, Databases, Human-Computer Interaction, and Artificial Intelligence. The objective is to predict the category of each publication using its title and abstract.

PubMed (Sen et al., 2008) dataset comprises 19,717 scientific publications from the PubMed database related to diabetes. These publications are categorized into three classes: Experimentally Induced Diabetes, Type 1 Diabetes, and Type 2 Diabetes. The associated citation network contains a total of 44,324 links.

WikiCS (Mernyei & Cangea, 2020) dataset is a Wikipedia-based resource developed for benchmarking Graph Neural Networks. It is derived from Wikipedia categories and includes 10 classes representing various branches of computer science, characterized by a high degree of connectivity. The node features are extracted from the text of the associated articles. The raw text for each node is obtained from the WikiCS Dataset repository.

Instagram (Huang et al., 2024) dataset serves as a social network where nodes represent users and edges correspond to following relationships. The classification task involves distinguishing between commercial and normal users within this network.

Reddit (Huang et al., 2024) dataset is a social network where nodes represent users, and node features are derived from the content of users’ historically published subreddits. Edges indicate whether two users have replied to each other. The classification task involves determining whether a user belongs to the top 50% in popularity, based on the average score of all their subreddits. This dataset is built on a public resource (Reddit Dataset) that collected replies and scores from Reddit users. The node text features are generated from each user’s historical post content, limited to their last three posts. Users are categorized as popular or normal based on the median of average historical post scores, with those exceeding the median classified as popular and the rest as normal.
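The median-based labeling just described can be sketched as follows (a toy reconstruction; the function name and example scores are our assumptions):

```python
import statistics

def label_reddit_users(avg_scores):
    """Label users 'popular' when their average historical post score exceeds
    the median of all users' averages, and 'normal' otherwise."""
    median = statistics.median(avg_scores.values())
    return {user: ("popular" if score > median else "normal")
            for user, score in avg_scores.items()}

# Four toy users with average post scores; the median here is 5.0.
labels = label_reddit_users({"u1": 3.0, "u2": 10.0, "u3": 7.0, "u4": 1.0})
```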

Ele-Photo (Yan et al., 2023) dataset is derived from the Amazon-Electronics dataset (Ni et al., 2019). In this dataset, nodes represent electronics-related products, and edges signify frequent co-purchases or co-views between products. Each node is labeled based on a three-level classification scheme for electronics products. User reviews serve as the textual attributes for the nodes; when multiple reviews are available for a product, the review with the highest number of votes is selected. If no such review exists, a random review is used. The task is to classify electronics products into 12 predefined categories.

Appendix B Baselines

This section contains detailed information about baselines:

MLP (Singh & Sachan, 2014) is a simple feed-forward neural network model, commonly used for baseline classification tasks. It consists of multiple layers of neurons, where each layer is fully connected to the previous one. The model is trained via backpropagation, with the final output layer producing predictions.

GCN (Kipf & Welling, 2017) is a graph-based neural network model that performs node classification tasks by aggregating information from neighboring nodes. The model is built on graph convolutional layers, where each node’s embedding is updated by combining the features of its neighbors, enabling it to capture the graph structure effectively.
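To make the neighbor-aggregation idea concrete, here is a minimal NumPy sketch of one GCN propagation step (an illustrative reconstruction of the Kipf & Welling update, not the baseline implementation used in the paper):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: add self-loops, symmetrically normalize the
    adjacency, aggregate neighbor features, then apply a ReLU activation."""
    a_hat = adj + np.eye(adj.shape[0])                      # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))           # D^-1/2
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)         # ReLU

# Toy 3-node path graph, 2-d features, identity weights.
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
h = gcn_layer(adj, np.eye(3, 2), np.eye(2))
```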

GAT (Veličković et al., 2018) introduces attention mechanisms to graph convolutional networks, allowing nodes to weigh their neighbors differently when aggregating features. This attention mechanism helps GAT focus on the most informative neighbors, making it particularly effective in graphs with heterogeneous relationships between nodes.

GCNII (Chen et al., 2020) is an improved version of the GCN model, which integrates higher-order graph convolutions and a skip-connection strategy. This enhancement enables GCNII to better capture deep graph structures and mitigate the over-smoothing problem that arises in deep GCN architectures.

GraphSAGE (Hamilton et al., 2017) is an inductive framework for graph representation learning, where node embeddings are learned by sampling and aggregating features from neighbors. The model can be applied to large-scale graphs by utilizing different aggregation functions, such as mean, pooling, or LSTM-based aggregation.

BERT (Devlin et al., 2019) is a pre-trained transformer-based model that learns contextualized word embeddings by predicting masked words in a sentence. BERT’s bidirectional attention mechanism allows it to capture rich contextual information.

DeBERTa (He et al., 2021) improves upon BERT by introducing disentangled attention and enhanced decoding strategies. These innovations allow DeBERTa to better capture the relationships between different parts of the input text, leading to improved performance on multiple natural language understanding tasks.

RoBERTa (Liu et al., 2019) is an optimized version of BERT that increases training data size and model capacity, removes the Next Sentence Prediction (NSP) objective, and fine-tunes hyperparameters. These modifications lead to improved performance over BERT on many benchmark tasks, especially in natural language understanding.

GLEM (Zhao et al., 2023) is a method for learning on large TAGs. It uses a variational EM framework to alternately update LMs and GNNs, improving scalability and performance in node classification.

SimTeG (Duan et al., 2023) is a straightforward yet effective approach for textual graph learning. It first conducts parameter-efficient fine-tuning (PEFT) of an LM using downstream task labels, then generates node embeddings from the fine-tuned LM. These embeddings are further used by a GNN for training on the same task.

TAPE (He et al., 2024a) is an approach for TAG representation learning. It uses LLMs to generate predictions and explanations, which are then transformed into node features by fine-tuning a smaller LM. These features are used to train a GNN.

ENGINE (Zhu et al., 2024) is an efficient tuning method for integrating LLMs and GNNs on TAGs. It attaches a tunable G-Ladder to each LLM layer to capture structural information, freezing the LLM parameters to reduce training complexity. ENGINE with caching can speed up training by 12x, and ENGINE (Early) uses dynamic early exit, achieving up to 5x faster inference with minimal performance loss.

G2P2 (Wen & Fang, 2023) is a model for low-resource text classification with two main stages. During pre-training, it jointly trains a text encoder and a graph encoder using three graph interaction-based contrastive strategies (text-node, text-summary, and node-summary interactions) to learn a dual-modal embedding space. In downstream classification, it uses prompting: handcrafted discrete prompts for zero-shot classification, and continuous prompts with graph context-based initialization for few-shot classification.

LLMGNN (Chen et al., 2024) is a pipeline for label-free node classification on graphs. It uses LLMs to annotate nodes and GNNs for prediction. It selects nodes according to annotation difficulty, obtains confidence-aware annotations, and post-filters them to improve annotation quality, achieving good results at low cost.

GIANT (Chien et al., 2022) is a self-supervised learning framework for graph-guided numerical node feature extraction. It addresses the graph-agnostic feature extraction issue in standard GNN pipelines. By formulating neighborhood prediction as an XMC problem and using XR-Transformers, it fine-tunes language models with graph information.

GraphAdapter (Huang et al., 2024) is an approach that uses a GNN as an efficient adapter for LLMs to model TAGs. It conducts language-structure pre-training to learn jointly with frozen LLMs, integrating structural and textual information. After pre-training, it can be fine-tuned with prompts for downstream tasks. Experiments show it outperforms baselines on multiple datasets.

Appendix C LLM Prompts

The LLM employed in our study is Meta-Llama-3-8B-Instruct, which is utilized for both text augmentation and structure augmentation tasks. This section provides a comprehensive overview of all the prompts we designed and implemented. Each prompt follows a consistent “Dataset Description + Question” structure, where the dataset description serves to contextualize the query and ensure clarity. The detailed dataset descriptions are:

Cora: Now, here is a paper from the Cora dataset. This paper falls into one of seven categories: Case-based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory.
CiteSeer: Now, here is a paper from the Citeseer dataset. This paper falls into one of six categories: Agents, Machine Learning, Information Retrieval, Databases, Human-Computer Interaction, or Artificial Intelligence.
PubMed: The following is a paper from the PubMed dataset, which contains 19,717 scientific publications related to diabetes. These publications are categorized into three classes: Experimentally Induced Diabetes, Type 1 Diabetes, and Type 2 Diabetes.
WikiCS: Here is an article from the WikiCS dataset. This dataset is a Wikipedia-based resource developed for benchmarking Graph Neural Networks (GNNs). It is derived from Wikipedia categories and includes 10 classes representing various branches of computer science, characterized by a high degree of connectivity. The 10 classes are Computational Linguistics, Databases, Operating Systems, Computer Architecture, Computer Security, Internet Protocols, Computer File Systems, Distributed Computing Architectures, Web Technologies, and Programming Languages.
Instagram: This is a post from Instagram, a social network where edges represent following relationships and nodes represent users. The task is to classify users into two categories: commercial and normal.
Reddit: This is a post from the Reddit dataset, a social network where nodes represent users, and node features are derived from the content of users’ historically published subreddits. Edges represent whether two users have replied to each other. The task is to classify users as belonging to the top 50 percent in popularity, based on the average score of all their subreddits. Node text features are generated from the content of each user’s last three posts. Users are categorized as ’popular’ or ’normal’ based on the median of their average historical post scores, with those above the median classified as ’popular’ and the rest as ’normal’.
Ele-Photo: Here is a product review from the Ele-Photo dataset. The Ele-Photo dataset is derived from the Amazon-Electronics dataset. In this dataset, nodes represent electronics products, and edges indicate frequent co-purchases or co-views between products. Each node is labeled according to a three-level classification scheme for electronics products. User reviews serve as the textual attributes for the nodes; when multiple reviews are available for a product, the review with the highest number of votes is selected. If no such review exists, a random review is used. The task is to classify electronics products into 12 predefined categories. The categories are: Amazon Echo, Camera, Cell Phones, Clothing, Computers, Home and Kitchen, Laptops, Music, Office Supplies, Personal Care, Shoes, Sports and Outdoors.

For different inference scenarios, our approach to querying the LLM is as follows. Due to the uniformity of the query format, the following demonstrations (Key Words, Soft Labels, Summary, and Edge Reconfigure) use the Cora dataset as an example:

Key Words: Please help me identify the five keywords from its title and abstract that are most relevant for classification, and directly output the keywords. The title and abstract of the paper are as follows: <Title><Abstract>
Soft Labels: Based on its title and abstract, please predict the most appropriate label for this paper and provide only the label as your response. The title and abstract of the paper are as follows: <Title><Abstract>
Summary: Please summarize the title and abstract to improve their suitability for the classification task. Output only the summary text, without including any irrelevant content. The title and abstract of the paper are as follows: <Title><Abstract>
Edge Reconfigure: You are provided with the text information of two nodes and their predicted category pseudo-labels. Use this information to evaluate whether an edge should exist between the two nodes, and return a probability value between 0 and 1 representing the likelihood of the edge’s existence. Only output the probability value, without any additional or irrelevant content. As for Node 1: <Title 1><Abstract 1>. Your prediction label is <SoftLabel 1>; As for Node 2: <Title 2><Abstract 2>. Your prediction label is <SoftLabel 2>.
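The "Dataset Description + Question" structure can be assembled programmatically. The sketch below is illustrative (the function name and example title/abstract are our assumptions); it fills the `<Title><Abstract>` placeholders of the Key Words query:

```python
CORA_DESC = ("Now, here is a paper from the Cora dataset. This paper falls into "
             "one of seven categories: Case-based, Genetic Algorithms, Neural "
             "Networks, Probabilistic Methods, Reinforcement Learning, Rule "
             "Learning, and Theory.")

KEY_WORDS_Q = ("Please help me identify the five keywords from its title and "
               "abstract that are most relevant for classification, and directly "
               "output the keywords. The title and abstract of the paper are as "
               "follows: <{title}><{abstract}>")

def build_prompt(dataset_desc, question, **fields):
    """Assemble a query in the 'Dataset Description + Question' format."""
    return dataset_desc + " " + question.format(**fields)

prompt = build_prompt(CORA_DESC, KEY_WORDS_Q,
                      title="A Sample Paper Title",
                      abstract="A sample abstract about neural networks.")
```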

Appendix D Ablation Study and Backbones

This section presents more detailed ablation study results. We conducted ablation studies under all sparsity conditions on the Cora, CiteSeer, and PubMed datasets, as shown in Table 7. Furthermore, we tested the impact of different language-model backbones on node classification accuracy, performing comparative experiments under the same conditions, as shown in Table 6. From the table, it can be observed that under lower data sparsity, using BERT for text encoding yields superior classification accuracy; however, when data sparsity reaches 80%, RoBERTa yields better classification accuracy.

Additionally, we briefly explored the impact of different text propagation strategies on the model’s classification accuracy, with the experimental results presented in Table 8. From the table, it is evident that enhancing the original text data with more effective LLM-augmented texts can significantly improve the performance across all sparse conditions.
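As an illustration of how LLM-augmented texts could be combined with the original text before encoding (the field labels and concatenation order here are our assumptions, not the paper's exact aggregator):

```python
def aggregate_texts(original, summary=None, keywords=None, soft_label=None):
    """Concatenate LLM-generated augmentation texts onto the original node
    text, mirroring the 'OT+...' strategies compared in Table 8."""
    parts = [original]
    if summary:
        parts.append("Summary: " + summary)
    if keywords:
        parts.append("Keywords: " + ", ".join(keywords))
    if soft_label:
        parts.append("Predicted label: " + soft_label)
    return " ".join(parts)

# 'OT+SKWSL'-style input for one toy node.
text = aggregate_texts("A study of GNNs on citation graphs.",
                       summary="Classifies papers with GNNs.",
                       keywords=["GNN", "citation network"],
                       soft_label="Neural Networks")
```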

Table 6: Performance Comparison with Different Text Encoders in the ideal scenario and in sparse scenarios with ratios of 20%, 50%, and 80%. The optimal performance is in bold.
Encoder Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
BERT 90.96±0.45 78.68±0.21 91.99±1.21
DeBERTa 83.39±0.39 76.02±1.45 92.06±1.77
RoBERTa 88.38±1.82 77.74±0.76 92.41±0.30
Sparse Ratio 20% 20% 20%
BERT 88.93±0.74 77.12±0.28 89.38±0.29
DeBERTa 81.55±0.59 72.88±0.61 89.33±0.11
RoBERTa 88.75±0.22 74.92±0.46 89.88±0.21
Sparse Ratio 50% 50% 50%
BERT 83.95±0.80 67.08±0.28 80.76±0.26
DeBERTa 73.99±0.45 62.70±1.08 80.53±0.21
RoBERTa 80.81±0.54 65.05±0.14 80.91±0.29
Sparse Ratio 80% 80% 80%
BERT 57.57±1.38 40.08±0.45 60.70±0.74
DeBERTa 52.21±0.76 39.03±0.64 60.62±0.76
RoBERTa 54.61±1.06 38.87±0.88 61.05±0.49
Table 7: Detailed Performance Comparison of the Ablation Study. 'w/o TA' means without the Text Augmentation module, 'w/o SA' without the Structure Augmentation module, and 'w/o SL' without the Structure Learning module.
Method Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
w/o TA 89.59±1.39 77.37±0.96 91.29±0.90
w/o SA 90.59±0.86 78.37±1.63 91.99±1.75
w/o SL 88.38±0.85 77.59±1.44 86.36±1.23
UltraTAG-S 90.96±0.45 78.68±0.21 92.41±0.30
Sparse Ratio 20% 20% 20%
w/o TA 86.52±0.21 75.12±0.45 88.38±1.50
w/o SA 87.93±1.72 77.01±0.31 89.18±0.58
w/o SL 86.72±1.33 68.34±0.90 85.42±1.34
UltraTAG-S 88.93±0.74 77.12±0.28 89.88±0.21
Sparse Ratio 50% 50% 50%
w/o TA 78.41±0.32 59.94±1.07 52.05±0.45
w/o SA 81.95±1.30 65.08±1.06 78.76±0.22
w/o SL 74.54±0.54 50.63±1.42 69.17±0.71
UltraTAG-S 83.95±0.80 67.08±0.28 80.91±0.29
Sparse Ratio 80% 80% 80%
w/o TA 48.60±1.83 34.29±1.42 44.93±1.22
w/o SA 56.83±1.45 39.50±0.54 57.65±1.21
w/o SL 38.01±1.08 35.71±0.04 43.58±0.45
UltraTAG-S 57.57±1.38 40.08±0.45 61.05±0.49
Table 8: Performance Comparison with Different Augmentation Texts Generated by the LLM and Different Text Aggregation Strategies. 'OT' means original texts, 'Su' summary, 'KW' key words, 'SL' soft labels, 'SKWSL' the combination of summary, key words, and soft labels, and '+' means concatenation.
Texts Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
OT 89.48±0.56 75.24±1.66 90.62±0.62
OT+Su 90.22±1.23 77.59±0.24 91.89±1.19
OT+KW 90.41±0.89 77.59±1.35 91.84±0.94
OT+SL 89.48±1.78 77.74±0.31 91.99±0.29
OT+SKWSL 90.96±0.45 78.68±0.21 92.41±0.30
Sparse Ratio 20% 20% 20%
OT 88.12±0.34 74.92±0.71 87.78±0.74
OT+Su 88.76±0.67 77.12±1.88 89.12±1.08
OT+KW 88.43±1.12 77.27±0.42 89.48±0.95
OT+SL 88.53±0.98 75.86±1.53 89.38±0.27
OT+SKWSL 88.93±0.74 77.12±0.28 89.88±0.21
Sparse Ratio 50% 50% 50%
OT 82.47±0.21 64.89±1.21 79.54±0.39
OT+Su 82.66±0.76 65.52±1.65 80.78±0.61
OT+KW 83.58±0.43 65.83±0.25 80.65±0.90
OT+SL 83.21±0.65 66.46±0.41 80.78±0.28
OT+SKWSL 83.95±0.80 67.08±0.28 80.91±0.29
Sparse Ratio 80% 80% 80%
OT 54.98±0.91 39.34±0.96 60.40±0.77
OT+Su 57.01±1.77 39.66±0.26 60.83±1.07
OT+KW 57.43±0.32 39.97±0.73 60.78±0.60
OT+SL 57.38±0.68 39.81±0.40 60.70±0.82
OT+SKWSL 57.57±1.38 40.08±0.45 61.05±0.49

Appendix E Robustness Comparison

This section reports the performance of various existing methods on node classification tasks across multiple datasets and sparse scenarios. The results for 20% and 50% sparsity ratios are shown in Tables 9 and 10, respectively. As shown in the tables, under 20% and 50% data sparsity, UltraTAG-S still achieves the best node classification performance and robustness among all existing methods, with improvements of up to 4.6% and 15.4%, respectively.
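The sparse scenarios referenced throughout can be reproduced with a simple random-removal routine like the following (our sketch of the described protocol, with an assumed function name and seed, not the authors' exact script):

```python
import random

def sparsify(texts, edges, ratio, seed=0):
    """Simulate a sparse scenario: blank out a `ratio` fraction of node texts
    and drop the same fraction of edges, both uniformly at random."""
    rng = random.Random(seed)
    blanked = set(rng.sample(range(len(texts)), int(len(texts) * ratio)))
    sparse_texts = ["" if i in blanked else t for i, t in enumerate(texts)]
    kept = rng.sample(edges, len(edges) - int(len(edges) * ratio))
    return sparse_texts, kept

# Toy graph: 10 documents on a 9-edge path, sparsified at a 20% ratio.
texts = [f"doc {i}" for i in range(10)]
edges = [(i, i + 1) for i in range(9)]
sparse_texts, kept_edges = sparsify(texts, edges, 0.2)
```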

Table 9: Robustness Comparison in Sparse Scenarios with a Ratio of 20%, i.e., 20% of node texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Ele-Photo
MLP 40.70±3.10 48.37±5.08 59.59±4.25 42.27±4.84 62.18±3.75 54.68±2.05 48.51±2.31
GCN 75.50±3.22 64.86±1.31 75.57±5.08 72.33±3.94 60.77±4.89 52.61±2.05 70.85±4.15
GAT 68.67±2.06 65.64±1.01 72.93±0.92 68.10±0.68 64.90±0.35 57.43±1.24 61.05±0.85
GCNII 71.77±0.88 67.43±0.61 73.75±0.79 68.54±0.84 64.68±0.35 61.08±0.36 58.58±0.84
GraphSAGE 73.80±3.29 60.09±2.06 73.39±5.02 66.50±2.95 56.92±7.35 59.03±1.34 63.06±8.04
BERT 69.37±0.32 63.87±0.14 83.22±0.01 61.35±0.69 63.87±0.11 57.31±0.23 65.48±0.00
DeBERTa 58.21±8.94 53.53±1.96 82.78±0.35 45.93±3.86 61.65±1.07 56.20±3.48 65.34±0.45
RoBERTa 69.88±0.72 65.13±0.41 83.37±0.03 60.98±0.45 64.84±0.17 50.02±0.04 64.89±0.08
GLEM 85.71±2.01 68.71±1.54 82.36±0.37 72.59±3.01 63.37±0.18 55.82±0.21 72.75±1.94
SimTeG 82.34±0.74 70.10±0.60 85.65±0.41 71.33±0.13 62.46±0.62 60.28±0.35 75.46±0.28
TAPE 87.78±0.53 71.11±0.39 87.25±0.75 78.97±0.22 62.17±0.92 61.08±0.66 80.81±0.41
ENGINE 86.27±0.67 73.70±0.33 88.10±0.13 79.43±0.25 65.53±0.22 61.45±0.38 80.34±0.09
UltraTAG-S 88.93±0.74 77.12±0.28 89.88±0.21 81.72±0.18 66.08±0.70 61.61±0.12 83.17±0.06
Table 10: Robustness Comparison in Sparse Scenarios with a Ratio of 50%, i.e., 50% of node texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Ele-Photo
MLP 36.35±2.90 40.38±1.68 49.14±4.41 34.29±3.24 62.65±2.18 52.09±1.25 46.57±1.75
GCN 64.61±2.81 48.71±2.16 63.71±0.91 64.12±9.69 58.96±9.54 55.48±3.03 62.61±3.70
GAT 60.63±1.80 51.50±0.61 66.82±0.80 64.60±0.37 64.51±0.23 56.56±2.00 59.32±0.92
GCNII 62.36±0.87 49.66±0.73 67.01±0.67 63.62±0.41 64.08±0.23 59.27±1.15 53.77±1.58
GraphSAGE 54.76±3.89 47.68±0.84 64.74±5.26 64.56±1.17 62.09±3.04 60.14±0.83 62.61±2.51
BERT 55.58±0.24 48.08±0.07 66.28±0.26 45.21±0.81 63.50±0.09 54.35±0.06 57.41±0.04
DeBERTa 34.41±6.99 43.81±3.29 65.75±0.31 32.91±0.73 63.73±0.39 50.90±1.65 57.39±0.19
RoBERTa 53.09±0.40 44.32±1.02 66.28±0.04 42.73±2.54 63.07±0.18 50.97±1.77 57.25±0.00
GLEM 64.84±1.63 53.43±0.25 70.51±1.95 67.07±3.08 62.43±0.21 53.85±0.08 65.25±1.58
SimTeG 72.06±0.59 58.11±0.40 76.00±0.55 65.34±0.46 61.55±0.79 59.84±0.67 67.76±0.45
TAPE 78.73±0.34 54.31±0.78 77.18±0.46 73.62±0.63 61.40±0.26 60.18±0.19 76.21±0.69
ENGINE 70.85±0.48 56.30±0.12 75.42±0.15 71.72±0.47 64.74±0.05 60.18±0.18 73.40±0.23
UltraTAG-S 83.95±0.80 67.08±0.28 80.91±0.29 77.45±0.33 65.61±0.12 60.34±0.21 79.21±0.06