Toward General and Robust LLM-enhanced Text-attributed Graph Learning

Zihao Zhang    Xunkai Li    Rong-Hua Li    Bing Zhou    Zhenjun Li    Guoren Wang
Abstract

Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from text and edge sparsity, leading to suboptimal performance.

To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified, comprehensive, and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12% and 17.47% in ideal and sparse settings, respectively. Moreover, the performance gain of UltraTAG-S grows as the data sparsity ratio increases, underscoring its effectiveness and robustness.

Machine Learning, ICML

1 Introduction

Figure 1: Performance of different LLM-enhanced TAG learning methods in sparse scenarios. The horizontal axis represents the sparsity ratio of nodes and edges, while the vertical axis denotes classification accuracy. UltraTAG-S exhibits the best robustness.

In recent years, advancements in large language models (LLMs) (Brown et al., 2020) have driven the evolution of graph machine learning, particularly in Text-Attributed Graphs (TAGs) (He et al., 2024a), which combine nodes, edges, and textual data for applications such as social networks and recommendation systems. While graph neural networks (GNNs) (Li et al., 2024a) excel at capturing structural information, they struggle with textual data, necessitating the integration of GNNs and LLMs for TAG learning (Zhu et al., 2024; Duan et al., 2023). Despite progress, existing TAG learning methods still face several limitations:

Figure 2: Overview of UltraTAG for LLM-Enhanced Text-Attributed Graph Learning, which is composed of three independent modules.

Limitation 1: Lack of a unified LLM-enhanced TAG learning framework. Because the current innovation directions of LLM-enhanced TAG learning are disorganized, we recapitulate them from a new perspective: (1) Preprocessing: Data Augmentation (He et al., 2024a; Chen et al., 2024; Wang et al., 2024; Pan et al., 2024), which leverages LLMs to generate enhanced textual representations, such as soft labels, for text augmentation. (2) Feature Engineering: Improved Text Encoder (Chien et al., 2022; Duan et al., 2023), which uses fine-tuned LLMs/LMs to enhance node feature representation. (3) Training: Joint Training Mechanism (Zhao et al., 2023; Zhu et al., 2024; Wen & Fang, 2023; Huang et al., 2024), which enhances performance by improving the interactive training mechanism between GNNs and LMs. However, the diverse optimization strategies and goals hinder unified objectives, slowing systematic progress in TAG learning.

Solution 1: UltraTAG: A Unified Pipeline toward General and Robust LLM-enhanced TAG Learning. To address Limitation 1, we propose UltraTAG, as detailed in Sec. 3. UltraTAG is composed of three key modules: Data Augmentation, Text Encoder, and Training Mechanism, as illustrated in Figure 2. These modules integrate three key directions of LLM-enhanced TAG learning, translating innovative approaches into specific optimization objectives for UltraTAG. UltraTAG is highly adaptable to real-world scenarios, offering a flexible solution for diverse applications. Building on this, we introduce UltraTAG-X, a versatile extension tailored to specific challenges like data sparsity (UltraTAG-S), high noise (UltraTAG-N), or dynamic graphs (UltraTAG-D). This adaptability ensures UltraTAG meets complex demands, setting a new standard for LLM-enhanced TAG learning and providing a foundation for future innovations.

Limitation 2: Lack of a Robust Method. In real-world TAGs, data sparsity in nodes and edges is a common issue. For example, privacy measures (Li et al., 2024b) on social networks may restrict access to users’ information. Developing robust methods that maintain performance under such sparse conditions is a challenge. Current approaches often depend on complete text attributes, making them incompatible with sparse graphs and leading to suboptimal results. Our notion of robustness here is specific to sparsity, i.e., missing node texts or edges; it does not cover data noise such as erroneous node texts, edges, or labels.

Solution 2: UltraTAG-S: An Instance of UltraTAG for Sparse Scenarios. To address Limitation 2, we propose UltraTAG-S, as detailed in Sec. 4. UltraTAG-S is composed of three key modules: (1) LLM-based Robustness Enhancement, (2) LM-based Resilient Representation Learning, and (3) Graph-Enhanced Robust Classifier, as illustrated in Figure 3. To simulate real-world sparse scenarios, we randomly remove node texts and edges from the graph according to a certain ratio. Module 1 of UltraTAG-S employs edge-based text propagation and LLM-based text enhancement to address text sparsity. Module 2 designs a similarity- and PageRank-based important node selector and an LLM-based edge predictor to handle the edge sparsity challenge. Module 3 incorporates a graph structure learning module to further enhance structural robustness.

Our Contributions: (1) A Unified Framework. We adopt a novel perspective to systematically examine existing methods for TAG learning and introduce UltraTAG, a unified and domain-adaptive paradigm that can be extended to UltraTAG-X. (2) A Robust Method. Building on UltraTAG, we propose UltraTAG-S, a robust TAG learning framework designed specifically for sparse scenarios. (3) SOTA Performance. Our proposed UltraTAG-S achieves SOTA performance and the best robustness in evaluations across 7 datasets spanning four distinct domains, not only in ideal but also in sparse scenarios, exhibiting minimal performance degradation, as shown in Figure 1.

2 Related Works

2.1 Graph Learning for Data Sparsity Scenarios

For graph learning in sparse scenarios, existing research primarily focuses on addressing missing node representations, edge absences, or label deficiencies (Rossi et al., 2022; Guo et al., 2023; Zhang et al., 2022). Most of them employ vector completion techniques based on graph propagation or attention to handle these issues. However, there is still a lack of targeted research on sparse scenarios of TAGs.

2.2 Shallow Embedding Methods for TAG Learning

A common approach to TAG learning is shallow embedding, in which text attributes are converted into shallow features such as skip-gram (Mikolov et al., 2013) or BoW (Harris, 1954) vectors, which serve as inputs for graph-based algorithms such as GCN (Kipf & Welling, 2017). While shallow embeddings are simple and efficient, they fail to capture complex semantics and nuanced relationships, limiting their effectiveness.

2.3 LM/LLM-based Methods for TAG Learning

With the rise of LMs like BERT (Devlin et al., 2019), researchers encode textual information in TAGs by fine-tuning LMs on downstream tasks (Zhao et al., 2023; Duan et al., 2023) or aligning LM and GNN parameter spaces via custom loss functions (Wen & Fang, 2023). The emergence of LLMs like GPT-3 (Brown et al., 2020) has further advanced TAG learning, focusing on: (1) text enhancement (e.g., better node descriptions, labels) (He et al., 2024a; Wang et al., 2024; Pan et al., 2024; He et al., 2024b), and (2) superior text encoding for node representations (Zhu et al., 2024; Huang et al., 2024), collectively boosting performance.

Figure 3: Overview of UltraTAG-S for LLM-Enhanced Text-Attributed Graph Learning in Sparse Scenarios.

3 UltraTAG

In this section, we provide a detailed introduction to the three modules of UltraTAG shown in Figure 2: Data Augmentation, Text Encoder, and Training Mechanism.

3.1 Notations

Given a TAG $\mathcal{G}=\{\mathcal{V},\mathcal{T},\mathcal{A},\mathcal{Y}\}$, where $\mathcal{V}$ is the set containing $N$ nodes and $\mathcal{T}$ is the set of texts; for each $i\in\mathcal{V}$, $t_{i}\in\mathcal{T}$ is the text attribute of node $i$. $\mathcal{A}\in\mathbb{R}^{N\times N}$ is the adjacency matrix and $\mathcal{Y}$ is the set of ground-truth labels.

This study focuses on the TAG node classification task. The dataset is split into training nodes $\mathcal{V}_{tr}$ with training labels $\mathcal{Y}_{tr}$ and testing nodes $\mathcal{V}_{te}$ with testing labels $\mathcal{Y}_{te}$. A model $f_{\theta^{*}}$ is trained on $\mathcal{V}_{tr}$ and tested on $\mathcal{V}_{te}$ to generate predictions. The optimization objective is formalized as:

$$f_{\theta^{*}}=\underset{\theta}{\operatorname{argmax}}\ \mathbb{E}_{n\in\mathcal{V}_{tr}}P_{\theta}(\hat{y}_{n}=y_{n}\mid n), \quad (1)$$

where $y_{n}$ is the ground-truth label and $\hat{y}_{n}$ is the model prediction.

3.2 Data Augmentation

TAGs provide raw node texts rather than precomputed representations, making text preprocessing crucial. To enhance text representation, data augmentation from a natural language perspective is effective. Leveraging the capabilities of LLMs, we input $\mathcal{T}$ and generate augmented texts $\mathcal{T}^{\prime}$ using varied prompts:

$$\mathcal{T}^{\prime}=\{t^{\prime}_{i}\mid t^{\prime}_{i}=\text{LLM}(\text{Prompt},t_{i},\alpha_{\text{LLM}}),\ \forall t_{i}\in\mathcal{T}\}. \quad (2)$$

After generating $\mathcal{T}^{\prime}$, the augmented texts are typically aggregated with those of neighboring nodes to produce the final texts $\mathcal{T}^{*}$ as the textual representation:

$$\mathcal{T}^{*}=\{t^{*}_{i}\mid t^{*}_{i}=\text{Agg}(t^{\prime}_{i},\{t^{\prime}_{j}\mid j\in\mathcal{N}_{i}\}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (3)$$

where Agg is the text aggregator (e.g., selection or concatenation) and $\mathcal{N}_{i}$ is the set of neighbors of node $i$.
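As a concrete illustration of Eqs. (2)–(3), the following minimal Python sketch augments each node's text and then aggregates neighbor texts by concatenation. Here `call_llm` is a hypothetical placeholder standing in for a real LLM API call; it is an assumption for illustration, not the paper's actual model.

```python
def call_llm(prompt: str, text: str) -> str:
    # Hypothetical stand-in for LLM(Prompt, t_i, alpha_LLM) in Eq. (2);
    # a real system would query an actual LLM here.
    return f"{prompt}: {text}"

def augment_texts(texts, prompt="Summarize"):
    # T' = { t'_i = LLM(Prompt, t_i) }
    return [call_llm(prompt, t) for t in texts]

def aggregate(texts_aug, neighbors):
    # t*_i = Agg(t'_i, {t'_j : j in N_i}); Agg here is simple concatenation.
    return [
        " [SEP] ".join([texts_aug[i]] + [texts_aug[j] for j in neighbors[i]])
        for i in range(len(texts_aug))
    ]

texts = ["GNNs pass messages.", "LLMs encode text."]
neighbors = {0: [1], 1: [0]}          # N_i derived from the adjacency matrix
t_aug = augment_texts(texts)          # Eq. (2)
t_star = aggregate(t_aug, neighbors)  # Eq. (3)
```

Swapping `call_llm` for a genuine LLM call and `aggregate` for a selection-based Agg recovers the general pipeline.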

3.3 Text Encoder

The given node texts $\mathcal{T}^{*}$ must be encoded into embeddings to facilitate subsequent model processing, which can be efficiently accomplished using LMs or LLMs.

LMs as Encoder. Text encoding typically employs LMs like BERT (Devlin et al., 2019). Fine-tuning LMs on downstream tasks enhances their task-specific encoding capability. For $t_{i}^{*}\in\mathcal{T}^{*}$, this process can be described as:

$$h_{i}=\text{LM}(t_{i}^{*},\theta_{\text{LM}})\in\mathbb{R}^{d},\ \forall t_{i}^{*}\in\mathcal{T}^{*}, \quad (4)$$
$$\theta^{*}_{\text{LM}}=\arg\min_{\theta,\theta_{\text{LM}}}\sum_{i=1}^{N}\text{CE}(\text{MLP}(h_{i};\theta,\theta_{\text{LM}}),y_{i}), \quad (5)$$

where $h_{i}$ is the output of the LM, $d$ is the dimension of the representation, $\theta$ denotes the parameters of the linear classifier, and CE is the cross-entropy loss function.

Meanwhile, various downstream tasks can be used to fine-tune LMs, such as node classification (He et al., 2024a; Duan et al., 2023) or other tasks (Chien et al., 2022).
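The encode-then-classify structure of Eqs. (4)–(5) can be sketched in miniature. Here `lm_encode` is a deterministic pseudo-embedding standing in for a frozen LM (an assumption for illustration; a real implementation would fine-tune a model such as BERT), and only the linear head $\theta$ is trained with cross-entropy, whereas full fine-tuning would also update $\theta_{\text{LM}}$.

```python
import math
import random

def lm_encode(text, dim=8):
    # Hypothetical stand-in for LM(t_i^*, theta_LM) in Eq. (4):
    # a deterministic pseudo-embedding seeded by the text.
    rng = random.Random(sum(ord(c) for c in text))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def train_head(samples, classes=2, dim=8, lr=0.1, epochs=100):
    # Minimizes sum_i CE(MLP(h_i; theta), y_i), as in Eq. (5); only the
    # linear head W is updated here for brevity.
    W = [[0.0] * dim for _ in range(classes)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for h, y in samples:
            p = softmax([sum(w * x for w, x in zip(row, h)) for row in W])
            total += -math.log(p[y] + 1e-12)
            for c in range(classes):  # gradient of CE w.r.t. row c of W
                g = p[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    W[c][d] -= lr * g * h[d]
        losses.append(total)
    return W, losses

samples = [(lm_encode("graph neural networks"), 0),
           (lm_encode("large language models"), 1)]
_, losses = train_head(samples)  # training loss decreases over epochs
```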

LLMs as Encoder. Leveraging LLMs' language understanding capabilities, their features from different layers capture varying abstraction levels with versatile representations (Zhu et al., 2024). Specifically, for $t_{i}^{*}\in\mathcal{T}^{*}$, we can get $h_{i}^{1},h_{i}^{2},\ldots,h_{i}^{l}$ from different LLM layers:

$$h_{i}^{1},h_{i}^{2},\ldots,h_{i}^{l}=\text{LLM}(t_{i}^{*},\theta_{\text{LLM}})\in\mathbb{R}^{d},\ \forall t_{i}^{*}\in\mathcal{T}^{*}, \quad (6)$$

where $h_{i}^{j},\ j\in[1,l]$ denotes the output of LLM layer $j$ for node $i$ and $l$ is the number of LLM layers selected.

LLM’s interlayer features with multilevel text representations can also be integrated into GNN modules for message passing, optimizing only GNN parameters:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\sum_{j=1}^{l}\text{CE}(\mathcal{M}(h_{i}^{j},\mathcal{A};\theta),y_{i}), \quad (7)$$

where $\mathcal{M}$ denotes the message-passing module of the GNN, $\mathcal{A}$ is the adjacency matrix, and CE is the cross-entropy loss.
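A toy version of the feature flow in Eqs. (6)–(7): per-layer LLM features (simulated here with fixed vectors, an illustrative assumption) are each passed through one mean-aggregation message-passing step playing the role of $\mathcal{M}$.

```python
def message_pass(H, adj):
    # One mean-aggregation step M(h, A): h_i <- mean over {h_i} ∪ {h_j : A_ij = 1}
    n = len(H)
    return [
        [sum(H[j][d] for j in range(n) if adj[i][j] or j == i) /
         (sum(adj[i]) + 1) for d in range(len(H[i]))]
        for i in range(n)
    ]

# Simulated per-layer LLM features h_i^j for 2 nodes and l = 2 layers (Eq. 6).
layer_feats = {1: [[1.0, 0.0], [0.0, 1.0]],   # layer-1 embeddings
               2: [[2.0, 0.0], [0.0, 2.0]]}   # layer-2 embeddings
adj = [[0, 1], [1, 0]]

# Each layer's features are propagated independently, mirroring the sum
# over j in Eq. (7) before the losses are combined.
propagated = {j: message_pass(H, adj) for j, H in layer_feats.items()}
```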

3.4 Training Mechanism

After obtaining the nodes' textual representations $\mathcal{H}=\{h_{1},h_{2},\ldots,h_{N}\}$ from the LM or LLM, feeding them together with the adjacency matrix $\mathcal{A}$ into a GNN yields the final prediction. In terms of training mechanisms, we can use a simple GNN module or combine the GNN with the LM for joint training.

Simple GNN. A simple GNN produces final predictions through downstream task training and inference:

$$\theta^{*}=\arg\min_{\theta}\sum_{i=1}^{N}\text{CE}(\text{GNN}(h_{i},\mathcal{A};\theta),y_{i}), \quad (8)$$
$$\text{Output}=\text{GNN}\left(\mathcal{H}=\{h_{1},h_{2},\ldots,h_{N}\},\mathcal{A};\theta^{*}\right), \quad (9)$$

where $\theta$ represents the parameters of the GNN, CE is the cross-entropy loss function, and $h_{i}\in\mathcal{H}$ is the representation of node $i$.
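To make Eqs. (8)–(9) concrete, the sketch below reduces the "GNN" to one mean-aggregation step followed by a nearest-class-mean readout: the fitted per-class means play the role of $\theta^{*}$. This is a deliberately simplified stand-in for an actual trained GNN, not the paper's architecture.

```python
def mean_agg(H, adj):
    # One mean-aggregation step over each node's closed neighborhood.
    n = len(H)
    return [
        [sum(H[j][d] for j in range(n) if adj[i][j] or j == i) /
         (sum(adj[i]) + 1) for d in range(len(H[i]))]
        for i in range(n)
    ]

def fit(Z, train):
    # "Training" (Eq. 8): theta* = per-class mean of training-node embeddings.
    classes = sorted(set(train.values()))
    return [
        [sum(Z[i][d] for i, y in train.items() if y == c) /
         sum(1 for y in train.values() if y == c) for d in range(len(Z[0]))]
        for c in classes
    ]

def predict(Z, theta):
    # Inference (Eq. 9): assign every node to its nearest class mean.
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(theta)), key=lambda c: dist(z, theta[c])) for z in Z]

H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]   # node embeddings h_i
adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
Z = mean_agg(H, adj)
theta_star = fit(Z, {0: 0, 2: 1})   # labels known only for nodes 0 and 2
preds = predict(Z, theta_star)      # predictions for all nodes
```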

GNN with LM. For combined GNN and LM training, the pseudo-labels $\mathcal{Y}_{\text{G}}$ generated by the GNN guide LM training, the pseudo-labels $\mathcal{Y}_{\text{L}}$ generated by the LM guide GNN training, and the cycle repeats:

$$\mathcal{Y}_{\text{G}}=\text{GNN}(\mathcal{H},\mathcal{A};\theta_{\text{G}}),\quad\mathcal{Y}_{\text{L}}=\text{LM}(\mathcal{T};\theta_{\text{L}}), \quad (10)$$
$$\theta^{*}_{\text{G}}=\arg\min_{\theta_{\text{G}}}\sum_{i=1}^{N}\text{CE}(\text{GNN}(h_{i},\mathcal{A};\theta_{\text{G}}),\,y_{i}\in\mathcal{Y}_{\text{L}}), \quad (11)$$
$$\theta^{*}_{\text{L}}=\arg\min_{\theta_{\text{L}}}\sum_{i=1}^{N}\text{CE}(\text{LM}(t_{i};\theta_{\text{L}}),\,y_{i}\in\mathcal{Y}_{\text{G}}). \quad (12)$$

By iteratively training the LM and GNN in this way, the capabilities of both models can be enhanced simultaneously.
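Structurally, the alternating loop of Eqs. (10)–(12) looks like the following sketch. Here `fit` is a hypothetical trainer that simply memorizes the pseudo-labels it is given, a placeholder for actual GNN/LM optimization; the point is the order in which pseudo-labels are exchanged.

```python
def fit(inputs, labels):
    # Hypothetical trainer: returns a "model" that memorizes input -> label.
    table = dict(zip(inputs, labels))
    return lambda x: table[x]

def pseudo_labels(model, inputs):
    return [model(x) for x in inputs]

nodes = ["v0", "v1", "v2"]
gnn = fit(nodes, [0, 1, 0])            # initial GNN, trained on seed labels

for _ in range(2):                     # the cycle of Eqs. (10)-(12)
    y_g = pseudo_labels(gnn, nodes)    # Y_G: pseudo-labels from the GNN (Eq. 10)
    lm = fit(nodes, y_g)               # LM trained on Y_G (Eq. 12)
    y_l = pseudo_labels(lm, nodes)     # Y_L: pseudo-labels from the LM (Eq. 10)
    gnn = fit(nodes, y_l)              # GNN trained on Y_L (Eq. 11)
```

With memorizing trainers the two models agree after one round; with real models, each round refines both parameter sets.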

4 UltraTAG-S

To address the challenge of data sparsity in real-world TAGs, we propose UltraTAG-S, as shown in Figure 3, which is composed of three modules: LLM-based Robustness Enhancement, LM-based Resilient Representation Learning, and Graph-Enhanced Robust Classifier. These modules incorporate designs tailored to sparse scenarios, such as Text Propagation, Structure Augmentation, and Edge Reconfigurator. Note that in this study, we simulate real-world sparse scenarios by deleting node texts and edges in a certain proportion.

4.1 LLM-based Robustness Enhancement

The Data Augmentation module can be divided into Text Propagation, Text Augmentation, and Structure Augmentation.

Text Propagation. Leveraging the homophily principle in graph theory, we posit that adjacent nodes exhibit textual similarity. Inspired by the message-passing mechanism in GNNs, we propagate textual information from neighboring nodes to reconstruct missing text attributes. Additionally, this propagation method enriches the textual representation of normal nodes, serving as a complementary enhancement.

Specifically, for node $v_{i}\in\mathcal{V}$, $t_{i}\in\mathcal{T}$ is its text and $\mathcal{N}_{i}$ is the set of its neighbors; the propagated texts $\mathcal{T}^{\prime}$ are obtained by:

$$\mathcal{T}^{\prime}=\{t^{\prime}_{i}\mid t^{\prime}_{i}=t_{i}\oplus\{t_{j}\mid j\in\mathcal{N}_{i}\},\ \forall t_{i}\in\mathcal{T}\}, \quad (13)$$

where $\oplus$ denotes concatenation of textual information.
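A minimal sketch of Eq. (13): each node's (possibly missing) text is concatenated with its neighbors' texts, so a node whose text was deleted recovers a surrogate from its neighborhood, while nodes with intact text are enriched.

```python
def propagate_texts(texts, neighbors):
    # Eq. (13): t'_i = t_i ⊕ {t_j : j in N_i}; ⊕ is plain concatenation here.
    out = []
    for i, t in enumerate(texts):
        neigh = " ".join(texts[j] for j in neighbors[i] if texts[j])
        own = t if t else ""            # missing text -> rebuilt from neighbors
        out.append((own + " " + neigh).strip())
    return out

texts = ["node A text", "", "node C text"]   # node 1's text was deleted
neighbors = {0: [1], 1: [0, 2], 2: [1]}
filled = propagate_texts(texts, neighbors)   # node 1 inherits its neighbors' texts
```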

Text Augmentation. Leveraging the advanced language comprehension of LLMs, we use prompt engineering to extract critical textual information and enrich data representations. Specifically, we develop a suite of tailored prompts that guide the LLM to generate diverse and contextually relevant key texts, including summaries, key words, and soft labels, thereby enhancing robustness and informativeness.

Specifically, for the propagated texts $\mathcal{T}^{\prime}$, we obtain the augmented texts $\mathcal{T}^{*}$ by LLM inference with different prompts:

$$\mathcal{T}_{\text{Su}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{Su}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (14)$$
$$\mathcal{T}_{\text{KW}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{KW}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (15)$$
$$\mathcal{Y}_{\text{SL}}=\{t_{i}^{\prime\prime}\mid t_{i}^{\prime\prime}=\text{LLM}(\mathcal{P}_{\text{SL}},t_{i}^{\prime},\theta_{\text{LLM}}),\ \forall t_{i}^{\prime}\in\mathcal{T}^{\prime}\}, \quad (16)$$
$$\mathcal{T}^{*}=\text{AGG}(\mathcal{T}^{\prime},\mathcal{T}_{\text{Su}},\mathcal{T}_{\text{KW}},\mathcal{Y}_{\text{SL}}), \quad (17)$$

where AGG denotes the text aggregation module (concatenation or selection); $\mathcal{T}_{\text{Su}}$, $\mathcal{T}_{\text{KW}}$, and $\mathcal{Y}_{\text{SL}}$ are the summaries, keywords, and soft labels generated by LLMs, respectively; and $\mathcal{P}_{\text{Su}}$, $\mathcal{P}_{\text{KW}}$, $\mathcal{P}_{\text{SL}}$ are the corresponding tailored LLM prompts.
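The augmentation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm` is a stand-in for the actual LLM call with a tailored prompt, and the AGG module is realized as simple concatenation.

```python
def augment_texts(texts, llm, prompts):
    """Hypothetical sketch of Text Augmentation (Eqs. 14-17).

    `llm(prompt, text)` stands in for the LLM call; `prompts` maps each
    view (summary / keywords / soft label) to its tailored prompt.
    """
    summaries = [llm(prompts["summary"], t) for t in texts]
    keywords = [llm(prompts["keywords"], t) for t in texts]
    soft_labels = [llm(prompts["soft_label"], t) for t in texts]
    # AGG via concatenation: join each original text with all generated views.
    return [" ".join(parts) for parts in zip(texts, summaries, keywords, soft_labels)]

# Toy stand-in LLM that tags its output with the prompt name.
toy_llm = lambda prompt, text: f"[{prompt}] {text[:20]}"
aug = augment_texts(["Graph neural networks learn on graphs."], toy_llm,
                    {"summary": "Su", "keywords": "KW", "soft_label": "SL"})
```

In the selection variant of AGG, one would keep only a subset of the generated views instead of concatenating all of them.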

Structure Augmentation. To mitigate edge sparsity, we introduce a Structure Augmentation module composed of a Virtual Edge Generator, a Node Selector, and an Edge Reconfigurator. This module leverages LLMs to re-identify edges for selected nodes, thereby optimizing the graph structure.

a. Virtual Edge Generator. To preserve the integrity of the graph structure before node selection, we use the soft labels $\mathcal{Y}_{\text{SL}}$ generated by LLMs and compute the embedding similarity between nodes sharing the same soft label:

h_i = \text{LM}(t_i^{*}, \alpha_{\text{LM}}), \quad h_j = \text{LM}(t_j^{*}, \alpha_{\text{LM}}), \quad (18)
\mathcal{S}_{ij} = \cos(h_i, h_j), \quad \forall\, y_i = y_j \ \text{and}\ y_i, y_j \in \mathcal{Y}_{\text{SL}}. \quad (19)

The adjacency matrix with virtual edges is updated to $\mathcal{A}'$:

\mathcal{A}'_{ij} = \begin{cases} 1, & \text{if } \mathcal{A}_{ij} = 1 \ \text{or}\ \mathcal{S}_{ij} > \tau_{1}, \\ 0, & \text{otherwise}, \end{cases} \quad (20)

where $\mathcal{A}$ denotes the adjacency matrix after the sparsification process, and $\tau_{1}$ denotes the similarity threshold for adding edges.
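A minimal NumPy sketch of Eqs. 18–20, under the assumption that LM embeddings and LLM soft labels are already available as arrays (the function name and toy inputs are illustrative, not from the paper):

```python
import numpy as np

def add_virtual_edges(A, H, soft_labels, tau1=0.8):
    """Sketch of the Virtual Edge Generator: connect node pairs that
    share a soft label and whose LM embeddings are cosine-similar.

    A: (N, N) 0/1 adjacency after sparsification.
    H: (N, d) LM embeddings.  soft_labels: (N,) LLM soft labels.
    """
    Hn = H / np.linalg.norm(H, axis=1, keepdims=True)   # unit-normalize rows
    S = Hn @ Hn.T                                       # pairwise cosine similarity
    same_label = np.equal.outer(soft_labels, soft_labels)
    A_prime = ((A == 1) | ((S > tau1) & same_label)).astype(int)
    np.fill_diagonal(A_prime, 0)                        # no self-loops
    return A_prime

# Toy example: nodes 0 and 1 share a soft label and point the same way.
A_sparse = np.zeros((3, 3), dtype=int)
H = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]])
labels = np.array([0, 0, 1])
A_virtual = add_virtual_edges(A_sparse, H, labels, tau1=0.9)
```

Here node pair (0, 1) gains a virtual edge, while node 2 stays isolated because its soft label differs.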

b. Node Selector. Since re-judging all edges is impractical, we design a node selector to pick the set of important nodes $\mathcal{V}_{c}$. We compute the PageRank score of each node in $\mathcal{V}$ and use it as the importance score:

\text{Score}(v_i) = \text{PageRank}(v_i, \mathcal{A}'), \quad (21)
\mathcal{V}_{c} = \{\, v_i \mid \text{Score}(v_i) > \text{Score}(v_k) \,\}, \quad (22)

where $v_k$ denotes the node with the $k$-th largest importance score computed by the PageRank algorithm on the graph with virtual edges.
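The selector of Eqs. 21–22 can be sketched with a plain power-iteration PageRank; this is an illustrative implementation under standard assumptions (damping 0.85, uniform handling of dangling nodes), not the paper's code:

```python
import numpy as np

def select_top_nodes(A_prime, k, damping=0.85, n_iter=100):
    """Sketch of the Node Selector: PageRank on the virtual-edge
    graph A', keeping the k highest-scoring nodes as V_c."""
    N = A_prime.shape[0]
    deg = A_prime.sum(axis=1)
    # Column-stochastic transition matrix; dangling nodes spread uniformly.
    P = np.where(deg > 0, A_prime.T / np.maximum(deg, 1), 1.0 / N)
    score = np.full(N, 1.0 / N)
    for _ in range(n_iter):
        score = (1 - damping) / N + damping * P @ score
    return set(np.argsort(-score)[:k])

# Toy star graph: the hub (node 0) should be selected first.
A_star_graph = np.array([[0, 1, 1, 1],
                         [1, 0, 0, 0],
                         [1, 0, 0, 0],
                         [1, 0, 0, 0]])
top = select_top_nodes(A_star_graph, k=1)
```

In the paper's setting, $k$ is a fixed fraction of the training nodes (10% in the experiments of Section 5.4).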

c. Edge Reconfigurator. For each edge in the complete graph over $\mathcal{V}_{c}$, we use the LLM to re-determine its existence, yielding a confidence score $\mathcal{C}_{ij}$ for edge $e_{ij}$:

\mathcal{C}_{ij} = \text{LLM}(\mathcal{P}_{\text{edge}}, t_i^{*}, t_j^{*}, \alpha_{\text{LLM}}), \quad \forall v_i, v_j \in \mathcal{V}_{c}. \quad (23)

The updated adjacency matrix $\mathcal{A}^{*}$ can be expressed as:

\mathcal{A}^{*}_{ij} = \begin{cases} \mathcal{A}_{ij}, & \text{if } v_i \notin \mathcal{V}_{c} \ \text{or}\ v_j \notin \mathcal{V}_{c}, \\ 1, & \text{if } \mathcal{C}_{ij} > \tau_{2}, \\ 0, & \text{otherwise}, \end{cases} \quad (24)

where $\tau_{2}$ is the confidence threshold for reconfiguration.
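A hedged sketch of Eqs. 23–24: `llm_conf` is a hypothetical stand-in for the LLM scoring call of Eq. 23 (the real system prompts an LLM with the two node texts), and only pairs inside $\mathcal{V}_c$ are rewritten:

```python
import itertools

def reconfigure_edges(A_prime, selected, texts, llm_conf, tau2=0.5):
    """Sketch of the Edge Reconfigurator: re-judge every pair inside
    V_c with a confidence score; edges with an endpoint outside V_c
    are left untouched."""
    A_star = [row[:] for row in A_prime]              # copy the adjacency
    for i, j in itertools.combinations(sorted(selected), 2):
        A_star[i][j] = A_star[j][i] = int(llm_conf(texts[i], texts[j]) > tau2)
    return A_star

# Toy confidence: "connected" iff the two texts share a word.
conf = lambda a, b: 1.0 if set(a.split()) & set(b.split()) else 0.0
A_star = reconfigure_edges([[0, 0, 1], [0, 0, 0], [1, 0, 0]],
                           selected={0, 1},
                           texts=["graph learning", "graph models", "vision"],
                           llm_conf=conf)
```

Node pair (0, 1) gains an edge from the confidence check, while the edge (0, 2), which has an endpoint outside $\mathcal{V}_c$, is preserved as-is.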

4.2 LM-based Resilient Representation Learning

After augmenting the graph $\mathcal{G}^{*} = \{\mathcal{V}, \mathcal{T}^{*}, \mathcal{A}^{*}, \mathcal{Y}\}$, we fine-tune the language model $\text{LM}_{\theta}$ for node classification:

\hat{y}_i = \text{softmax}(W \cdot \text{LM}(t_i^{*}, \theta) + b), \quad \forall\, t_i^{*} \in \mathcal{T}^{*}, \quad (25)

where $W$ is the weight matrix and $b$ is the bias term.

After fine-tuning with the following negative log-likelihood loss $\mathcal{L}$, the node representations $\mathcal{H}$ are computed by:

h_i = \text{LM}(t_i^{*}, \theta^{*}), \quad \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}, \quad (26)

where $N$ and $K$ are the numbers of training nodes and classes, $y_{i,k}$ is the ground-truth label, and $\hat{y}_{i,k}$ is the model prediction.
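The loss of Eq. 26 can be sketched directly in NumPy, assuming the classification head of Eq. 25 has already produced per-node logits (a generic cross-entropy computation, not tied to any particular LM):

```python
import numpy as np

def nll_loss(logits, labels):
    """Sketch of Eqs. 25-26: a softmax head over LM outputs and the
    negative log-likelihood averaged over training nodes."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Confident, correct predictions give a near-zero loss.
loss = nll_loss(np.array([[10.0, 0.0], [0.0, 10.0]]), np.array([0, 1]))
```

In practice this loss is minimized with a gradient-based optimizer while fine-tuning $\theta$; the fine-tuned $\theta^{*}$ then produces the representations $\mathcal{H}$.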

4.3 Graph-Enhanced Robust Classifier

We employ a dual-GNN framework to tackle edge sparsity: one GNN refines the graph structure via representation similarity, while the other performs node classification.

Specifically, given the node representations $\mathcal{H} \in \mathbb{R}^{N \times d}$ and the structure representation $\mathcal{A}^{*} \in \mathbb{R}^{N \times N}$, we first compute the similarity matrix of the node representations:

\mathbf{S} = \text{Norm}(\mathcal{H}^{(1)} \cdot \mathcal{H}^{(1)\top}), \quad \mathcal{H}^{(1)} = \text{GNN}_{1}(\mathcal{H}, \mathcal{A}^{*}). \quad (27)

Then, we update the adjacency matrix $\mathcal{A}^{*}$ while preserving the LLM's judgments unaltered:

\tilde{\mathcal{A}}^{*}_{ij} = \begin{cases} \mathcal{A}^{*}_{ij} + \mathbf{S}_{ij}, & \text{if } v_i \notin \mathcal{V}_{c} \ \text{or}\ v_j \notin \mathcal{V}_{c}, \\ \mathcal{A}^{*}_{ij}, & \text{otherwise}. \end{cases} \quad (28)

We use the updated matrix as the input of $\text{GNN}_{2}(\cdot)$ and jointly optimize $\text{GNN}_{1}(\cdot)$ and $\text{GNN}_{2}(\cdot)$ with the cross-entropy loss:

\mathcal{H}^{(2)} = \text{GNN}_{2}(\mathcal{H}, \tilde{\mathcal{A}}^{*}), \quad \mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k} \log \hat{y}_{i,k}, \quad (29)

where $N$ is the number of nodes and $K$ is the number of classes.
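The structure update of Eqs. 27–28 can be sketched as follows, assuming $\mathcal{H}^{(1)}$ has already been produced by $\text{GNN}_1$ and taking min–max scaling as one plausible choice of $\text{Norm}(\cdot)$ (the paper does not pin this down, so treat it as an assumption):

```python
import numpy as np

def refine_adjacency(H1, A_star, selected):
    """Sketch of Eqs. 27-28: add a normalized similarity term
    S = Norm(H1 H1^T) to entries with an endpoint outside V_c,
    while entries judged by the LLM (both endpoints in V_c) stay fixed."""
    S = H1 @ H1.T
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)   # min-max Norm(.)
    in_c = np.zeros(A_star.shape[0], dtype=bool)
    in_c[list(selected)] = True
    llm_fixed = np.logical_and.outer(in_c, in_c)      # both endpoints in V_c
    return np.where(llm_fixed, A_star, A_star + S)

# Toy example: the (0, 1) entry is LLM-fixed; (1, 2) gains soft weight.
H1 = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
A_star = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
A_tilde = refine_adjacency(H1, A_star, selected={0, 1})
```

The resulting $\tilde{\mathcal{A}}^{*}$ is a weighted adjacency matrix, which $\text{GNN}_2$ consumes directly for classification.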

Table 1: The Comparison of Different LLM-enhanced TAG Learning Methods under UltraTAG. The top four methods use LMs, while the bottom five use LLMs. 'XMC' is eXtreme Multi-label Classification, 'Iteration' is iterative training with LM and GNN using pseudo labels, 'Joint' means joint training with multiple GNNs, 'Combined Loss' means using a custom loss function, 'BoW' is bag of words.

Method Data Augmentation Text Encoder Encoder Supervision Training Mechanism
GLEM DeBERTa Node Classification Iteration
GIANT BERT XMC Only GNN
G2P2 RoBERTa Node Classification Combined Loss
SimTeG e5-large/RoBERTa Node Classification Only GNN
TAPE DeBERTa Node Classification Only GNN
ENGINE LLaMA2-7B / Joint
LLMGNN BoW / Only GNN
GraphAdapter LLaMA2-13B Token Prediction Only GNN
UltraTAG-S BERT Node Classification Joint
Table 2: The Comparison of Different LLM-enhanced TAG Learning Methods for Sparse Scenarios Robustness. ’Input’, ’Node’, ’Edge’, ’Training’ denote the consideration of input robustness, node robustness, edge robustness and training robustness.
Robustness Input Node Edge Training
GLEM
GIANT
G2P2
SimTeG
TAPE
ENGINE
LLMGNN
GraphAdapter
UltraTAG-S

5 Experiments

In this section, we analyze the effectiveness of UltraTAG-S through experimental evaluation. To comprehensively assess our approach, we address the following questions: Q1: What are the differences between existing TAG learning methods under UltraTAG? Q2: How does UltraTAG-S perform as a general and robust TAG learning paradigm in ideal and sparse scenarios? Q3: What factors contribute to the performance and robustness of UltraTAG-S? Q4: What are the training time complexity and hyperparameter settings of UltraTAG-S? Details of the datasets and baselines are in Appendices A and B.

5.1 Paradigm Comparison

In this section, we compare the similarities and differences of current LLM-enhanced TAG learning methods under the UltraTAG framework along four aspects, namely Data Augmentation, Text Encoder, Encoder Supervision, and Training Mechanism, as shown in Table 1. LM-based methods (Zhao et al., 2023; Chien et al., 2022; Wen & Fang, 2023; Duan et al., 2023) utilize distinct language models and fine-tuning tasks, while LLM-based methods (He et al., 2024a; Chen et al., 2024) focus on data augmentation. Moreover, new baselines can be formed by optimizing any single one of UltraTAG's four modules.

We also compare the sparse-scenario robustness of different LLM-enhanced TAG learning methods along four dimensions, namely Input Robustness, Node Robustness, Edge Robustness, and Training Robustness, i.e., whether each method considers the robustness of data and performance along these dimensions. As shown in Table 2, only UltraTAG-S (Ours) accounts for robustness across all four dimensions.

5.2 Performance and Robustness Comparison

Figure 4: Robustness Comparison in Sparse Scenarios. The horizontal coordinate represents the sparse ratio of nodes and edges, and the vertical coordinate represents the accuracy of the node classification task.
Table 3: Experimental results of node classification. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Elo-Photo
MLP 54.94±3.68 61.91±1.67 52.13±7.35 62.46±0.70 51.85±10.78 52.91±1.81 47.45±9.18
GCN 74.91±8.71 69.00±2.83 72.54±6.95 73.27±4.62 63.66±1.11 55.00±4.34 69.49±3.93
GAT 71.70±2.75 70.31±1.01 75.86±1.08 68.72±0.90 64.80±0.22 60.36±0.25 64.22±2.58
GCNII 77.23±0.66 71.91±1.05 73.28±1.67 70.12±1.81 65.07±0.59 62.78±0.49 60.60±1.36
GraphSAGE 81.70±1.00 66.68±0.80 68.41±9.59 75.16±0.33 59.65±5.78 53.59±2.24 70.48±6.03
BERT 79.70±0.32 76.88±0.41 90.95±0.11 71.70±1.09 63.50±0.09 58.78±0.05 70.01±0.08
DeBERTa 73.39±4.54 75.16±1.08 90.81±0.20 68.18±4.10 62.40±0.59 59.92±0.45 70.18±0.18
RoBERTa 80.35±0.48 77.04±1.49 91.13±0.11 72.12±0.70 64.67±0.34 59.23±0.06 70.25±0.34
GLEM 87.07±1.01 76.30±2.45 89.56±1.65 74.83±0.95 65.90±0.36 60.88±0.03 77.74±0.27
SimTeG 88.75±0.42 77.37±0.64 88.31±0.75 76.32±0.53 64.29±0.19 61.60±0.88 79.82±0.21
TAPE 89.07±0.56 77.02±0.71 90.38±0.99 80.17±0.18 65.44±0.35 63.01±0.82 82.26±0.64
ENGINE 86.79±0.58 78.03±0.48 91.43±0.13 81.38±0.38 66.27±0.41 62.57±0.13 83.06±0.22
UltraTAG-S 90.96±0.45 78.68±0.21 92.41±0.30 83.05±0.16 66.69±0.14 63.78±0.30 84.70±0.03
Figure 5: Robustness Comparison among All Datasets at Sparse Ratios of 20%, 50%, and 80%.

We conduct a comprehensive evaluation of UltraTAG-S by comparing it with GNN-only, LM-only, and LLM-GNN methods; the results are shown in Table 3. Since GNN-only methods cannot accept texts as input, for a fair comparison we encode the texts with a unified BERT (Devlin et al., 2019) to obtain unified representations as their input. As Table 3 shows, UltraTAG-S outperforms all existing methods on every dataset, with improvements of up to 2.21%.

To simulate the challenges of sparse scenarios, we randomly delete the texts and edges of nodes at ratios of 20%, 50%, and 80%, without considering data noise or additional constraints. As illustrated in Figure 4, our proposed method, UltraTAG-S, demonstrates the best robustness among current TAG learning baselines. Specifically, it exhibits the smallest decline in classification accuracy under sparse scenarios, reflecting superior adaptability.

Detailed results at the sparsity ratio of 80% are shown in Table 4. Even in this extremely sparse setting, UltraTAG-S achieves SOTA node classification accuracy, with performance gains of up to 17.5%. Details for sparsity ratios of 20% and 50% are given in Appendix E, Tables 9 and 10.

As shown in Figure 5, UltraTAG-S consistently achieves the highest accuracy across all datasets under varying sparsity levels, demonstrating optimal robustness. Moreover, its relative advantage grows as data sparsity increases, highlighting its effectiveness under extreme sparsity.

Table 4: Robustness Comparison in Sparse Scenarios with Ratio of 80%, meaning that 80% of nodes' texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Elo-Photo
MLP 30.41±0.59 27.74±0.59 44.25±1.68 26.64±1.17 62.54±1.02 50.93±0.41 46.21±0.61
GCN 41.96±0.73 30.53±0.68 49.68±1.31 54.24±2.29 63.20±0.31 54.38±1.88 51.27±0.33
GAT 38.86±0.59 30.50±0.27 52.03±0.30 52.85±0.71 63.17±0.86 56.41±0.25 50.35±0.85
GCNII 36.79±0.28 31.07±0.97 51.33±0.37 50.83±1.32 62.27±0.84 57.62±1.10 45.53±0.15
GraphSAGE 37.60±0.67 31.69±0.44 50.59±0.71 53.03±0.71 61.70±0.93 58.72±0.81 52.17±0.65
BERT 37.59±0.08 31.50±0.54 49.95±0.04 28.58±1.24 62.50±0.43 51.40±0.15 49.59±0.04
DeBERTa 29.98±1.09 30.80±0.55 42.34±4.11 21.83±1.06 63.59±0.27 50.24±0.28 47.96±1.47
RoBERTa 28.23±0.00 23.32±4.55 47.24±4.20 20.36±0.40 63.68±0.00 50.32±0.21 49.52±0.12
GLEM 49.01±0.58 36.64±1.46 51.48±0.54 52.41±0.76 61.54±0.56 50.82±1.04 56.25±2.14
SimTeG 45.78±0.22 30.40±0.66 54.95±0.61 50.35±0.72 60.61±0.16 58.08±0.12 55.73±0.84
TAPE 47.08±0.20 29.77±0.28 54.87±0.50 59.83±0.77 61.25±0.59 58.10±0.72 59.76±0.12
ENGINE 42.32±0.66 35.70±0.19 54.74±0.09 49.42±0.45 63.88±0.20 57.54±0.77 57.96±0.13
UltraTAG-S 57.57±1.38 40.08±0.45 61.05±0.49 65.60±0.34 64.78±0.67 59.85±0.01 68.79±0.07

5.3 Ablation Study

In this part, we perform an ablation study on the CiteSeer and PubMed datasets to verify the effectiveness and robustness of each UltraTAG-S module, particularly in sparse scenarios. The results for PubMed and CiteSeer are illustrated in Figure 6. It is evident that each module of UltraTAG-S contributes significantly to model performance and robustness.

Specifically, the Text Augmentation module enhances the model’s ability to generalize by introducing diverse textual variations, leading to a performance improvement of up to 16.89% on the CiteSeer dataset and 55.45% on PubMed. This module is particularly effective in scenarios where textual diversity is limited, as it enriches the input data and reduces overfitting. The Structure Augmentation module further contributes to the model’s robustness by optimizing the graph structure, achieving performance improvements of 3.07% on CiteSeer and 5.90% on PubMed, especially in sparse data scenarios. As for the Structure Learning module, it demonstrates even more substantial gains, with improvements of 32.49% on CiteSeer and 40.09% on PubMed, highlighting its ability to capture complex relationships within the graph. It is evident that the Structure Learning module plays the most significant role in enhancing both the effectiveness and robustness of UltraTAG-S, as it not only improves accuracy but also ensures stable performance across varying data conditions. These results underscore the importance of combining these modules to achieve optimal performance in graph-based learning tasks.

Figure 6: Ablation Study on PubMed and CiteSeer. The x-axis represents the modules in the ablation study, where ’w/o TA’, ’w/o SA’, ’w/o SL’ denote the removal of Text Augmentation module, Structure Augmentation module and Structure Learning module, respectively. The y-axis represents accuracy in different ratios.

5.4 Complexity Analysis and Hyperparameter Setting

The computational complexity of our proposed method is primarily determined by two GNN operations. The first GNN computes the similarity matrix $\mathbf{S} \in \mathbb{R}^{N \times N}$ and updates the adjacency matrix $\mathcal{A}^{*}_{ij}$; this step involves pairwise computations between nodes, leading to a complexity of $O(N^{2})$. The second GNN performs node classification using the updated adjacency matrix $\tilde{\mathcal{A}}^{*}_{ij}$ and node features $\mathcal{H}$. With $m$ layers and $\mathcal{E} = O(N^{2})$ edges in the graph, the complexity of this operation scales as $O(m \cdot N^{2})$. Therefore, the total computational cost per epoch is dominated by these two steps, resulting in an overall complexity of $O(m \cdot N^{2})$.

For detailed parameter settings, we employ five random seeds {42, 43, 44, 45, 46}. $\text{GNN}_{1}(\cdot)$ and $\text{GNN}_{2}(\cdot)$ are jointly optimized with the Adam optimizer using a learning rate of 1e-2, weight decay of 5e-4, dropout of 0.5, and 100 epochs. Each GNN consists of 2 layers, with a similarity threshold of 0.8. The number of important nodes selected by PageRank is 10% of the total training nodes, and the acceptance threshold for LLM-based edge reconfiguration is 0.5. For fine-tuning the LM, the learning rate is set to 5e-5, with 3 epochs, a batch size of 8, and dropout of 0.3. The LLM used is Meta-Llama-3-8B-Instruct. All experiments were conducted on a system equipped with an NVIDIA A100 80GB PCIe GPU and an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz, with CUDA version 12.4.

6 Conclusion and Future Work

We first propose UltraTAG, a unified and domain-adaptive pipeline framework for LLM-enhanced TAG learning. To address the challenges existing methods face in real-world sparse scenarios, such as missing node texts or missing edges, we further introduce UltraTAG-S, a TAG learning paradigm tailored for sparse scenarios. UltraTAG-S resolves node-text sparsity through LLM-based text propagation and text augmentation strategies, and edge sparsity through PageRank- and LLM-based graph structure learning, achieving state-of-the-art performance in both ideal and sparse scenarios. In the future, we will further explore the pivotal role of text propagation strategies in TAG representation learning.

References

  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners, 2020. URL https://arxiv.org/abs/2005.14165.
  • Chen et al. (2020) Chen, M., Wei, Z., Huang, Z., Ding, B., and Li, Y. Simple and deep graph convolutional networks. In International Conference on Machine Learning, ICML, 2020.
  • Chen et al. (2024) Chen, Z., Mao, H., Wen, H., Han, H., Jin, W., Zhang, H., Liu, H., and Tang, J. Label-free node classification on graphs with large language models (llms), 2024. URL https://arxiv.org/abs/2310.04668.
  • Chien et al. (2022) Chien, E., Chang, W.-C., Hsieh, C.-J., Yu, H.-F., Zhang, J., Milenkovic, O., and Dhillon, I. S. Node feature extraction by self-supervised multi-scale neighborhood prediction, 2022. URL https://arxiv.org/abs/2111.00064.
  • Devlin et al. (2019) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. URL https://arxiv.org/abs/1810.04805.
  • Duan et al. (2023) Duan, K., Liu, Q., Chua, T.-S., Yan, S., Ooi, W. T., Xie, Q., and He, J. Simteg: A frustratingly simple approach improves textual graph learning, 2023. URL https://arxiv.org/abs/2308.02565.
  • Giles et al. (1998) Giles, C. L., Bollacker, K. D., and Lawrence, S. Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on Digital libraries, 1998.
  • Guo et al. (2023) Guo, D., Chu, Z., and Li, S. Fair attribute completion on graph with missing attributes, 2023. URL https://arxiv.org/abs/2302.12977.
  • Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems, NeurIPS, 2017.
  • Harris (1954) Harris, Z. S. Distributional structure. 1954. URL https://api.semanticscholar.org/CorpusID:86680084.
  • He et al. (2021) He, P., Liu, X., Gao, J., and Chen, W. Deberta: Decoding-enhanced bert with disentangled attention, 2021. URL https://arxiv.org/abs/2006.03654.
  • He et al. (2024a) He, X., Bresson, X., Laurent, T., Perold, A., LeCun, Y., and Hooi, B. Harnessing explanations: Llm-to-lm interpreter for enhanced text-attributed graph representation learning, 2024a. URL https://arxiv.org/abs/2305.19523.
  • He et al. (2024b) He, Y., Sui, Y., He, X., and Hooi, B. Unigraph: Learning a unified cross-domain foundation model for text-attributed graphs, 2024b. URL https://arxiv.org/abs/2402.13630.
  • Huang et al. (2024) Huang, X., Han, K., Yang, Y., Bao, D., Tao, Q., Chai, Z., and Zhu, Q. Can gnn be good adapter for llms?, 2024. URL https://arxiv.org/abs/2402.12984.
  • Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
  • Li et al. (2024a) Li, X., Liao, M., Wu, Z., Su, D., Zhang, W., Li, R.-H., and Wang, G. Lightdic: A simple yet effective approach for large-scale digraph representation learning. arXiv preprint arXiv:2401.11772, 2024a.
  • Li et al. (2024b) Li, Z., Li, R.-H., Liao, M., Jin, F., and Wang, G. Privacy-preserving graph embedding based on local differential privacy, 2024b. URL https://arxiv.org/abs/2310.11060.
  • Liu et al. (2019) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692.
  • Mernyei & Cangea (2020) Mernyei, P. and Cangea, C. Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901, 2020.
  • Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality, 2013. URL https://arxiv.org/abs/1310.4546.
  • Ni et al. (2019) Ni, J., Li, J., and McAuley, J. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proc. of EMNLP, 2019.
  • Pan et al. (2024) Pan, B., Zhang, Z., Zhang, Y., Hu, Y., and Zhao, L. Distilling large language models for text-attributed graph learning, 2024. URL https://arxiv.org/abs/2402.12022.
  • Rossi et al. (2022) Rossi, E., Kenlay, H., Gorinova, M. I., Chamberlain, B. P., Dong, X., and Bronstein, M. On the unreasonable effectiveness of feature propagation in learning on graphs with missing node features, 2022. URL https://arxiv.org/abs/2111.12128.
  • Sen et al. (2008) Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI magazine, 2008.
  • Singh & Sachan (2014) Singh, G. and Sachan, M. Multi-layer perceptron (mlp) neural network technique for offline handwritten gurmukhi character recognition. In 2014 IEEE International Conference on Computational Intelligence and Computing Research, pp.  1–5, 2014. doi: 10.1109/ICCIC.2014.7238334.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.
  • Wang et al. (2024) Wang, Y., Zhu, Y., Zhang, W., Zhuang, Y., Li, Y., and Tang, S. Bridging local details and global context in text-attributed graphs, 2024. URL https://arxiv.org/abs/2406.12608.
  • Wen & Fang (2023) Wen, Z. and Fang, Y. Augmenting low-resource text classification with graph-grounded pre-training and prompting. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pp.  506–516. ACM, July 2023. doi: 10.1145/3539618.3591641. URL http://dx.doi.org/10.1145/3539618.3591641.
  • Yan et al. (2023) Yan, H., Li, C., Long, R., Yan, C., Zhao, J., Zhuang, W., Yin, J., Zhang, P., Han, W., Sun, H., et al. A comprehensive study on text-attributed graphs: Benchmarking and rethinking. In Proc. of NeurIPS, 2023.
  • Zhang et al. (2022) Zhang, W., Yin, Z., Sheng, Z., Li, Y., Ouyang, W., Li, X., Tao, Y., Yang, Z., and Cui, B. Graph attention multi-layer perceptron. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pp.  4560–4570. ACM, August 2022. doi: 10.1145/3534678.3539121. URL http://dx.doi.org/10.1145/3534678.3539121.
  • Zhao et al. (2023) Zhao, J., Qu, M., Li, C., Yan, H., Liu, Q., Li, R., Xie, X., and Tang, J. Learning on large-scale text-attributed graphs via variational inference, 2023. URL https://arxiv.org/abs/2210.14709.
  • Zhu et al. (2024) Zhu, Y., Wang, Y., Shi, H., and Tang, S. Efficient tuning and inference for large language models on textual graphs, 2024. URL https://arxiv.org/abs/2401.15569.

Appendix A Datasets

This section provides a detailed introduction to the datasets used in the main content. The statistics of the TAG datasets we use are summarized in Table 5. The details of each dataset are as follows:

Table 5: Statistics of the TAG datasets. The datasets are partitioned in Train-Val-Test-Out mode, where 'Out' denotes data not involved in the training, validation, or test sets. All datasets are evaluated by node classification accuracy.

Dataset #Nodes #Edges #Classes #Split Ratio(%)
Cora 2,708 5,278 7 60-20-20-0
CiteSeer 3,186 4,277 6 60-20-20-0
PubMed 19,717 44,324 3 60-20-20-0
WikiCS 11,701 215,863 10 5-15-50-30
Instagram 11,339 144,010 2 10-10-80-0
Reddit 33,434 198,448 2 10-10-80-0
Ele-Photo 48,362 873,793 12 40-15-45-0
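As a minimal illustration of the Train-Val-Test-Out mode above, the split ratios can be turned into disjoint index sets. The sketch below is ours (the function name and fixed seed are assumptions), not the released code:

```python
import random

def split_nodes(num_nodes, ratios, seed=0):
    """Partition shuffled node indices by (train, val, test, out) percentages.
    'Out' nodes are held out from training, validation, and testing entirely."""
    assert sum(ratios) == 100
    idx = list(range(num_nodes))
    random.Random(seed).shuffle(idx)
    splits, start = {}, 0
    for name, pct in zip(("train", "val", "test", "out"), ratios):
        size = num_nodes * pct // 100  # floor; a few remainder nodes may be left over
        splits[name] = idx[start:start + size]
        start += size
    return splits

# WikiCS uses a 5-15-50-30 split over its 11,701 nodes.
splits = split_nodes(11701, (5, 15, 50, 30))
```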

Cora (Sen et al., 2008) dataset comprises 2,708 scientific publications, which are classified into seven categories: Case-based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory. Each publication in this citation network either cites or is cited by at least one other publication, forming a total of 5,278 edges. For our study, we utilize the dataset with raw texts provided by TAPE (He et al., 2024a), available in the Cora Dataset repository.

CiteSeer (Giles et al., 1998) dataset contains 3,186 scientific publications, categorized into six classes: Agents, Machine Learning, Information Retrieval, Databases, Human-Computer Interaction, and Artificial Intelligence. The objective is to predict the category of each publication using its title and abstract.

PubMed (Sen et al., 2008) dataset comprises 19,717 scientific publications from the PubMed database related to diabetes. These publications are categorized into three classes: Experimentally Induced Diabetes, Type 1 Diabetes, and Type 2 Diabetes. The associated citation network contains a total of 44,324 links.

WikiCS (Mernyei & Cangea, 2020) dataset is a Wikipedia-based resource developed for benchmarking Graph Neural Networks. It is derived from Wikipedia categories and includes 10 classes representing various branches of computer science, characterized by a high degree of connectivity. The node features are extracted from the text of the associated articles. The raw text for each node is obtained from the WikiCS Dataset repository.

Instagram (Huang et al., 2024) dataset serves as a social network where nodes represent users and edges correspond to following relationships. The classification task involves distinguishing between commercial and normal users within this network.

Reddit (Huang et al., 2024) dataset is a social network where nodes represent users, and node features are derived from the content of users’ historically published subreddits. Edges indicate whether two users have replied to each other. The classification task involves determining whether a user belongs to the top 50% in popularity, based on the average score of all their subreddits. This dataset is built on a public resource (Reddit Dataset) that collected replies and scores from Reddit users. The node text features are generated from each user’s historical post content, limited to their last three posts. Users are categorized as popular or normal based on the median of average historical post scores, with those exceeding the median classified as popular and the rest as normal.
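The median-based labeling just described can be sketched as follows (a toy reconstruction; the function name and example scores are our assumptions):

```python
import statistics

def label_reddit_users(avg_scores):
    """Label users 'popular' when their average historical post score exceeds
    the median of all users' averages, and 'normal' otherwise."""
    median = statistics.median(avg_scores.values())
    return {user: ("popular" if score > median else "normal")
            for user, score in avg_scores.items()}

# Four toy users with average post scores; the median here is 5.0.
labels = label_reddit_users({"u1": 3.0, "u2": 10.0, "u3": 7.0, "u4": 1.0})
```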

Ele-Photo (Yan et al., 2023) dataset is derived from the Amazon-Electronics dataset (Ni et al., 2019). In this dataset, nodes represent electronics-related products, and edges signify frequent co-purchases or co-views between products. Each node is labeled based on a three-level classification scheme for electronics products. User reviews serve as the textual attributes for the nodes; when multiple reviews are available for a product, the review with the highest number of votes is selected. If no such review exists, a random review is used. The task is to classify electronics products into 12 predefined categories.

Appendix B Baselines

This section contains detailed information about baselines:

MLP (Singh & Sachan, 2014) is a simple feed-forward neural network model, commonly used for baseline classification tasks. It consists of multiple layers of neurons, where each layer is fully connected to the previous one. The model is trained via backpropagation, with the final output layer producing predictions.

GCN (Kipf & Welling, 2017) is a graph-based neural network model that performs node classification tasks by aggregating information from neighboring nodes. The model is built on graph convolutional layers, where each node’s embedding is updated by combining the features of its neighbors, enabling it to capture the graph structure effectively.
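To make the neighbor-aggregation idea concrete, here is a minimal NumPy sketch of one GCN propagation step (an illustrative reconstruction of the Kipf & Welling update, not the baseline implementation used in the paper):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN propagation step: add self-loops, symmetrically normalize the
    adjacency, aggregate neighbor features, then apply a ReLU activation."""
    a_hat = adj + np.eye(adj.shape[0])                      # self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))           # D^-1/2
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)         # ReLU

# Toy 3-node path graph, 2-d features, identity weights.
adj = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
h = gcn_layer(adj, np.eye(3, 2), np.eye(2))
```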

GAT (Veličković et al., 2018) introduces attention mechanisms to graph convolutional networks, allowing nodes to weigh their neighbors differently when aggregating features. This attention mechanism helps GAT focus on the most informative neighbors, making it particularly effective in graphs with heterogeneous relationships between nodes.

GCNII (Chen et al., 2020) is an improved version of the GCN model, which integrates higher-order graph convolutions and a skip-connection strategy. This enhancement enables GCNII to better capture deep graph structures and mitigate the over-smoothing problem that arises in deep GCN architectures.

GraphSAGE (Hamilton et al., 2017) is an inductive framework for graph representation learning, where node embeddings are learned by sampling and aggregating features from neighbors. The model can be applied to large-scale graphs by utilizing different aggregation functions, such as mean, pooling, or LSTM-based aggregation.

BERT (Devlin et al., 2019) is a pre-trained transformer-based model that learns contextualized word embeddings by predicting masked words in a sentence. BERT’s bidirectional attention mechanism allows it to capture rich contextual information.

DeBERTa (He et al., 2021) improves upon BERT by introducing disentangled attention and enhanced decoding strategies. These innovations allow DeBERTa to better capture the relationships between different parts of the input text, leading to improved performance on multiple natural language understanding tasks.

RoBERTa (Liu et al., 2019) is an optimized version of BERT that increases training data size and model capacity, removes the Next Sentence Prediction (NSP) objective, and fine-tunes hyperparameters. These modifications lead to improved performance over BERT on many benchmark tasks, especially in natural language understanding.

GLEM (Zhao et al., 2023) is a method for learning on large TAGs. It uses a variational EM framework to alternately update LMs and GNNs, improving scalability and performance in node classification.

SimTeG (Duan et al., 2023) is a straightforward yet effective approach for textual graph learning. It first conducts parameter-efficient fine-tuning (PEFT) of an LM using downstream task labels, then generates node embeddings from the fine-tuned LM. These embeddings are further used by a GNN for training on the same task.

TAPE (He et al., 2024a) is an approach for TAG representation learning. It uses LLMs to generate predictions and explanations, which are then transformed into node features by fine-tuning a smaller LM. These features are used to train a GNN.

ENGINE (Zhu et al., 2024) is an efficient tuning method for integrating LLMs and GNNs on TAGs. It attaches a tunable G-Ladder to each LLM layer to capture structural information, freezing the LLM parameters to reduce training complexity. ENGINE with caching can speed up training by 12x, and ENGINE (Early) uses dynamic early exit, achieving up to 5x faster inference with minimal performance loss.

G2P2 (Wen & Fang, 2023) is a model for low-resource text classification with two main stages. During pre-training, it jointly trains a text encoder and a graph encoder using three graph interaction-based contrastive strategies (text-node, text-summary, and node-summary interactions) to learn a dual-modal embedding space. In downstream classification, it uses prompting: handcrafted discrete prompts for zero-shot classification, and continuous prompts with graph context-based initialization for few-shot classification.

LLMGNN (Chen et al., 2024) is a pipeline for label-free node classification on graphs. It uses LLMs to annotate nodes and GNNs for prediction. It selects nodes according to annotation difficulty, obtains confidence-aware annotations, and post-filters them to improve annotation quality, achieving good results at low cost.

GIANT (Chien et al., 2022) is a self-supervised learning framework for graph-guided numerical node feature extraction. It addresses the graph-agnostic feature extraction issue in standard GNN pipelines. By formulating neighborhood prediction as an XMC problem and using XR-Transformers, it fine-tunes language models with graph information.

GraphAdapter (Huang et al., 2024) is an approach that uses a GNN as an efficient adapter for LLMs to model TAGs. It conducts language-structure pre-training to learn jointly with frozen LLMs, integrating structural and textual information. After pre-training, it can be fine-tuned with prompts for downstream tasks. Experiments show it outperforms baselines on multiple datasets.

Appendix C LLM Prompts

The LLM employed in our study is Meta-Llama-3-8B-Instruct, which is utilized for both text augmentation and structure augmentation tasks. This section provides a comprehensive overview of all the prompts we designed and implemented. Each prompt follows a consistent “Dataset Description + Question” structure, where the dataset description serves to contextualize the query and ensure clarity. The detailed dataset descriptions are:

Cora: Now, here is a paper from the Cora dataset. This paper falls into one of seven categories: Case-based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory.
CiteSeer: Now, here is a paper from the Citeseer dataset. This paper falls into one of six categories: Agents, Machine Learning, Information Retrieval, Databases, Human-Computer Interaction, or Artificial Intelligence.
PubMed: The following is a paper from the PubMed dataset, which contains 19,717 scientific publications related to diabetes. These publications are categorized into three classes: Experimentally Induced Diabetes, Type 1 Diabetes, and Type 2 Diabetes.
WikiCS: Here is an article from the WikiCS dataset. This dataset is a Wikipedia-based resource developed for benchmarking Graph Neural Networks (GNNs). It is derived from Wikipedia categories and includes 10 classes representing various branches of computer science, characterized by a high degree of connectivity. The 10 classes are Computational Linguistics, Databases, Operating Systems, Computer Architecture, Computer Security, Internet Protocols, Computer File Systems, Distributed Computing Architectures, Web Technologies, and Programming Languages.
Instagram: This is a post from Instagram, a social network where edges represent following relationships and nodes represent users. The task is to classify users into two categories: commercial and normal.
Reddit: This is a post from the Reddit dataset, a social network where nodes represent users, and node features are derived from the content of users’ historically published subreddits. Edges represent whether two users have replied to each other. The task is to classify users as belonging to the top 50 percent in popularity, based on the average score of all their subreddits. Node text features are generated from the content of each user’s last three posts. Users are categorized as ’popular’ or ’normal’ based on the median of their average historical post scores, with those above the median classified as ’popular’ and the rest as ’normal’.
Ele-Photo: Here is a product review from the Ele-Photo dataset. The Ele-Photo dataset is derived from the Amazon-Electronics dataset. In this dataset, nodes represent electronics products, and edges indicate frequent co-purchases or co-views between products. Each node is labeled according to a three-level classification scheme for electronics products. User reviews serve as the textual attributes for the nodes; when multiple reviews are available for a product, the review with the highest number of votes is selected. If no such review exists, a random review is used. The task is to classify electronics products into 12 predefined categories. The categories are: Amazon Echo, Camera, Cell Phones, Clothing, Computers, Home and Kitchen, Laptops, Music, Office Supplies, Personal Care, Shoes, Sports and Outdoors.

For different inference scenarios, our approach to querying the LLM is as follows. Due to the uniformity of the query format, the following demonstrations (Key Words, Soft Labels, Summary, and Edge Reconfigure) use the Cora dataset as an example:

Key Words: Please help me identify the five keywords from its title and abstract that are most relevant for classification, and directly output the keywords. The title and abstract of the paper are as follows: <Title><Abstract>
Soft Labels: Based on its title and abstract, please predict the most appropriate label for this paper and provide only the label as your response. The title and abstract of the paper are as follows: <Title><Abstract>
Summary: Please summarize the title and abstract to improve their suitability for the classification task. Output only the summary text, without including any irrelevant content. The title and abstract of the paper are as follows: <Title><Abstract>
Edge Reconfigure: You are provided with the text information of two nodes and their predicted category pseudo-labels. Use this information to evaluate whether an edge should exist between the two nodes, and return a probability value between 0 and 1 representing the likelihood of the edge’s existence. Only output the probability value, without any additional or irrelevant content. As for Node 1: <Title 1><Abstract 1>. Your prediction label is <SoftLabel 1>; As for Node 2: <Title 2><Abstract 2>. Your prediction label is <SoftLabel 2>.
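The "Dataset Description + Question" structure can be assembled programmatically. The sketch below is illustrative (the function name and example title/abstract are our assumptions); it fills the `<Title><Abstract>` placeholders of the Key Words query:

```python
CORA_DESC = ("Now, here is a paper from the Cora dataset. This paper falls into "
             "one of seven categories: Case-based, Genetic Algorithms, Neural "
             "Networks, Probabilistic Methods, Reinforcement Learning, Rule "
             "Learning, and Theory.")

KEY_WORDS_Q = ("Please help me identify the five keywords from its title and "
               "abstract that are most relevant for classification, and directly "
               "output the keywords. The title and abstract of the paper are as "
               "follows: <{title}><{abstract}>")

def build_prompt(dataset_desc, question, **fields):
    """Assemble a query in the 'Dataset Description + Question' format."""
    return dataset_desc + " " + question.format(**fields)

prompt = build_prompt(CORA_DESC, KEY_WORDS_Q,
                      title="A Sample Paper Title",
                      abstract="A sample abstract about neural networks.")
```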

Appendix D Ablation Study and Backbones

This section presents more detailed ablation study results. We conducted ablation studies under all sparsity conditions on the Cora, CiteSeer, and PubMed datasets, as shown in Table 7. Furthermore, we tested the impact of different language-model backbones on node classification accuracy, performing comparative experiments under the same conditions, as shown in Table 6. From the table, it can be observed that under lower data sparsity, using BERT for text encoding yields superior classification accuracy; however, when data sparsity reaches 80%, RoBERTa yields better classification accuracy.

Additionally, we briefly explored the impact of different text propagation strategies on the model’s classification accuracy, with the experimental results presented in Table 8. From the table, it is evident that enhancing the original text data with more effective LLM-augmented texts can significantly improve the performance across all sparse conditions.
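As an illustration of how LLM-augmented texts could be combined with the original text before encoding (the field labels and concatenation order here are our assumptions, not the paper's exact aggregator):

```python
def aggregate_texts(original, summary=None, keywords=None, soft_label=None):
    """Concatenate LLM-generated augmentation texts onto the original node
    text, mirroring the 'OT+...' strategies compared in Table 8."""
    parts = [original]
    if summary:
        parts.append("Summary: " + summary)
    if keywords:
        parts.append("Keywords: " + ", ".join(keywords))
    if soft_label:
        parts.append("Predicted label: " + soft_label)
    return " ".join(parts)

# 'OT+SKWSL'-style input for one toy node.
text = aggregate_texts("A study of GNNs on citation graphs.",
                       summary="Classifies papers with GNNs.",
                       keywords=["GNN", "citation network"],
                       soft_label="Neural Networks")
```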

Table 6: Performance Comparison with Different Text Encoders in the ideal scenario and in sparse scenarios with ratios of 20%, 50%, and 80%. The optimal performance is in bold.
Encoder Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
BERT 90.96±0.45 78.68±0.21 91.99±1.21
DeBERTa 83.39±0.39 76.02±1.45 92.06±1.77
RoBERTa 88.38±1.82 77.74±0.76 92.41±0.30
Sparse Ratio 20% 20% 20%
BERT 88.93±0.74 77.12±0.28 89.38±0.29
DeBERTa 81.55±0.59 72.88±0.61 89.33±0.11
RoBERTa 88.75±0.22 74.92±0.46 89.88±0.21
Sparse Ratio 50% 50% 50%
BERT 83.95±0.80 67.08±0.28 80.76±0.26
DeBERTa 73.99±0.45 62.70±1.08 80.53±0.21
RoBERTa 80.81±0.54 65.05±0.14 80.91±0.29
Sparse Ratio 80% 80% 80%
BERT 57.57±1.38 40.08±0.45 60.70±0.74
DeBERTa 52.21±0.76 39.03±0.64 60.62±0.76
RoBERTa 54.61±1.06 38.87±0.88 61.05±0.49
Table 7: Detailed Performance Comparison of the Ablation Study. 'w/o TA' means without the Text Augmentation module, 'w/o SA' without the Structure Augmentation module, and 'w/o SL' without the Structure Learning module.
Method Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
w/o TA 89.59±1.39 77.37±0.96 91.29±0.90
w/o SA 90.59±0.86 78.37±1.63 91.99±1.75
w/o SL 88.38±0.85 77.59±1.44 86.36±1.23
UltraTAG-S 90.96±0.45 78.68±0.21 92.41±0.30
Sparse Ratio 20% 20% 20%
w/o TA 86.52±0.21 75.12±0.45 88.38±1.50
w/o SA 87.93±1.72 77.01±0.31 89.18±0.58
w/o SL 86.72±1.33 68.34±0.90 85.42±1.34
UltraTAG-S 88.93±0.74 77.12±0.28 89.88±0.21
Sparse Ratio 50% 50% 50%
w/o TA 78.41±0.32 59.94±1.07 52.05±0.45
w/o SA 81.95±1.30 65.08±1.06 78.76±0.22
w/o SL 74.54±0.54 50.63±1.42 69.17±0.71
UltraTAG-S 83.95±0.80 67.08±0.28 80.91±0.29
Sparse Ratio 80% 80% 80%
w/o TA 48.60±1.83 34.29±1.42 44.93±1.22
w/o SA 56.83±1.45 39.50±0.54 57.65±1.21
w/o SL 38.01±1.08 35.71±0.04 43.58±0.45
UltraTAG-S 57.57±1.38 40.08±0.45 61.05±0.49
Table 8: Performance Comparison with Different Augmentation Texts Generated by the LLM and Different Text Aggregation Strategies. 'OT' means original texts, 'Su' summary, 'KW' key words, 'SL' soft labels, 'SKWSL' the combination of summary, key words, and soft labels, and '+' means concatenation.
Texts Cora CiteSeer PubMed
Sparse Ratio 0% 0% 0%
OT 89.48±0.56 75.24±1.66 90.62±0.62
OT+Su 90.22±1.23 77.59±0.24 91.89±1.19
OT+KW 90.41±0.89 77.59±1.35 91.84±0.94
OT+SL 89.48±1.78 77.74±0.31 91.99±0.29
OT+SKWSL 90.96±0.45 78.68±0.21 92.41±0.30
Sparse Ratio 20% 20% 20%
OT 88.12±0.34 74.92±0.71 87.78±0.74
OT+Su 88.76±0.67 77.12±1.88 89.12±1.08
OT+KW 88.43±1.12 77.27±0.42 89.48±0.95
OT+SL 88.53±0.98 75.86±1.53 89.38±0.27
OT+SKWSL 88.93±0.74 77.12±0.28 89.88±0.21
Sparse Ratio 50% 50% 50%
OT 82.47±0.21 64.89±1.21 79.54±0.39
OT+Su 82.66±0.76 65.52±1.65 80.78±0.61
OT+KW 83.58±0.43 65.83±0.25 80.65±0.90
OT+SL 83.21±0.65 66.46±0.41 80.78±0.28
OT+SKWSL 83.95±0.80 67.08±0.28 80.91±0.29
Sparse Ratio 80% 80% 80%
OT 54.98±0.91 39.34±0.96 60.40±0.77
OT+Su 57.01±1.77 39.66±0.26 60.83±1.07
OT+KW 57.43±0.32 39.97±0.73 60.78±0.60
OT+SL 57.38±0.68 39.81±0.40 60.70±0.82
OT+SKWSL 57.57±1.38 40.08±0.45 61.05±0.49

Appendix E Robustness Comparison

This section reports the performance of various existing methods on node classification tasks across multiple datasets and sparse scenarios. The results for 20% and 50% sparsity ratios are shown in Tables 9 and 10, respectively. As shown in the tables, under 20% and 50% data sparsity, UltraTAG-S still achieves the best node classification performance and robustness among all existing methods, with improvements of up to 4.6% and 15.4%, respectively.
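The sparse scenarios referenced throughout can be reproduced with a simple random-removal routine like the following (our sketch of the described protocol, with an assumed function name and seed, not the authors' exact script):

```python
import random

def sparsify(texts, edges, ratio, seed=0):
    """Simulate a sparse scenario: blank out a `ratio` fraction of node texts
    and drop the same fraction of edges, both uniformly at random."""
    rng = random.Random(seed)
    blanked = set(rng.sample(range(len(texts)), int(len(texts) * ratio)))
    sparse_texts = ["" if i in blanked else t for i, t in enumerate(texts)]
    kept = rng.sample(edges, len(edges) - int(len(edges) * ratio))
    return sparse_texts, kept

# Toy graph: 10 documents on a 9-edge path, sparsified at a 20% ratio.
texts = [f"doc {i}" for i in range(10)]
edges = [(i, i + 1) for i in range(9)]
sparse_texts, kept_edges = sparsify(texts, edges, 0.2)
```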

Table 9: Robustness Comparison in Sparse Scenarios with a Ratio of 20%, i.e., 20% of node texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Ele-Photo
MLP 40.70±3.10 48.37±5.08 59.59±4.25 42.27±4.84 62.18±3.75 54.68±2.05 48.51±2.31
GCN 75.50±3.22 64.86±1.31 75.57±5.08 72.33±3.94 60.77±4.89 52.61±2.05 70.85±4.15
GAT 68.67±2.06 65.64±1.01 72.93±0.92 68.10±0.68 64.90±0.35 57.43±1.24 61.05±0.85
GCNII 71.77±0.88 67.43±0.61 73.75±0.79 68.54±0.84 64.68±0.35 61.08±0.36 58.58±0.84
GraphSAGE 73.80±3.29 60.09±2.06 73.39±5.02 66.50±2.95 56.92±7.35 59.03±1.34 63.06±8.04
BERT 69.37±0.32 63.87±0.14 83.22±0.01 61.35±0.69 63.87±0.11 57.31±0.23 65.48±0.00
DeBERTa 58.21±8.94 53.53±1.96 82.78±0.35 45.93±3.86 61.65±1.07 56.20±3.48 65.34±0.45
RoBERTa 69.88±0.72 65.13±0.41 83.37±0.03 60.98±0.45 64.84±0.17 50.02±0.04 64.89±0.08
GLEM 85.71±2.01 68.71±1.54 82.36±0.37 72.59±3.01 63.37±0.18 55.82±0.21 72.75±1.94
SimTeG 82.34±0.74 70.10±0.60 85.65±0.41 71.33±0.13 62.46±0.62 60.28±0.35 75.46±0.28
TAPE 87.78±0.53 71.11±0.39 87.25±0.75 78.97±0.22 62.17±0.92 61.08±0.66 80.81±0.41
ENGINE 86.27±0.67 73.70±0.33 88.10±0.13 79.43±0.25 65.53±0.22 61.45±0.38 80.34±0.09
UltraTAG-S 88.93±0.74 77.12±0.28 89.88±0.21 81.72±0.18 66.08±0.70 61.61±0.12 83.17±0.06
Table 10: Robustness Comparison in Sparse Scenarios with a Ratio of 50%, i.e., 50% of node texts and edges are removed at random to simulate real-world scenarios. Optimal performance is in bold and sub-optimal performance is underlined.
Methods Cora CiteSeer PubMed WikiCS Instagram Reddit Ele-Photo
MLP 36.35±2.90 40.38±1.68 49.14±4.41 34.29±3.24 62.65±2.18 52.09±1.25 46.57±1.75
GCN 64.61±2.81 48.71±2.16 63.71±0.91 64.12±9.69 58.96±9.54 55.48±3.03 62.61±3.70
GAT 60.63±1.80 51.50±0.61 66.82±0.80 64.60±0.37 64.51±0.23 56.56±2.00 59.32±0.92
GCNII 62.36±0.87 49.66±0.73 67.01±0.67 63.62±0.41 64.08±0.23 59.27±1.15 53.77±1.58
GraphSAGE 54.76±3.89 47.68±0.84 64.74±5.26 64.56±1.17 62.09±3.04 60.14±0.83 62.61±2.51
BERT 55.58±0.24 48.08±0.07 66.28±0.26 45.21±0.81 63.50±0.09 54.35±0.06 57.41±0.04
DeBERTa 34.41±6.99 43.81±3.29 65.75±0.31 32.91±0.73 63.73±0.39 50.90±1.65 57.39±0.19
RoBERTa 53.09±0.40 44.32±1.02 66.28±0.04 42.73±2.54 63.07±0.18 50.97±1.77 57.25±0.00
GLEM 64.84±1.63 53.43±0.25 70.51±1.95 67.07±3.08 62.43±0.21 53.85±0.08 65.25±1.58
SimTeG 72.06±0.59 58.11±0.40 76.00±0.55 65.34±0.46 61.55±0.79 59.84±0.67 67.76±0.45
TAPE 78.73±0.34 54.31±0.78 77.18±0.46 73.62±0.63 61.40±0.26 60.18±0.19 76.21±0.69
ENGINE 70.85±0.48 56.30±0.12 75.42±0.15 71.72±0.47 64.74±0.05 60.18±0.18 73.40±0.23
UltraTAG-S 83.95±0.80 67.08±0.28 80.91±0.29 77.45±0.33 65.61±0.12 60.34±0.21 79.21±0.06