¹¹institutetext: Institute of Theoretical Informatics, Karlsruhe Institute of Technology,
Kaiserstr. 12, 76131 Karlsruhe, Germany
¹¹email: [email protected]²²institutetext: Institute of Nanotechnology, Karlsruhe Institute of Technology,
Kaiserstr. 12, 76131 Karlsruhe, Germany
²²email: [email protected]

Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification

Jonas Teufel 11 0000-0002-9228-9395 Annika Leinweber 11
Pascal Friederich^, 1122 0000-0003-4465-1465

Abstract

Explainable AI (xAI) interventions aim to improve interpretability for complex black-box modes, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular structure cause the greatest deviation in the predicted property. However, such explanations only allow for meaningful scientific insights if they reflect the distribution of the true underlying property—a feature we define as counterfactual truthfulness. To enhance truthfulness, we propose the integration of uncertainty estimation techniques to filter high-uncertainty counterfactuals. Through computational experiments with synthetic and real-world datasets, we demonstrate that combining traditional deep ensembles and mean variance estimation can substantially reduce average and maximum model error for out-of-distribution settings and especially increase counterfactual truthfulness. Our results highlight the importance of incorporating uncertainty estimation into counterfactual explainability, especially considering the relative effectiveness of low-effort strategies such as model ensembles.

Keywords:

Counterfactual Explanations Truthfulness Graph Neural Networks Uncertainty Estimation Molecular Property Prediction

1 Introduction

Refer to caption — Figure 1: a Truthful explanations should not only reflect the model’s behavior but also the properties of the underlying true data distribution. b Uncertainty quantification methods predict an additional uncertainty value as an approximation of the model’s prediction error. By filtering high-uncertainty elements, it is possible to reduce the cumulative error and, by extension, the fraction of truthful counterfactuals of the remaining set.

Recent advances in the study of artificial intelligence (AI) have revolutionized various branches of society, industry, and science. Despite their numerous advantages, the opaque black-box nature of modern AI methods remains a challenge. Although complex neural network models often display superior predictive performance, their inner workings remain largely intransparent to humans. Explainable AI (xAI) aims to address these shortcomings by developing methods to better understand the inner workings of these complex models.

Traditionally, xAI methods are meant to improve trust in human-AI relationships, provide tools for model debugging, and ensure regulatory compliance [9]. More recently, xAI has been proposed as a potential source of new scientific insight [51, 24, 41, 40]. This potential of gaining new insights primarily concerns tasks about which little to no prior human knowledge exists. By elucidating the behavior of high-performing models in complex property prediction tasks, xAI can offer insights not only into the model’s behavior but, by extension, into the underlying rules and relationships governing the data itself. However, to gain meaningful insights, the given explanations must be truthful regarding this underlying data distribution. However, for these explanations to yield meaningful insights, they must accurately reflect the true data distribution. This imposes a more stringent requirement for explanations: they must be valid not only in terms of the model’s behavior but also with respect to the predicted property itself.

In this work, we explore counterfactual explainability within chemistry and materials science—a domain where insights derived from XAI would have a substantial impact to accelerate scientific discovery. In short, counterfactual explanations locally explain the model’s behavior by constructing multiple "what if?" scenarios of minimally perturbed input configurations that cause large deviations in the model’s prediction. By itself, a counterfactual only has to explain the model’s behavior, regardless if that behavior reflects the underlying property, causing significant conceptual overlap between the counterfactuals and adversarial examples [10]. As an extension, we define a truthful counterfactual as one that satisfies constraints regarding the model and the underlying ground truth—causing a large deviation of the prediction while maintaining low prediction error (see Figure 1).

Given the general unavailability of ground truth labels for counterfactual samples, we propose uncertainty quantification methods as a means to approximate prediction error and ultimately improve overall counterfactual truthfulness by filtering high-uncertainty explanations. We empirically investigate various common methods of uncertainty quantification and find that an ensemble of mean-variance estimators (MVE) yields the greatest improvement of relative model error and can substantially improve counterfactual truthfulness. Qualitative results affirm these findings, showing that uncertainty-based filtering removes unlikely molecular configurations that lie outside the training distribution. Our results underscore the potential benefits of integrating uncertainty estimation into explainability methods, such as counterfactual explanations.

2 Related Work

2.0.1 Graph Counterfactual Explanations.

Insights from social science indicate that humans prefer explanations to be contrastive—to explain why something happened instead of something else [26]. Counterfactuals aim to provide such contrastive explanations by constructing hypothetical "what if?" scenarios to show which small perturbations to a given input sample would have resulted in a significant deviation from the original prediction outcome.

While Verma et al. [44] present an extensive general review on the topic of counterfactual explanations across different data modalities, Prado-Romero et al. [30] explore specifically counterfactual explanations in the graph processing domain. The authors find that the existing approaches can be categorized by which kinds of perturbations to the input graph are considered. Many existing methods create perturbations using masking strategies on the node-, edge- or feature-level in which masks are optimized to maximize output deviations [23, 3, 39]. However, masking-based strategies often yield uninformative explanations for molecular property prediction. In this context, it is more insightful to perturb the molecular graph by adding or removing bonds and atoms [38]. Some authors successfully adopt such approaches for molecular property predictions [46, 27, 29]. One particular difficulty for these kinds of approaches is the necessity of including domain knowledge to ensure that modifications result in a valid graph structure (e.g. chemically feasible molecules). In one example, Numeroso and Bacciu [29], propose to train an external reinforcement learning agent to propose suitable graph modifications resulting in counterfactual candidates for molecular property predictions. In this case, the authors also introduce domain knowledge by restricting the action space of the agent to chemically feasible modifications.

2.0.2 Uncertainty Quantification.

Predictive machine learning models often encounter uncertainty from various sources, including, for example, inherent measurement noise (aleatoric uncertainty) or regions of the input space insufficiently covered in the training distribution (epistemic uncertainty). Consequently, a model’s predictions may be more accurate for some input samples than for others. Uncertainty quantification methods aim to measure this variability, identifying those samples that a model can predict with greater confidence [11, 1].

Similarly to the broader field of xAI, one aim of uncertainty quantification is to improve user trust by indicating the reliability of a prediction [35]. Beyond uncertainty quantification for target predictions, Longo et al. [22] propose to introduce elements of uncertainty estimation on the explanation level as well.

Traditionally used methods for uncertainty quantification include the joint prediction of a distribution’s mean and variance (MVE) [28], assessing the variance between the predictions of a Deep Ensemble [16] and using bayesian neural networks (BNNs) [42, 4, 12] which aim to directly predict an output distribution rather than individual values. More recent alternatives include stochastic weight averaging gaussians (SWAG) [25] and the idea of Repulsive Ensembles [7, 43] as an extension to Deep Ensembles built on the general framework of particle based variational inference (ParVI) [21, 20] introducing explicit member diversification.

In the domain of molecular property prediction, Hirschfeld et al. [13] and Scalia et al. [32] independently investigate the performance of various traditional uncertainty quantification methods across many standard property prediction datasets. Busk et al. specifically investigate uncertainty quantification using an ensemble of graph neural networks [6].

2.0.3 Uncertainty Quantification and Counterfactuals.

Using xAI to gain new insights into the underlying properties of the data distribution requires the given explanations to be truthful regarding the true property values. In the same context, Freiesleben [10] addresses the conceptual distinction between counterfactual explanations and adversarial examples. Although essentially based on the same optimization objective, the author argues that adversarial examples necessitate a misprediction while counterfactual explanations should be different—yet still correct.

While uncertainty quantification in the context of counterfactual explanations remains largely unexplored, we find Delaney et al. [8] to use uncertainty quantification methods as a possible measure to increase counterfactual reliability for image classification tasks. In terms of UQ interventions, the authors explore Trust Scores and Monte Carlo dropout, finding Trust Scores to be an effective measure. Schut et al. [33] propose the direct optimization of an ensemble-based uncertainty measure as a secondary objective for the generation of realistic counterfactuals for image classification. In another work, Antorán et al. [2] introduce Counterfactual Latent Uncertainty Explanations (CLUE), which is subsequently extended to $\delta$ -CLUE [18] and GLAM-CLUE [19]. Instead of employing uncertainty quantification to improve counterfactual explainability, CLUE aims to use counterfactual explanations to explain uncertainty estimates in probabilistic models—effectively explaining why certain inputs are more uncertain than others.

3 Method

In this work, we explore the generation of counterfactual samples $x^{\prime}$ for molecular property prediction tasks, whereby a graph neural network model is trained to regress a continuous property $y$ of a given molecular graph $x$ . To gain meaningful insights on the underlying property, we specifically focus on the generation of truthful counterfactuals which maximize the prediction difference $|\hat{y}-\hat{y}^{\prime}|$ between original prediction $\hat{y}$ and counterfactual prediction $\hat{y}^{\prime}$ while maintaining a minimal ground truth error $|y^{\prime}-\hat{y}^{\prime}|$ .

3.1 Graph Neural Network Regressors

We represent each molecule as a generic graph structure $x=(\mathcal{N},\mathcal{E},\mathbf{V}^{(0)},\mathbf{U}^{(0)})\in\mathcal{X}$ defined by a set of $N$ node indices $\mathcal{N}=\left\{1,\dots,N\right\}$ and a list of $E$ edge tuples $\mathcal{E}\subseteq{N}\times\mathcal{N}$ where a tuple $(i,j)\in\mathcal{E}$ indicates an edge between nodes $i$ and $j$ . The nodes of this graph structure represent the atoms of the molecule and the edges represent the chemical bonds between the atoms. Furthermore, each graph structure consists of an initial node feature tensor $\mathbf{V}^{(0)}\in\mathbb{R}^{N\times V}$ and an initial edge feature tensor $\mathbf{U}^{(0)}\in\mathbb{R}^{E\times U}$ .

In the case of molecular graphs, the node features contain a one-hot encoding of the atom type, the atomic weight, and the charge, whereas the edge features contain a one-hot encoding of the bond type. For a given dataset of molecules annotated with continuous target values $y\in\mathbb{R}$ , the aim is to train a graph neural network regressor

f_{\theta}:\quad\mathcal{X}\rightarrow\mathbb{R};\quad(\mathcal{N},\mathcal{E}% ,\mathbf{V}^{(0)},\mathbf{U}^{(0)})\mapsto\hat{y}

(1)

with learnable parameters $\theta$ to find an optimal set of parameters

\theta^{\ast}=\arg\min_{\theta}\sum_{x\in\mathcal{X}_{\text{}}}(y-f_{\theta}(x% ))^{2}

(2)

that minimizes the mean-squared error between the predicted value $\hat{y}$ and target $y$ value.

3.2 Molecular Counterfactual Generation

Counterfactual explanations map a model’s local decision boundary by producing a set of minimally perturbed input instances that induce maximal predictive divergence, thereby revealing which kinds of modifications the model is especially sensitive toward.

Given the combination $(x,\hat{y})$ of an original input element $x$ and its corresponding model prediction $\hat{y}$ , a counterfactual sample $(x^{\prime},\hat{y}^{\prime})$ consists of input samples $x^{\prime}$ which are minimally

\underset{x^{\prime}}{\min}\;\text{dist}(x^{\prime},x)

(3)

different from the original input. At the same time, these minimally perturbed input samples should cause a large deviation

\underset{x^{\prime}}{\max}\;\text{dist}(\hat{y}^{\prime},\hat{y})

(4)

in the model’s prediction.

We generate counterfactual samples according to the given constraints by adopting a procedure similar to that presented by Numeroso and Bacciu [29]. However, we omit the training of a reinforcement learning agent to induce the local changes of the molecular structure and opt for a complete enumeration of the entire $k$ -edit neighborhood instead. Due to the limited number of chemically valid modifications and the relatively small size of molecular graphs, we find it computationally feasible to generate all possible modifications to a given input molecule $x$ . As possible modifications, we consider the addition, deletion, and substitution of individual atoms and bonds that satisfy the constraints of atomic valence. Subsequently, the predictive model $f_{\theta}$ is used to obtain the predicted values for all the perturbed graph structures. The structures are then ranked according to the mean absolute prediction difference

\text{dist}(\hat{y}^{\prime},\hat{y})=|\hat{y}^{\prime}-\hat{y}|

(5)

regarding the original prediction $\hat{y}$ . We finally choose the 10 elements with the highest prediction difference to be presented as counterfactual explanations.

At this point, it is worth noting that other possible variations of choosing counterfactual explanations exist. Instead of using the criteria of absolute distance, depending on the use case, it might make sense to select counterfactuals only among those samples with monotonically higher or lower predicted values.

3.3 Counterfactual Truthfulness

A counterfactual explanation $x^{\prime}$ has to be a minimal perturbation of the original sample $x$ while causing a large deviation $|\hat{y}-\hat{y}^{\prime}|$ in the model’s prediction. To gain meaningful insight from such counterfactual explanations and to distinguish them from mere adversarial examples [10], we impose the additional restriction of truthfulness. We define a truthful counterfactual to additionally maintain a low error $|y^{\prime}-\hat{y}^{\prime}|$ regarding its ground truth label $y^{\prime}$ .

For classification problems, we would understand a truthful counterfactual not only to flip the predicted label but to also correctly be associated with that label. For the regression case, there may exist various equally useful definitions of counterfactual truthfulness. In this context, we define a regression counterfactual as truthful if its ground truth error interval does not overlap with the error interval of the original prediction (see Figure 2). This definition ensures that there is at least some predictive divergence with the predicted directionality.

For a given original sample, its absolute ground truth error

\epsilon=|y-\hat{y}|

(6)

is calculated as the absolute difference of the true value $y$ and the predicted label $\hat{y}$ . The ground truth error

\epsilon^{\prime}=|y^{\prime}-\hat{y}^{\prime}|

(7)

of a counterfactual sample $x^{\prime}$ can be calculated accordingly. We subsequently define the truthfulness

\mathrm{tr}(x^{\prime})=\begin{cases}1&[y^{\prime}-\epsilon^{\prime},y^{\prime% }+\epsilon^{\prime}]\cap[y-\epsilon,y+\epsilon]=\emptyset\\ 0&\text{otherwise}\end{cases}

(8)

of an individual counterfactual as a binary property which is fulfilled if its ground truth error interval does not overlap with the error interval of the original sample.

Beyond the truthfulness of individual counterfactual samples, we are primarily interested in the average truthfulness across a whole set $\mathcal{X}^{\prime}\subset\mathcal{X}$ of counterfactuals. We, therefore, define the relative truthfulness

\mathrm{Tr}(\mathcal{X}^{\prime})=\frac{1}{|\mathcal{X}^{\prime}|}\sum_{x^{% \prime}\in\mathcal{X}^{\prime}}\mathrm{tr}(x^{\prime})\quad\in[0,1]

(9)

for a set $\mathcal{X}^{\prime}$ of counterfactuals as the ratio of individual truthful counterfactuals it contains.

At this point, it should be noted that evaluating counterfactual truthfulness proves difficult. Since the generated counterfactual samples generally aren’t contained in existing datasets, evaluating the truthfulness would not only require ground truth labels but rather a ground truth oracle. Consequently, truthfulness can only be evaluated for a small selection of tasks for which such an oracle exists.

3.4 Error Reduction through Uncertainty Thresholding

For the given definition of truthfulness, one viable method of improving the relative truthfulness $\mathrm{Tr}(\mathcal{X}^{\prime})$ is to filter counterfactuals with especially large error intervals. Since it is generally impossible to infer the true label, and by extension the truthfulness, of a given input $x$ in practice, an alternative is to approximate the ground truth error by means of uncertainty quantification (UQ). If the predicted uncertainty proves to be a suitable approximation of the true error, filtering high-uncertainty counterfactuals should have the same effect of improving the relative truthfulness.

This objective can be framed as an overall reduction of the cumulative error

\Gamma^{g}=g\left(\left\{|\hat{y}_{i}-y_{i}|\,:\,x_{i}\in\mathcal{X}^{\circ}% \right\}\right)

(10)

for a given set $\mathcal{X}^{\circ}\subset\mathcal{X}$ of input elements, where $g(\cdot)$ is some function that accumulates individual error values (e.g. mean, median, max).

In the context of uncertainty quantification, each sample $x$ is additionally assigned a predicted uncertainty value $\sigma^{2}$ . Ideally, a high uncertainty value indicates a potential error in the model prediction, while a low value indicates the prediction to be likely correct. By filtering individual samples with high predicted uncertainties, it should, therefore, be possible to reduce the cumulative error $\Gamma^{g}$ among the remaining elements. For this purpose, we can define the absolute cumulative error

\Gamma^{g}(\xi)=g\left(\left\{|\hat{y}_{i}-y_{i}|\,:\,x_{i}\in\mathcal{X}^{% \circ}\;|\;\frac{\sigma^{2}_{i}}{\sigma^{2}_{max}}<\xi\right\}\right)

(11)

as a function of the relative uncertainty threshold $\xi\in[0,1]$ used for the filtering.

This definition of cumulative error faces two issues. Firstly, values of the cumulative error will strongly depend on the specific uncertainty threshold $\xi$ that was chosen. Secondly, the absolute error scales will be vastly different between different tasks and model performances, therefore not being comparable. Consequently, we propose the area under the uncertainty error reduction curve (UER-AUC) as a metric to assess the potential for uncertainty filtering-based error reduction that is comparable across different error scales. To compute the metric, we define the relative cumulative error reduction

\Delta\Gamma^{g}_{rel}(\xi)=\frac{\Gamma^{g}(1)-\Gamma^{g}(\xi)}{\max_{\xi}% \Gamma^{g}(\xi)}\;\in[0,1]

(12)

which is a value in the range $[0,1]$ , where 0 indicates no error reduction while 1 indicates a 100% error reduction. We finally define the $\text{UER-AUC}_{g}$ as the area under the curve of the relative error reduction $\Delta\Gamma^{g}_{rel}(\xi)$ as a function of the relative uncertainty threshold $\xi$ . Consequently, the proposed metric is independent of any specific threshold and comparable across different error ranges as both the uncertainty threshold $\xi$ and the relative error reduction $\Delta\Gamma^{g}_{rel}$ are normalized to the range $[0,1]$ .

In terms of accumulation functions $g$ , we primarily investigate the mean and the maximum, resulting in the two metrics $\text{UER-AUC}_{\text{mean}}$ and $\text{UER-AUC}_{\text{max}}$ . Figure 2a illustrates a simple intuition about these metrics: A perfect correlation between uncertainty and model error will result in a UER-AUC of 0.5. Likewise, a UER-AUC of 0 would be the result of a non-existent correlation between uncertainty and error.

4 Computational Experiments

Computational experiments are structured in two major parts: In the first part, we systematically investigate the general error reduction potential of uncertainty estimation methods for different graph neural network architectures, different uncertainty estimation methods, various out-of-distribution settings, and a range of different datasets. In the second part, we consider the use of uncertainty quantification methods in the context of counterfactual explanations and their effect on overall counterfactual truthfulness as previously defined in Section 3.

4.1 Uncertainty Quantification Methods and Metrics

4.1.1 Uncertainty Quantification Methods.

As part of the computational experiments, we compare the following uncertainty quantification methods.

Deep Ensembles (DE).

We train 3 separate models with bootstrap aggregation, whereby the training data is sampled with replacement. The overall prediction is subsequently obtained as the mean of the individual model outputs, while the standard deviation of the individual predictions is used as an estimate of the uncertainty.

Mean Variance Estimation (MVE).

The base model architecture is augmented to predict not only the target value $\hat{y}$ but also an uncertainty term $\sigma^{2}$ by adding additional fully connected layers to the final prediction network [28]. The training loss

\mathcal{L}_{\mathrm{MVE}}=\frac{1}{N}\sum_{i}^{N}\frac{\texttt{sg}(\sigma_{i}% ^{2\beta})}{2}\cdot\left(\frac{(y_{i}-\hat{y}_{i})^{2}}{\sigma_{i}^{2}}+\log(% \sigma_{i}^{2})\right)

(13)

is augmented to optimize both terms at the same time. We specifically integrate the modification proposed by Seitzer et al. [34], which scales the loss by an additional factor of $\sigma^{2\beta}$ but without contributing to the gradient. Furthermore, during training, we follow best practices described by Sluijterman et al. [36] by using gradient clipping and including an MSE warm-up period before switching to the MVE loss. By combining these measures, we substantially improve the performance degradation otherwise reported in the literature.

Ensemble of mean-variance estimators (DE+MVE).

We combine deep ensembles and mean-variance estimation by constructing an ensemble of 3 independent MVE models, each of which predicts an individual mean and standard deviation, as proposed by Busk et al. [6]. The total uncertainty

\sigma^{2}=\frac{1}{2}\left(\sigma^{2}_{\mathrm{DE}}+\bar{\sigma}^{2}_{\mathrm% {MVE}}\right)

(14)

is calculated as the average of the ensemble uncertainty and the mean MVE uncertainty.

Table 1: Test set results of 5 independent repetitions of computational experiments on the ClogP dataset regarding different model architectures and uncertainty quantification methods. Normal case numbers are the average result, and lower case gray numbers are the standard deviation. For each combination of model and UQ method, the best results are highlighted in bold, and the second-best results are underlined.

Model	UQ Method	$R^{2}\uparrow$	$\rho\uparrow$	$\underset{mean}{\text{UER-AUC}}\uparrow$	$\underset{max}{\text{UER-AUC}}\uparrow$	$\text{RLL}\uparrow$
—	Random	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.01{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.01{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.10{\color[rgb]{.5,.5,.5}\pm{0.10}}$	–
GCN	DE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.41{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$0.21{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$0.36{\color[rgb]{.5,.5,.5}\pm{0.18}}$	$0.75{\color[rgb]{.5,.5,.5}\pm{0.02}}$
	MVE	$0.99{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$0.20{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$\mathbf{0.63}{\color[rgb]{.5,.5,.5}\pm{0.18}}$	$\underline{0.77}{\color[rgb]{.5,.5,.5}\pm{0.03}}$
	DE+MVE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\mathbf{0.55}{\color[rgb]{.5,.5,.5}\pm{0.10}}$	$\mathbf{0.25}{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$\underline{0.58}{\color[rgb]{.5,.5,.5}\pm{0.20}}$	$\mathbf{0.78}{\color[rgb]{.5,.5,.5}\pm{0.00}}$
	SWAG	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\underline{0.50}{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$\underline{0.21}{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.50{\color[rgb]{.5,.5,.5}\pm{0.23}}$	$0.56{\color[rgb]{.5,.5,.5}\pm{0.11}}$
	TS_eucl.	$0.99{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.15{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$0.15{\color[rgb]{.5,.5,.5}\pm{0.09}}$	$0.43{\color[rgb]{.5,.5,.5}\pm{0.26}}$	$0.69{\color[rgb]{.5,.5,.5}\pm{0.10}}$
	TS_tanim.	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.15{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.11{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.12{\color[rgb]{.5,.5,.5}\pm{0.10}}$	$0.39{\color[rgb]{.5,.5,.5}\pm{0.33}}$
GATv2	DE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\underline{0.51}{\color[rgb]{.5,.5,.5}\pm{0.11}}$	${0.22}{\color[rgb]{.5,.5,.5}\pm{0.04}}$	${0.63}{\color[rgb]{.5,.5,.5}\pm{0.28}}$	$\underline{0.73}{\color[rgb]{.5,.5,.5}\pm{0.05}}$
	MVE	$0.98{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.48{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$\underline{0.28}{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$\underline{0.72}{\color[rgb]{.5,.5,.5}\pm{0.15}}$	$0.72{\color[rgb]{.5,.5,.5}\pm{0.08}}$
	DE+MVE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\mathbf{0.64}{\color[rgb]{.5,.5,.5}\pm{0.15}}$	$\mathbf{0.34}{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$\mathbf{0.75}{\color[rgb]{.5,.5,.5}\pm{0.09}}$	$\mathbf{0.82}{\color[rgb]{.5,.5,.5}\pm{0.03}}$
	SWAG	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.49{\color[rgb]{.5,.5,.5}\pm{0.16}}$	$0.21{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.61{\color[rgb]{.5,.5,.5}\pm{0.21}}$	$-0.06{\color[rgb]{.5,.5,.5}\pm{0.47}}$
	TS_eucl.	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.07{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.17{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$0.59{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.64{\color[rgb]{.5,.5,.5}\pm{0.01}}$
	TS_tanim.	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.20{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.13{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.10{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$0.59{\color[rgb]{.5,.5,.5}\pm{0.06}}$
GIN	DE	$0.99{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$\underline{0.62}{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$\underline{0.27}{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$\underline{0.70}{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$\mathbf{0.80}{\color[rgb]{.5,.5,.5}\pm{0.04}}$
	MVE	$0.99{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.48{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$0.22{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.56{\color[rgb]{.5,.5,.5}\pm{0.22}}$	$0.75{\color[rgb]{.5,.5,.5}\pm{0.05}}$
	DE+MVE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\mathbf{0.63}{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$\mathbf{0.29}{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$\mathbf{0.70}{\color[rgb]{.5,.5,.5}\pm{0.15}}$	$\underline{0.78}{\color[rgb]{.5,.5,.5}\pm{0.01}}$
	SWAG	$0.98{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.58{\color[rgb]{.5,.5,.5}\pm{0.20}}$	$0.23{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$0.58{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$0.02{\color[rgb]{.5,.5,.5}\pm{0.43}}$
	TS_eucl.	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.15{\color[rgb]{.5,.5,.5}\pm{0.12}}$	$0.17{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.22}}$	$0.64{\color[rgb]{.5,.5,.5}\pm{0.03}}$
	TS_tanim.	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.17{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$0.13{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.11{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$0.52{\color[rgb]{.5,.5,.5}\pm{0.12}}$

Stochastic Weight Averaging Gaussian (SWAG).

The training process is augmented to store snapshots of the model weights during the last 25 epochs. This history of model weights is then used to calculate a mean weight vector $\mathbf{\mu}_{\theta}$ and a covariance matrix $\mathbf{\Sigma}_{\theta}$ such that a new set of model weights can approximately be obtained by drawing from a gaussian distribution $\theta\sim\mathcal{N}(\mu_{\theta},\Sigma_{\theta})$ . During inference, we sample 50 distinct model weights from this distribution and obtain the target value prediction as the mean of the individual predictions and an uncertainty estimate as the standard deviation.

Trust Scores (TS).

Unlike the previously described UQ methods, trust scores are independent of the predictive model and provide an uncertainty estimate based directly on the training data [8]. Originally introduced for classification problems, the trust score for a given input element $x$ is calculated as the fraction

T=\frac{\mathrm{dist}(x,x_{s})}{\mathrm{dist}(x,x_{o})}

(15)

between the distances of the closest training element $x_{s}$ of the same class and the closest training element $x_{o}$ of a different class. We adapt this approach for regression tasks by using the distance to the closest element. This definition relies on the existence of a suitable distance metric $\mathrm{dist}(x_{i},x_{j})$ between two input elements. In this study, we examine two distance metrics for comparing input elements. The first is the Tanimoto distance, which is calculated as the Jaccard distance between two Morgan fingerprint representations of two molecules. The second is the Euclidean distance, which is measured between the graph embeddings generated by an intermediate layer of the graph neural network models.

Uncertainty Calibration.

After training, we apply uncertainty calibration to each UQ method to align the predicted uncertainties with the scale of the actual prediction errors. For this purpose, we use a held-out validation set containing 10% of the data to subsequently fit an isotonic regression model.

4.1.2 Uncertainty Quantification Metrics.

We evaluate the aforementioned UQ methods with the following metrics.

Uncertainty-Error Correlation $\rho$ .

The Pearson correlation coefficient between the absolute prediction errors $|\hat{y}-y|$ and the predicted uncertainties $\sigma^{2}$ on the elements of the test set.

Error Reduction Potential UER-AUC.

As described in Section 3, the UER-AUC is the area under the curve that maps relative error reduction to relative uncertainty thresholds. For each uncertainty threshold, all elements with higher predicted uncertainty are omitted from the test set. The relative error reduction describes the reduction of the cumulative error of the remaining elements relative to the full set.

Relative Log Likelihood RLL.

Following the work of Kellner and Ceriotti [14] we use the Relative Log Likelihood

\mathrm{RLL}=\frac{\sum_{i}\mathrm{NLL}(\hat{y}_{i}-y_{i},\sigma^{2}_{i})-% \mathrm{NLL}(\hat{y}_{i}-y_{i},\mathrm{RMSE})}{\sum_{i}\mathrm{NLL}(\hat{y}_{i% }-y_{i},|\hat{y}_{i}-y_{i}|)-NLL(\hat{y}_{i}-y_{i},\mathrm{RMSE})}

(16)

which standardizes the arbitrary range of the Negative Log Likelihood

\mathrm{NLL}(\Delta y,\sigma^{2})=\frac{1}{2}\left(\frac{\Delta y^{2}}{\sigma^% {2}}+\log 2\pi\sigma^{2}\right)

(17)

into a more interpretable range $(-\infty,1]$ .

4.2 Experiments on Error Reduction Potential

4.2.1 Impact of GNN Model and UQ Method on Error Reduction.

In this first experiment, we evaluate the impact of the model architecture and uncertainty quantification method on uncertainty-based error reduction. The experiment is based on the ClogP dataset, which consists of roughly 11k small molecules annotated with values of Crippen’s logP [47] calculated by RDKit [17]. This logP value is an algorithmically calculated and deterministic property—making it possible to near-perfectly regress it with machine learning models.

In terms of model choice, we compare three standard GNN architectures based on the GCN [15], GATv2 [5], and GIN [50] layer types, respectively. For each repetition of the experiment, we randomly choose 10% of the dataset as the test set, 10% as the calibration set and train the model on the remaining. Therefore, the test set can be considered IID w.r.t. to the training distribution.

Table 1 shows the results of the first experiment. A "Random" baseline, generating random uncertainty values, was included as a control. As expected, this baseline demonstrates negligible error reduction, reflecting the absence of correlation between assigned uncertainty and prediction error. In contrast, the remaining uncertainty quantification methods exhibit varying degrees of error reduction.

Using trust scores with the input-based Tanimoto distance yields substantially worse results than the embedding-based Euclidean distance. Contrary to the encouraging results of Delaney et al. [8], we believe trust scores underperform in this particular application due to the challenge of defining suitable distance metrics on graph-structured data [48].

Overall, we find deep ensembles, mean-variance estimation, and a combination thereof to work the best in terms of error reduction potential, as well as relative log likelihood. Out of these methods, we observe a slight advantage in mean error reduction for the combined ensemble and mean-variance estimation approach.

Moreover, regarding the different model architectures (GCN, GATv2, and GIN), we observe comparable results, both in terms of predictive performance ( $R^{2}\geq 0.99$ ) and in terms of uncertainty quantification methods. Based on these initial observations, model architecture appears to have a limited effect on the relative performance of the uncertainty quantification methods. Consequently, subsequent experiments were conducted using the GATv2 architecture, which exhibited the highest mean error reduction potential in this experiment.

4.2.2 Out-of-distribution Effect on Error Reduction.

The previous experiment examined the error reduction potential on a randomly sampled IID test set of the ClogP dataset. However, a critical aspect of counterfactual analysis involves identifying input perturbations that yield out-of-distribution (OOD) samples. To address this, we established two OOD scenarios for the ClogP dataset. The first, designated OOD-Struct, employs a scaffold split, where the test set comprises molecules with structural scaffolds absent from the training set. The second, OOD-Value, involves a split where the test set contains approximately the 10% most extreme target values, not represented in the training set. Due to the results of the previous experiments, for each scenario, we restrict experiments to use the GATv2 model architecture and compare uncertainty estimation based on ensembles, mean variance estimation, and the combination thereof.

Table 2: Test set results of 5 independent repetitions of computational experiments on the ClogP dataset regarding different out-of-distribution scenarios and uncertainty quantification methods. The best result for each scenario is highlighted in bold, and the second-best result is underlined. Results were obtained based on a GATv2 model architecture.

Scenario	UQ Method	$R^{2}\uparrow$	$\rho\uparrow$	$\underset{mean}{\text{UER-AUC}}\uparrow$	$\underset{max}{\text{UER-AUC}}\uparrow$	$\text{RLL}\uparrow$
OOD	DE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\underline{0.45}{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$\mathbf{0.23}{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$\mathbf{0.66}{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$\underline{0.34}{\color[rgb]{.5,.5,.5}\pm{0.06}}$
struct	MVE	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.32{\color[rgb]{.5,.5,.5}\pm{0.15}}$	$0.14{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$0.20{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$0.41{\color[rgb]{.5,.5,.5}\pm{0.03}}$
	DE+MVE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\mathbf{0.46}{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$\underline{0.21}{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$\underline{0.42}{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$\mathbf{0.55}{\color[rgb]{.5,.5,.5}\pm{0.00}}$
OOD	DE	$0.97{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$\underline{0.62}{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$\mathbf{0.71}{\color[rgb]{.5,.5,.5}\pm{0.10}}$	$\mathbf{0.82}{\color[rgb]{.5,.5,.5}\pm{0.07}}$	$\underline{-3.79}{\color[rgb]{.5,.5,.5}\pm{2.92}}$
value	MVE	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.50{\color[rgb]{.5,.5,.5}\pm{0.08}}$	$0.51{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$0.36{\color[rgb]{.5,.5,.5}\pm{0.27}}$	$-8.04{\color[rgb]{.5,.5,.5}\pm{9.72}}$
	DE+MVE	$0.98{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$\mathbf{0.66}{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$\underline{0.67}{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$\underline{0.77}{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$\mathbf{-1.49}{\color[rgb]{.5,.5,.5}\pm{1.09}}$

Table 2 reports the results of the second experiment. For the OOD-Struct scenario, we observe slightly worse results than for the IID case. All three methods show lower correlation, error reduction potential, and relative log likelihood.

Conversely, the OOD-Value scenario exhibited substantial performance gains relative to the IID case. Mean and max error reduction potential cross decisively exceed the $\text{UER-AUC}\geq 0.5$ threshold. Only the negative RLL values indicate poorly calibrated uncertainty estimates with respect to the actual prediction error. This is to be expected since the calibration set was sampled IID while the test set contains previously unseen target values—likely resulting in vastly different error scales.

When comparing the different UQ methods, the ensembles by themselves and the combination of ensembles and MVE seem to perform equally well. For both scenarios, OOD-struct and OOD-value, the ensembles seem to offer higher error reduction potential, while the combination seems to offer better calibrated uncertainty estimates, as indicated by the higher RLL values.

In summary, uncertainty-based filtering demonstrates a moderate error reduction effect on in-distribution data and structural outliers. Notably, the error reduction potential increases substantially under a distributional shift of the target values (see Figure 3). These results provide a foundation for filtering counterfactuals, where perturbations can be expected to create outliers with respect to both structure and target value.

Table 3: Test set results of 5 independent repetitions of computational experiments to evaluate uncertainty-based error reduction on various molecular property prediction datasets. The first row represents the previously introduced deterministic CLogP graph regression task, and the following rows represent various real-world molecular property regression datasets. Results are obtained by a GATv2 graph neural network and uncertainties are estimated by a method combining deep ensembles and mean-variance estimation.

Dataset	Property	$R^{2}\uparrow$	$\rho\uparrow$	$\underset{mean}{\text{UER-AUC}}\uparrow$	$\underset{max}{\text{UER-AUC}}\uparrow$	$\text{RLL}\uparrow$
ClogP	logP	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.58{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$0.27{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.66{\color[rgb]{.5,.5,.5}\pm{0.22}}$	$0.76{\color[rgb]{.5,.5,.5}\pm{0.02}}$
AqSolDB[37]	logS	$0.88{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.35{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.24{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.26{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.03}}$
Lipop[49]	logD	$0.74{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.15{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$0.10{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.22{\color[rgb]{.5,.5,.5}\pm{0.12}}$	$0.32{\color[rgb]{.5,.5,.5}\pm{0.02}}$
COMPAS[45]	rel. Ener.	$0.90{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.65{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.37{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.11}}$	$0.66{\color[rgb]{.5,.5,.5}\pm{0.03}}$
	GAP	$0.97{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.44{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.27{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.59{\color[rgb]{.5,.5,.5}\pm{0.20}}$	$0.71{\color[rgb]{.5,.5,.5}\pm{0.01}}$
QM9[31]	Dip. Mom.	$0.78{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.57{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.76{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.53{\color[rgb]{.5,.5,.5}\pm{0.00}}$
	HOMO	$0.93{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.54{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.23{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.61{\color[rgb]{.5,.5,.5}\pm{0.10}}$	$0.63{\color[rgb]{.5,.5,.5}\pm{0.01}}$
	LUMO	$0.99{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.48{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.23{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.67{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.73{\color[rgb]{.5,.5,.5}\pm{0.00}}$
	GAP	$0.97{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.52{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.25{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$0.76{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.68{\color[rgb]{.5,.5,.5}\pm{0.00}}$

4.2.3 Error Reduction on Real-World Datasets.

Previous experiments were based on the ClogP dataset, which is a deterministically computable property and, therefore, relatively easy to regress. To assess the generalizability of these findings to more complex scenarios, computational experiments were conducted on multiple properties derived from the AqSolDB [37], Lipop [49], COMPAS [45], and QM9 [31] datasets. Based on the results of previous experiments, we use the GATv2 model to predict each property and a combination of ensembles and mean-variance estimation for the uncertainty quantification.

Table 3 presents the results for the real-world property regression datasets. Despite varying levels of predictivity ( $R^{2}\in[0.74,0.99]$ ) for the different datasets, some degree of error reduction can be reported for each one ( $\text{UER-AUC}_{\mathrm{mean}}\in[0.10,0.45]$ ). Notably, the highest error reduction is found for the prediction of the Dipole Moment in the QM9 dataset with a mean error reduction of $\text{UER-AUC}_{\mathrm{mean}}=0.45$ and a max error reduction of $\text{UER-AUC}_{\mathrm{max}}=0.78$ . In contrast, the lowest error reduction can be observed for the prediction of the Lipophilicity with a mean error reduction of only $\text{UER-AUC}_{\mathrm{mean}}=0.10$ .

The extent of error reduction potential does not appear to correlate strongly with the predictive performance of the model, as both the highest and lowest error reductions were associated with models demonstrating lower predictivity ( $R^{2}\approx 0.7$ ). In addition, even models with high predictivity, such as the prediction of the LUMO value ( $R^{2}=0.99$ ), show moderate amounts of error reduction potential ( $\text{UER-AUC}_{\mathrm{mean}}=0.23$ , $\text{UER-AUC}_{\mathrm{max}}=0.67$ ). We hypothesize that the error reduction may be connected to the complexity of the underlying data distribution and the presence of labeling noise. The Lipophilicity dataset, for example, consists of inherently noisy experimental measurements while values for the dipole moment in the QM9 dataset were obtained by more precise DFT simulations.

Overall, the results of this experiment indicate that uncertainty threshold-based filtering can be used as an effective tool to decrease the overall prediction error even for complex properties, which may have been obtained through noisy measurements.

4.3 Experiments on Counterfactual Truthfulness

In the second part of the computational experiments, we investigate the potential of uncertainty-based filtering to improve the overall truthfulness of counterfactual explanations.

4.3.1 Improving Counterfactual Truthfulness.

For this experiment, we use the CLogP dataset, as the underlying property is deterministically calculable for any valid molecular graph. This availability of a ground truth oracle is necessary for the computation of the relative truthfulness as defined in Section 3.3. As before, we use the GATv2 model architecture and investigate the effectiveness of ensembles, mean-variance estimation, and the combination thereof. We split the dataset into a test set (10%), a calibration set (20%), and a train set (70%). All models are fitted with the train set, and uncertainty estimates are subsequently calibrated on the validation set. On the test set, we determine a single uncertainty threshold $\xi_{20}$ such that exactly the 20% elements with the lowest predicted uncertainties remain.

As described in Section 3.2, we generate counterfactual samples by ranking all graphs in a 1-edit neighborhood according to the prediction divergence and choosing the top 10 candidates. This set of counterfactuals is then filtered using the threshold $\xi_{20}$ and examined regarding its relative truthfulness.

The results of this experiment are reported in Table 4. A "Random" baseline was included as a control. As expected, this control’s randomly generated uncertainty values result neither in test set error reduction nor an increase of counterfactual truthfulness. All other uncertainty quantification (UQ) methods demonstrated a moderate potential for error reduction on both the test set and the set of counterfactuals ( $\text{UER-AUC}\geq 0.2$ ). Furthermore, all UQ methods exhibited some capability to increase relative truthfulness when filtering with the uncertainty threshold $\xi_{20}$ . However, it has to be noted that the initial truthfulness in the unfiltered set of counterfactuals is rather high (up to 95%), leaving little room for further improvement. It is important to note, however, that the initial truthfulness in the unfiltered set of counterfactuals was relatively high (up to 95%), limiting the scope for further improvement. Notably, the mean variance estimation model displayed a substantially lower initial truthfulness (0.77), most likely due to its slightly lower predictivity.

In addition to the results for the fixed uncertainty threshold $\xi_{20}$ , Figure 5 visualizes the progression of mean error reduction and truthfulness results across a range of possible uncertainty thresholds. The plots show that for increasingly strict uncertainty thresholds, the relative counterfactual truthfulness also increases near-monotonically, reaching 100% with a small subset of 5% remaining counterfactuals.

In summary, we find that all UQ methods exhibit some capacity to improve the relative counterfactual truthfulness through uncertainty-based filtering. However, the results may be influenced by the already elevated values observed for the unfiltered set. Future work should explore more complex property prediction tasks with lower predictive performance and, consequently, lower starting points of counterfactual truthfulness.

Table 4: Results of 5 independent repetitions of computational experiments on the ClogP dataset to evaluate counterfactual truthfulness using a fixed uncertainty threshold

\xi_{20}

determined on the test set. Results are obtained using a GATv2 graph neural network, and uncertainties are estimated by the combination of Deep Ensembles and MVE. ^†Tr. Init. represents the initial percentage of truthful counterfactuals in the unfiltered set of all counterfactuals. ^‡Tr. Gain. shows the increase in the relative percentage of truthful counterfactuals after filtering according to the uncertainty threshold

\xi_{20}

Method	Test Set		Counterfactuals
	$R^{2}\uparrow$	$\underset{\text{mean}}{\text{UER-AUC}}\uparrow$	$\rho\uparrow$	$\underset{\text{mean}}{\text{UER-AUC}}\uparrow$	$\underset{(\%)}{\text{Tr. Init.}^{\dagger}}\uparrow$	$\underset{(\%)}{\text{Tr. Gain}^{\ddagger}}\uparrow$
Random	$1.00{\color[rgb]{.5,.5,.5}\pm{0.01}}$	$-0.00{\color[rgb]{.5,.5,.5}\pm{0.06}}$	$-0.01{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$-0.02{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.95{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$-0.00{\color[rgb]{.5,.5,.5}\pm{0.05}}$
MVE	$0.98{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.26{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.53{\color[rgb]{.5,.5,.5}\pm{0.15}}$	$0.23{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.77{\color[rgb]{.5,.5,.5}\pm{0.17}}$	$0.09{\color[rgb]{.5,.5,.5}\pm{0.05}}$
DE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.27{\color[rgb]{.5,.5,.5}\pm{0.04}}$	$0.45{\color[rgb]{.5,.5,.5}\pm{0.12}}$	$0.20{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.94{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.04{\color[rgb]{.5,.5,.5}\pm{0.05}}$
DE+MVE	$1.00{\color[rgb]{.5,.5,.5}\pm{0.00}}$	$0.28{\color[rgb]{.5,.5,.5}\pm{0.03}}$	$0.44{\color[rgb]{.5,.5,.5}\pm{0.16}}$	$0.23{\color[rgb]{.5,.5,.5}\pm{0.05}}$	$0.95{\color[rgb]{.5,.5,.5}\pm{0.02}}$	$0.05{\color[rgb]{.5,.5,.5}\pm{0.02}}$

4.3.2 Qualitative Filtering Results.

Besides a quantitative evaluation of counterfactual truthfulness, some qualitative results of uncertainty-based filtering are illustrated in Figure 4 for two example molecules. Uncertainty estimates are obtained by an ensemble of mean variance estimators. As before, the uncertainty threshold $\xi_{20}$ is chosen such that only 20% of the test set elements with the lowest uncertainty values remain.

For the first molecule, benzoic acid, only the highest-ranked counterfactual candidate A is filtered due to exceeding the uncertainty threshold. This exclusion intuitively makes sense since the added bond between oxygen and nitrogen is an uncommon configuration that is not represented in the underlying dataset and can be considered an out-of-distribution input. In contrast to this expected behavior, the counterfactual candidate B is not filtered but represents an equally uncommon configuration with one carbon being connected to two single-bonded oxygen at the same time. Notably, the model also predicts a significantly incorrect value for this counterfactual candidate.

For the second molecule, aspirin, the four highest-ranked counterfactuals are filtered based on their predicted uncertainty. The exclusion of the counterfactual candidate E also intuitively makes sense since it includes the same uncommon bond between nitrogen and oxygen. Excluded counterfactual candidate D also contains a rather uncommon substructure but, more importantly, is predicted highly inaccurately by the model. In contrast to these cases, the highest ranked counterfactual candidate C is excluded even though the model’s prediction is highly accurate, serving as an example of overly conservative filtering.

Overall, the qualitative examples illustrate that the uncertainty-based filtering can be effective in identifying and removing out-of-distribution input samples and generally inaccurate predictions. However, there are also cases in which the method fails by either failing to filter OOD samples or by being too conservative and filtering perfectly accurate predictions.

5 Discussion

Previous work has investigated the intersection of uncertainty estimation and counterfactual explainability predominantly in the context of image classification [8, 33, 2]. Schut et al. [33], for example, include an ensemble-based uncertainty estimate as a direct objective in the optimization of counterfactual explanations. The authors find this intervention to reduce the likelihood of generating uninformative out-of-distribution samples—or in the words of Freiesleben [10] to steer the generation toward true counterfactual explanations rather than mere adversarial examples.

In our work, we present a distinct perspective to the existing literature, which differs in two important aspects: We focus on (1) regression tasks in the (2) graph processing domain. In image processing, good counterfactual explanations require the modification of multiple pixel values in a semantically meaningful way. This is framed as a non-trivial optimization objective requiring substantial computational effort. In the graph processing domain, however, the limited number of possible graph modifications makes it computationally feasible to search for counterfactual candidates among a full enumeration of all possible perturbations. Consequently, uncertainty quantification does not have to be included in the generation process itself but instead may serve as a simple filter over this set of possible perturbations. Nevertheless, the objective is the same: to use uncertainty quantification methods to present higher-quality counterfactual explanations to the user.

Another key factor is the difference between classification and regression tasks. While classification enables binary assessments of correct and incorrect predictions, regression operates on a continuous error scale, requiring different metrics to assess the impact of uncertainty estimation—motivating our definitions of the UER-AUC and the counterfactual truthfulness.

Consistent with existing literature, our work demonstrates that incorporating uncertainty estimation improves the quality of counterfactual explanations. We specifically find that filtering high-uncertainty elements decreases the average error of the remaining set and increases overall truthfulness—meaning the explanation’s alignment with the underlying ground truth data distribution.

In our experiments, we find no substantial differences in the relative effectiveness of UQ interventions between three common graph neural network architectures (GCN, GATv2, GIN). Regarding the choice of the uncertainty estimation method, we find trust scores [8, 33] to be ill-suited to graph processing applications, most likely due to the unavailability of suitable distance metrics. We furthermore come to similar conclusions as previous authors [32, 13, 6] in that the simple application of model ensembles already proves relatively effective. While we find a combination between ensembles and mean variance estimation to be slightly beneficial on IID data, there seems to be no substantial difference in OOD test scenarios.

However, in this context, it is still important to mention the remaining limitations of this approach grounded in the non-perfect correlation of the prediction errors and estimated uncertainties. While quantitative results show that uncertainty-based filtering has a higher relative likelihood to remove truly high-error samples, qualitative results indicate it still occasionally fails to detect some high-error samples and mistakenly filters valid elements. Depending on the severity of these failure cases, it will largely depend on the concrete application whether the increased truthfulness reasonably justifies the loss of some valid explanations.

6 Conclusion

Counterfactual explanations can deepen the understanding of a complex model’s predictive behavior by illustrating which kinds of local perturbations a model is especially sensitive to. In the scientific domain of chemistry and material science, these explanations are often not only desirable to understand the model’s behavior but by extension to understand the structure-property relationships of the underlying data itself. To use counterfactuals to gain insights about the underlying data, the explanations truthfully must reflect the properties thereof.

In this work, we explore the potential of uncertainty estimation to increase the overall truthfulness of a set of counterfactuals by filtering those elements with particularly high predicted uncertainty. We conduct extensive computational experiments with different methods to investigate the error-reduction potential of various uncertainty estimation methods in different settings. We find that model ensembles provide strong uncertainty estimates in out-of-distribution test scenarios, while a combination of ensembles and mean variance provide the highest error reduction potential for in-distribution settings.

Based on these initial results, we conclude that uncertainty estimation presents a promising opportunity to increase the truthfulness of explanations—to make sure explanations not only represent the model’s behavior but the properties of the underlying data as well. An interesting direction for future research will be to see if uncertainty estimation can be employed equally beneficially to different explanation modalities, such as local attributional and global concept-based explanations.

{credits}

6.0.1 Acknowledgements

This work was supported by funding from the pilot program Core-Informatics of the Helmholtz Association (HGF).

6.0.2 \discintname

The authors have no competing interests to declare that are relevant to the content of this article.

References

[1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, December 2021.
[2] Javier Antor’an, Umang Bhatt, T. Adel, Adrian Weller, and José Miguel Hernández-Lobato. Getting a CLUE: A Method for Explaining Uncertainty Estimates. International Conference on Learning Representations, 2020.
[3] Mohit Bajaj, Lingyang Chu, Zihui Xue, J. Pei, Lanjun Wang, P. C. Lam, and Yong Zhang. Robust Counterfactual Explanations on Graph Neural Networks. ArXiv, July 2021.
[4] Christopher M. Bishop. Bayesian Neural Networks. Journal of the Brazilian Computer Society, 4:61–68, July 1997.
[5] Shaked Brody, Uri Alon, and Eran Yahav. How Attentive are Graph Attention Networks? In International Conference on Learning Representations, February 2022.
[6] Jonas Busk, Peter Bjørn Jørgensen, Arghya Bhowmik, Mikkel N Schmidt, Ole Winther, and Tejs Vegge. Calibrated uncertainty for molecular property prediction using ensembles of message passing neural networks. Machine Learning: Science and Technology, 3(1):015012, December 2021.
[7] Francesco D’ Angelo and Vincent Fortuin. Repulsive Deep Ensembles are Bayesian. In Advances in Neural Information Processing Systems, volume 34, pages 3451–3465. Curran Associates, Inc., 2021.
[8] Eoin Delaney, Derek Greene, and Mark T. Keane. Uncertainty Estimation and Out-of-Distribution Detection for Counterfactual Explanations: Pitfalls and Solutions, July 2021.
[9] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat], March 2017.
[10] Timo Freiesleben. The Intriguing Relation Between Counterfactual Explanations and Adversarial Examples. Minds and Machines, 32(1):77–109, March 2022.
[11] Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, Muhammad Shahzad, Wen Yang, Richard Bamler, and Xiao Xiang Zhu. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(1):1513–1589, October 2023.
[12] Ethan Goan and Clinton Fookes. Bayesian Neural Networks: An Introduction and Survey. In Kerrie L. Mengersen, Pierre Pudlo, and Christian P. Robert, editors, Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall 2018, pages 45–87. Springer International Publishing, Cham, 2020.
[13] Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, and Connor W. Coley. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. Journal of Chemical Information and Modeling, 60(8):3770–3780, August 2020.
[14] Matthias Kellner and Michele Ceriotti. Uncertainty quantification by direct propagation of shallow ensembles. Machine Learning: Science and Technology, 5(3):035006, July 2024.
[15] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations. arXiv, 2017.
[16] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[17] G Landrum. RDKit: Open-source cheminformatics, 2010.
[18] D. Ley, Umang Bhatt, and Adrian Weller. $\delta$ -CLUE: Diverse Sets of Explanations for Uncertainty Estimates. ArXiv, April 2021.
[19] Dan Ley, Umang Bhatt, and Adrian Weller. Diverse, Global and Amortised Counterfactual Explanations for Uncertainty Estimates. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7):7390–7398, June 2022.
[20] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and Accelerating Particle-Based Variational Inference. In International Conference on Machine Learning, July 2018.
[21] Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
[22] Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, Richard Jiang, Hassan Khosravi, Freddy Lecue, Gianclaudio Malgieri, Andrés Páez, Wojciech Samek, Johannes Schneider, Timo Speith, and Simone Stumpf. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion, 106:102301, June 2024.
[23] Ana Lucic, Maartje ter Hoeve, Gabriele Tolomei, M. de Rijke, and F. Silvestri. CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks. In International Conference on Artificial Intelligence and Statistics, February 2021.
[24] Phillip M. Maffettone, Pascal Friederich, Sterling G. Baird, Ben Blaiszik, Keith A. Brown, Stuart I. Campbell, Orion A. Cohen, Rebecca L. Davis, Ian T. Foster, Navid Haghmoradi, Mark Hereld, Howie Joress, Nicole Jung, Ha-Kyung Kwon, Gabriella Pizzuto, Jacob Rintamaki, Casper Steinmann, Luca Torresi, and Shijing Sun. What is missing in autonomous discovery: Open challenges for the community. Digital Discovery, 2(6):1644–1659, 2023.
[25] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
[26] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, February 2019.
[27] Tri Minh Nguyen, Thomas P Quinn, Thin Nguyen, and Truyen Tran. Explaining Black Box Drug Target Prediction Through Model Agnostic Counterfactual Samples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(2):1020–1029, March 2023.
[28] D.A. Nix and A.S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 1, pages 55–60 vol.1, June 1994.
[29] Danilo Numeroso and Davide Bacciu. MEG: Generating Molecular Counterfactual Explanations for Deep Graph Networks. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2021.
[30] Mario Alfonso Prado-Romero, Bardh Prenkaj, Giovanni Stilo, and Fosca Giannotti. A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges. ACM Comput. Surv., 56(7):171:1–171:37, April 2024.
[31] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):140022, August 2014.
[32] Gabriele Scalia, Colin A. Grambow, Barbara Pernici, Yi-Pei Li, and William H. Green. Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction. Journal of Chemical Information and Modeling, 60(6):2697–2717, June 2020.
[33] L. Schut, Oscar Key, R. McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, and Y. Gal. Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties. In International Conference on Artificial Intelligence and Statistics, March 2021.
[34] Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. In International Conference on Learning Representations. arXiv, 2022.
[35] Dominik Seuss. Bridging the Gap Between Explainable AI and Uncertainty Quantification to Enhance Trustability. ArXiv, May 2021.
[36] Laurens Sluijterman, Eric Cator, and Tom Heskes. Optimal training of Mean Variance Estimation neural networks. Neurocomputing, 597:127929, September 2024.
[37] Murat Cihan Sorkun, Abhishek Khetan, and Süleyman Er. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data, 6(1):143, August 2019.
[38] Hunter Sturm, Jonas Teufel, Kaitlin A. Isfeld, Pascal Friederich, and Rebecca L. Davis. Mitigating Molecular Aggregation in Drug Discovery with Predictive Insights from Explainable AI. 2023.
[39] Juntao Tan, Shijie Geng, Zuohui Fu, Yingqiang Ge, Shuyuan Xu, Yunqi Li, and Yongfeng Zhang. Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning. In Proceedings of the ACM Web Conference 2022, pages 1018–1027, April 2022.
[40] Jonas Teufel and Pascal Friederich. Global Concept Explanations for Graphs by Contrastive Learning. In Luca Longo, Sebastian Lapuschkin, and Christin Seifert, editors, Explainable Artificial Intelligence, pages 184–208, Cham, 2024. Springer Nature Switzerland.
[41] Jonas Teufel, Luca Torresi, Patrick Reiser, and Pascal Friederich. MEGAN: Multi-explanation Graph Attention Network. In Luca Longo, editor, Explainable Artificial Intelligence, pages 338–360, Cham, 2023. Springer Nature Switzerland.
[42] Tishby, Levin, and Solla. Consistent inference of probabilities in layered networks: Predictions and generalizations. In International 1989 Joint Conference on Neural Networks, pages 403–409 vol.2, June 1989.
[43] Trung Trinh, Markus Heinonen, Luigi Acerbi, and Samuel Kaski. Input-gradient space particle inference for neural network ensembles. In The Twelfth International Conference on Learning Representations, October 2023.
[44] Sahil Verma, Varich Boonsanong, Minh Hoang, Keegan Hines, John Dickerson, and Chirag Shah. Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A Review. ACM Comput. Surv., 56(12):312:1–312:42, October 2024.
[45] Alexandra Wahab, Lara Pfuderer, Eno Paenurk, and Renana Gershoni-Poranne. The COMPAS Project: A Computational Database of Polycyclic Aromatic Systems. Phase 1: Cata-Condensed Polybenzenoid Hydrocarbons. Journal of Chemical Information and Modeling, 62(16):3704–3713, August 2022.
[46] Geemi P. Wellawatte, Aditi Seshadri, and Andrew D. White. Model agnostic generation of counterfactual explanations for molecules. Chemical Science, 13(13):3697–3705, 2022.
[47] Scott A. Wildman and Gordon M. Crippen. Prediction of Physicochemical Parameters by Atomic Contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873, September 1999.
[48] Peter Wills and François G. Meyer. Metrics for graph comparison: A practitioner’s guide. PLoS ONE, 15(2):e0228728, February 2020.
[49] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, January 2018.
[50] Keyulu Xu*, Weihua Hu*, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, September 2018.
[51] Carlos Zednik and Hannes Boelsen. Scientific Exploration and Explainable Artificial Intelligence. Minds and Machines, 32(1):219–239, March 2022.