1 Institute of Theoretical Informatics, Karlsruhe Institute of Technology,
Kaiserstr. 12, 76131 Karlsruhe, Germany
[email protected]
2 Institute of Nanotechnology, Karlsruhe Institute of Technology,
Kaiserstr. 12, 76131 Karlsruhe, Germany
[email protected]

Improving Counterfactual Truthfulness for Molecular Property Prediction through Uncertainty Quantification

Jonas Teufel (1) [0000-0002-9228-9395], Annika Leinweber (1), Pascal Friederich (1,2) [0000-0003-4465-1465]
Abstract

Explainable AI (xAI) interventions aim to improve interpretability for complex black-box models, not only to improve user trust but also as a means to extract scientific insights from high-performing predictive systems. In molecular property prediction, counterfactual explanations offer a way to understand predictive behavior by highlighting which minimal perturbations in the input molecular structure cause the greatest deviation in the predicted property. However, such explanations only allow for meaningful scientific insights if they reflect the distribution of the true underlying property—a feature we define as counterfactual truthfulness. To enhance truthfulness, we propose the integration of uncertainty estimation techniques to filter high-uncertainty counterfactuals. Through computational experiments with synthetic and real-world datasets, we demonstrate that combining traditional deep ensembles and mean variance estimation can substantially reduce average and maximum model error in out-of-distribution settings and, in particular, increase counterfactual truthfulness. Our results highlight the importance of incorporating uncertainty estimation into counterfactual explainability, especially considering the relative effectiveness of low-effort strategies such as model ensembles.

Keywords: Counterfactual Explanations · Truthfulness · Graph Neural Networks · Uncertainty Estimation · Molecular Property Prediction

1 Introduction

Figure 1: (a) Truthful explanations should not only reflect the model’s behavior but also the properties of the underlying true data distribution. (b) Uncertainty quantification methods predict an additional uncertainty value as an approximation of the model’s prediction error. By filtering high-uncertainty elements, it is possible to reduce the cumulative error and, by extension, increase the fraction of truthful counterfactuals in the remaining set.

Recent advances in the study of artificial intelligence (AI) have revolutionized various branches of society, industry, and science. Despite their numerous advantages, the opaque black-box nature of modern AI methods remains a challenge. Although complex neural network models often display superior predictive performance, their inner workings remain largely inscrutable to humans. Explainable AI (xAI) aims to address these shortcomings by developing methods to better understand the inner workings of these complex models.

Traditionally, xAI methods are meant to improve trust in human-AI relationships, provide tools for model debugging, and ensure regulatory compliance [9]. More recently, xAI has been proposed as a potential source of new scientific insight [51, 24, 41, 40]. This potential of gaining new insights primarily concerns tasks about which little to no prior human knowledge exists. By elucidating the behavior of high-performing models in complex property prediction tasks, xAI can offer insights not only into the model’s behavior but, by extension, into the underlying rules and relationships governing the data itself. However, such explanations only yield meaningful insights if they accurately reflect the true underlying data distribution. This imposes a more stringent requirement for explanations: they must be valid not only in terms of the model’s behavior but also with respect to the predicted property itself.

In this work, we explore counterfactual explainability within chemistry and materials science—a domain where insights derived from xAI could substantially accelerate scientific discovery. In short, counterfactual explanations locally explain the model’s behavior by constructing multiple "what if?" scenarios of minimally perturbed input configurations that cause large deviations in the model’s prediction. By itself, a counterfactual only has to explain the model’s behavior, regardless of whether that behavior reflects the underlying property, causing significant conceptual overlap between counterfactuals and adversarial examples [10]. As an extension, we define a truthful counterfactual as one that satisfies constraints regarding both the model and the underlying ground truth—causing a large deviation of the prediction while maintaining a low prediction error (see Figure 1).

Given the general unavailability of ground truth labels for counterfactual samples, we propose uncertainty quantification methods as a means to approximate the prediction error and ultimately improve overall counterfactual truthfulness by filtering high-uncertainty explanations. We empirically investigate various common methods of uncertainty quantification and find that an ensemble of mean-variance estimators (MVE) yields the greatest reduction in relative model error and can substantially improve counterfactual truthfulness. Qualitative results affirm these findings, showing that uncertainty-based filtering removes unlikely molecular configurations that lie outside the training distribution. Our results underscore the potential benefits of integrating uncertainty estimation into explainability methods, such as counterfactual explanations.

2 Related Work

2.0.1 Graph Counterfactual Explanations.

Insights from social science indicate that humans prefer explanations to be contrastive—to explain why something happened instead of something else [26]. Counterfactuals aim to provide such contrastive explanations by constructing hypothetical "what if?" scenarios to show which small perturbations to a given input sample would have resulted in a significant deviation from the original prediction outcome.

While Verma et al. [44] present an extensive general review on the topic of counterfactual explanations across different data modalities, Prado-Romero et al. [30] specifically survey counterfactual explanations in the graph processing domain. The authors find that the existing approaches can be categorized by which kinds of perturbations to the input graph are considered. Many existing methods create perturbations using masking strategies on the node, edge, or feature level, in which masks are optimized to maximize output deviations [23, 3, 39]. However, masking-based strategies often yield uninformative explanations for molecular property prediction. In this context, it is more insightful to perturb the molecular graph by adding or removing bonds and atoms [38]. Some authors successfully adopt such approaches for molecular property prediction [46, 27, 29]. One particular difficulty for these kinds of approaches is the necessity of including domain knowledge to ensure that modifications result in a valid graph structure (e.g. chemically feasible molecules). In one example, Numeroso and Bacciu [29] train an external reinforcement learning agent to propose suitable graph modifications that serve as counterfactual candidates for molecular property prediction. In this case, the authors also introduce domain knowledge by restricting the action space of the agent to chemically feasible modifications.

2.0.2 Uncertainty Quantification.

Predictive machine learning models often encounter uncertainty from various sources, including, for example, inherent measurement noise (aleatoric uncertainty) or regions of the input space insufficiently covered in the training distribution (epistemic uncertainty). Consequently, a model’s predictions may be more accurate for some input samples than for others. Uncertainty quantification methods aim to measure this variability, identifying those samples that a model can predict with greater confidence [11, 1].

Similarly to the broader field of xAI, one aim of uncertainty quantification is to improve user trust by indicating the reliability of a prediction [35]. Beyond uncertainty quantification for target predictions, Longo et al. [22] propose to introduce elements of uncertainty estimation on the explanation level as well.

Traditionally used methods for uncertainty quantification include the joint prediction of a distribution’s mean and variance (MVE) [28], assessing the variance between the predictions of a Deep Ensemble [16], and using Bayesian neural networks (BNNs) [42, 4, 12], which aim to directly predict an output distribution rather than individual values. More recent alternatives include Stochastic Weight Averaging Gaussians (SWAG) [25] and Repulsive Ensembles [7, 43], an extension of Deep Ensembles built on the general framework of particle-based variational inference (ParVI) [21, 20] that introduces explicit member diversification.

In the domain of molecular property prediction, Hirschfeld et al. [13] and Scalia et al. [32] independently investigate the performance of various traditional uncertainty quantification methods across many standard property prediction datasets. Busk et al. [6] specifically investigate uncertainty quantification using an ensemble of graph neural networks.

2.0.3 Uncertainty Quantification and Counterfactuals.

Using xAI to gain new insights into the underlying properties of the data distribution requires the given explanations to be truthful regarding the true property values. In the same context, Freiesleben [10] addresses the conceptual distinction between counterfactual explanations and adversarial examples. The author argues that, although both are essentially based on the same optimization objective, adversarial examples necessitate a misprediction, while counterfactual explanations should be different—yet still correct.

While uncertainty quantification in the context of counterfactual explanations remains largely unexplored, Delaney et al. [8] use uncertainty quantification methods as a possible measure to increase counterfactual reliability for image classification tasks. In terms of UQ interventions, the authors explore Trust Scores and Monte Carlo dropout, finding Trust Scores to be an effective measure. Schut et al. [33] propose the direct optimization of an ensemble-based uncertainty measure as a secondary objective for the generation of realistic counterfactuals in image classification. In another work, Antorán et al. [2] introduce Counterfactual Latent Uncertainty Explanations (CLUE), which is subsequently extended to $\delta$-CLUE [18] and GLAM-CLUE [19]. Instead of employing uncertainty quantification to improve counterfactual explainability, CLUE aims to use counterfactual explanations to explain uncertainty estimates in probabilistic models—effectively explaining why certain inputs are more uncertain than others.

3 Method

In this work, we explore the generation of counterfactual samples $x'$ for molecular property prediction tasks, whereby a graph neural network model is trained to regress a continuous property $y$ of a given molecular graph $x$. To gain meaningful insights into the underlying property, we specifically focus on the generation of truthful counterfactuals, which maximize the prediction difference $|\hat{y} - \hat{y}'|$ between the original prediction $\hat{y}$ and the counterfactual prediction $\hat{y}'$ while maintaining a minimal ground truth error $|y' - \hat{y}'|$.

3.1 Graph Neural Network Regressors

We represent each molecule as a generic graph structure $x = (\mathcal{N}, \mathcal{E}, \mathbf{V}^{(0)}, \mathbf{U}^{(0)}) \in \mathcal{X}$, defined by a set of $N$ node indices $\mathcal{N} = \{1, \dots, N\}$ and a list of $E$ edge tuples $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$, where a tuple $(i,j) \in \mathcal{E}$ indicates an edge between nodes $i$ and $j$. The nodes of this graph structure represent the atoms of the molecule, and the edges represent the chemical bonds between the atoms. Furthermore, each graph structure consists of an initial node feature tensor $\mathbf{V}^{(0)} \in \mathbb{R}^{N \times V}$ and an initial edge feature tensor $\mathbf{U}^{(0)} \in \mathbb{R}^{E \times U}$.

In the case of molecular graphs, the node features contain a one-hot encoding of the atom type, the atomic weight, and the charge, whereas the edge features contain a one-hot encoding of the bond type. For a given dataset of molecules annotated with continuous target values $y \in \mathbb{R}$, the aim is to train a graph neural network regressor

$$f_{\theta}: \mathcal{X} \rightarrow \mathbb{R}; \quad (\mathcal{N}, \mathcal{E}, \mathbf{V}^{(0)}, \mathbf{U}^{(0)}) \mapsto \hat{y} \qquad (1)$$

with learnable parameters $\theta$ to find an optimal set of parameters

$$\theta^{\ast} = \arg\min_{\theta} \sum_{x \in \mathcal{X}} (y - f_{\theta}(x))^2 \qquad (2)$$

that minimizes the mean-squared error between the predicted value $\hat{y}$ and the target value $y$.
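To make this setup concrete, the following is a minimal sketch of such a regressor using PyTorch Geometric. The hidden size, depth, and sum-pooling readout are illustrative assumptions rather than the exact architecture used in the experiments.

```python
# Minimal GNN regressor sketch (PyTorch Geometric). Hidden size, depth, and
# readout are illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_add_pool

class GCNRegressor(torch.nn.Module):
    def __init__(self, node_dim: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(node_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))   # message passing over bonds
        h = F.relu(self.conv2(h, edge_index))
        h = global_add_pool(h, batch)           # node embeddings -> graph embedding
        return self.head(h).squeeze(-1)         # graph embedding -> y_hat

# Training minimizes the mean-squared error of Eq. (2), e.g.:
# loss = F.mse_loss(model(batch.x, batch.edge_index, batch.batch), batch.y)
```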

Figure 2: (a) Evaluating the uncertainty-based error reduction over many different thresholds yields characteristic error reduction curves. The area under the uncertainty error reduction curve (UER-AUC) provides a generic metric for the error reduction potential, independent of a specific threshold choice. (b) Truthful counterfactuals are defined as those whose prediction error interval does not overlap with that of their corresponding original element. Besides a reduction of the cumulative error, filtering by uncertainty thresholds may also increase the relative fraction of truthful counterfactuals.

3.2 Molecular Counterfactual Generation

Counterfactual explanations map a model’s local decision boundary by producing a set of minimally perturbed input instances that induce maximal predictive divergence, thereby revealing which kinds of modifications the model is especially sensitive toward.

Given the combination $(x, \hat{y})$ of an original input element $x$ and its corresponding model prediction $\hat{y}$, a counterfactual sample $(x', \hat{y}')$ consists of an input sample $x'$ which is minimally

$$\min_{x'} \; \text{dist}(x', x) \qquad (3)$$

different from the original input. At the same time, these minimally perturbed input samples should cause a large deviation

$$\max_{x'} \; \text{dist}(\hat{y}', \hat{y}) \qquad (4)$$

in the model’s prediction.

We generate counterfactual samples according to the given constraints by adopting a procedure similar to that presented by Numeroso and Bacciu [29]. However, we omit the training of a reinforcement learning agent to induce the local changes of the molecular structure and opt for a complete enumeration of the entire $k$-edit neighborhood instead. Due to the limited number of chemically valid modifications and the relatively small size of molecular graphs, we find it computationally feasible to generate all possible modifications of a given input molecule $x$. As possible modifications, we consider the addition, deletion, and substitution of individual atoms and bonds that satisfy the constraints of atomic valence. Subsequently, the predictive model $f_{\theta}$ is used to obtain the predicted values for all perturbed graph structures. The structures are then ranked according to the absolute prediction difference

$$\text{dist}(\hat{y}', \hat{y}) = |\hat{y}' - \hat{y}| \qquad (5)$$

regarding the original prediction $\hat{y}$. We finally choose the 10 elements with the highest prediction difference to be presented as counterfactual explanations.
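As an illustration of this procedure, the following sketch enumerates a 1-edit neighborhood restricted to atom substitutions, with RDKit sanitization serving as the chemical validity check. The edit alphabet and the `predict` wrapper around the trained model are hypothetical placeholders; atom and bond additions and deletions would follow the same pattern.

```python
# Sketch of 1-edit counterfactual enumeration (atom substitutions only).
# ATOM_TYPES and `predict` (a wrapper around f_theta) are assumed placeholders.
from rdkit import Chem

ATOM_TYPES = ["C", "N", "O", "F", "S", "Cl"]

def one_edit_neighborhood(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    neighbors = set()
    for idx in range(mol.GetNumAtoms()):
        for symbol in ATOM_TYPES:
            edited = Chem.RWMol(mol)
            edited.GetAtomWithIdx(idx).SetAtomicNum(
                Chem.GetPeriodicTable().GetAtomicNumber(symbol))
            try:
                Chem.SanitizeMol(edited)  # rejects valence-violating edits
            except Exception:
                continue
            neighbors.add(Chem.MolToSmiles(edited))
    neighbors.discard(Chem.MolToSmiles(mol))
    return sorted(neighbors)

def top_counterfactuals(smiles: str, predict, k: int = 10) -> list[str]:
    y_hat = predict(smiles)
    candidates = one_edit_neighborhood(smiles)
    # rank by the absolute prediction difference of Eq. (5)
    candidates.sort(key=lambda s: abs(predict(s) - y_hat), reverse=True)
    return candidates[:k]
```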

At this point, it is worth noting that other possible variations of choosing counterfactual explanations exist. Instead of using the criterion of absolute distance, depending on the use case, it might make sense to select counterfactuals only among those samples with strictly higher or strictly lower predicted values.

3.3 Counterfactual Truthfulness

A counterfactual explanation $x'$ has to be a minimal perturbation of the original sample $x$ while causing a large deviation $|\hat{y} - \hat{y}'|$ in the model’s prediction. To gain meaningful insight from such counterfactual explanations and to distinguish them from mere adversarial examples [10], we impose the additional restriction of truthfulness. We define a truthful counterfactual as one that additionally maintains a low error $|y' - \hat{y}'|$ with respect to its ground truth label $y'$.

For classification problems, a truthful counterfactual would not only flip the predicted label but would also correctly belong to that new label. For the regression case, there may exist various equally useful definitions of counterfactual truthfulness. In this context, we define a regression counterfactual as truthful if its ground truth error interval does not overlap with the error interval of the original prediction (see Figure 2). This definition ensures that at least some of the predicted divergence is real and points in the predicted direction.

For a given original sample, its absolute ground truth error

$$\epsilon = |y - \hat{y}| \qquad (6)$$

is calculated as the absolute difference of the true value $y$ and the predicted value $\hat{y}$. The ground truth error

$$\epsilon' = |y' - \hat{y}'| \qquad (7)$$

of a counterfactual sample $x'$ can be calculated accordingly. We subsequently define the truthfulness

$$\mathrm{tr}(x') = \begin{cases} 1 & [y' - \epsilon',\, y' + \epsilon'] \cap [y - \epsilon,\, y + \epsilon] = \emptyset \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

of an individual counterfactual as a binary property which is fulfilled if its ground truth error interval does not overlap with the error interval of the original sample.

Beyond the truthfulness of individual counterfactual samples, we are primarily interested in the average truthfulness across a whole set $\mathcal{X}' \subset \mathcal{X}$ of counterfactuals. We, therefore, define the relative truthfulness

$$\mathrm{Tr}(\mathcal{X}') = \frac{1}{|\mathcal{X}'|} \sum_{x' \in \mathcal{X}'} \mathrm{tr}(x') \;\in\; [0, 1] \qquad (9)$$

for a set $\mathcal{X}'$ of counterfactuals as the fraction of individual truthful counterfactuals it contains.
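A minimal sketch of Eqs. (6)–(9), assuming the true and predicted values of each original/counterfactual pair are available (e.g. from a ground truth oracle):

```python
# Sketch of the truthfulness check (Eq. 8) and relative truthfulness (Eq. 9).
def is_truthful(y: float, y_hat: float, y_cf: float, y_cf_hat: float) -> bool:
    eps = abs(y - y_hat)            # error interval of the original (Eq. 6)
    eps_cf = abs(y_cf - y_cf_hat)   # error interval of the counterfactual (Eq. 7)
    # truthful iff [y_cf - eps_cf, y_cf + eps_cf] and [y - eps, y + eps] are disjoint
    return (y_cf + eps_cf < y - eps) or (y_cf - eps_cf > y + eps)

def relative_truthfulness(pairs) -> float:
    # `pairs` is an iterable of (y, y_hat, y_cf, y_cf_hat) tuples
    flags = [is_truthful(*p) for p in pairs]
    return sum(flags) / len(flags)
```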

At this point, it should be noted that evaluating counterfactual truthfulness proves difficult. Since the generated counterfactual samples are generally not contained in existing datasets, evaluating truthfulness requires not merely ground truth labels but a ground truth oracle. Consequently, truthfulness can only be evaluated for the small selection of tasks for which such an oracle exists.

3.4 Error Reduction through Uncertainty Thresholding

For the given definition of truthfulness, one viable method of improving the relative truthfulness $\mathrm{Tr}(\mathcal{X}')$ is to filter out counterfactuals with especially large error intervals. Since it is generally impossible to infer the true label, and by extension the truthfulness, of a given input $x$ in practice, an alternative is to approximate the ground truth error by means of uncertainty quantification (UQ). If the predicted uncertainty proves to be a suitable approximation of the true error, filtering high-uncertainty counterfactuals should have the same effect of improving the relative truthfulness.

This objective can be framed as an overall reduction of the cumulative error

$$\Gamma^{g} = g\left(\left\{\, |\hat{y}_i - y_i| \,:\, x_i \in \mathcal{X}^{\circ} \,\right\}\right) \qquad (10)$$

for a given set $\mathcal{X}^{\circ} \subset \mathcal{X}$ of input elements, where $g(\cdot)$ is some function that accumulates individual error values (e.g. mean, median, max).

In the context of uncertainty quantification, each sample $x$ is additionally assigned a predicted uncertainty value $\sigma^2$. Ideally, a high uncertainty value indicates a potential error in the model prediction, while a low value indicates that the prediction is likely correct. By filtering individual samples with high predicted uncertainties, it should, therefore, be possible to reduce the cumulative error $\Gamma^{g}$ among the remaining elements. For this purpose, we can define the absolute cumulative error

$$\Gamma^{g}(\xi) = g\left(\left\{\, |\hat{y}_i - y_i| \,:\, x_i \in \mathcal{X}^{\circ},\; \frac{\sigma^2_i}{\sigma^2_{\max}} < \xi \,\right\}\right) \qquad (11)$$

as a function of the relative uncertainty threshold $\xi \in [0, 1]$ used for the filtering.

This definition of the cumulative error faces two issues. Firstly, values of the cumulative error will strongly depend on the specific uncertainty threshold $\xi$ that was chosen. Secondly, the absolute error scales will be vastly different between different tasks and model performances, and are therefore not comparable. Consequently, we propose the area under the uncertainty error reduction curve (UER-AUC) as a metric to assess the potential for uncertainty-filtering-based error reduction that is comparable across different error scales. To compute the metric, we define the relative cumulative error reduction

$$\Delta\Gamma^{g}_{\mathrm{rel}}(\xi) = \frac{\Gamma^{g}(1) - \Gamma^{g}(\xi)}{\max_{\xi} \Gamma^{g}(\xi)} \;\in\; [0, 1] \qquad (12)$$

which is a value in the range $[0, 1]$, where 0 indicates no error reduction and 1 indicates a 100% error reduction. We finally define the $\text{UER-AUC}_{g}$ as the area under the curve of the relative error reduction $\Delta\Gamma^{g}_{\mathrm{rel}}(\xi)$ as a function of the relative uncertainty threshold $\xi$. Consequently, the proposed metric is independent of any specific threshold and comparable across different error ranges, as both the uncertainty threshold $\xi$ and the relative error reduction $\Delta\Gamma^{g}_{\mathrm{rel}}$ are normalized to the range $[0, 1]$.

In terms of accumulation functions $g$, we primarily investigate the mean and the maximum, resulting in the two metrics $\text{UER-AUC}_{\text{mean}}$ and $\text{UER-AUC}_{\text{max}}$. Figure 2a illustrates a simple intuition for these metrics: a perfect correlation between uncertainty and model error will result in a UER-AUC of 0.5, whereas a UER-AUC of 0 results from a non-existent correlation between uncertainty and error.
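A minimal sketch of the UER-AUC computation, assuming arrays of absolute test errors and their predicted uncertainties; the resolution of the threshold grid and the non-strict comparison at the threshold are implementation choices:

```python
# Sketch of the UER-AUC: sweep relative uncertainty thresholds, compute the
# relative error reduction of Eq. (12), and integrate over the curve.
import numpy as np

def uer_auc(errors, uncertainties, g=np.mean, num_thresholds: int = 100) -> float:
    errors = np.asarray(errors, dtype=float)
    rel_unc = np.asarray(uncertainties, dtype=float) / np.max(uncertainties)
    xis = np.linspace(0.0, 1.0, num_thresholds)
    # cumulative error of the elements remaining below each threshold (Eq. 11)
    gammas = np.array([g(errors[rel_unc <= xi]) if np.any(rel_unc <= xi) else 0.0
                       for xi in xis])
    reduction = (gammas[-1] - gammas) / np.max(gammas)  # Eq. (12)
    return float(np.trapz(reduction, xis))

# g=np.mean yields UER-AUC_mean, g=np.max yields UER-AUC_max.
```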

4 Computational Experiments

Computational experiments are structured in two major parts: In the first part, we systematically investigate the general error reduction potential of uncertainty estimation methods for different graph neural network architectures, different uncertainty estimation methods, various out-of-distribution settings, and a range of different datasets. In the second part, we consider the use of uncertainty quantification methods in the context of counterfactual explanations and their effect on overall counterfactual truthfulness as previously defined in Section 3.

4.1 Uncertainty Quantification Methods and Metrics

4.1.1 Uncertainty Quantification Methods.

As part of the computational experiments, we compare the following uncertainty quantification methods.

Deep Ensembles (DE).

We train 3 separate models with bootstrap aggregation, whereby the training data is sampled with replacement. The overall prediction is subsequently obtained as the mean of the individual model outputs, while the standard deviation of the individual predictions is used as an estimate of the uncertainty.
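A minimal sketch of this scheme, where `train_model` and the members' `predict` interface are assumed placeholders:

```python
# Sketch of a bootstrapped deep ensemble: mean of the member predictions as
# the output, their standard deviation as the uncertainty estimate.
import numpy as np

def train_ensemble(train_model, dataset, n_members: int = 3, seed: int = 0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # bootstrap aggregation: sample the training data with replacement
        indices = rng.integers(0, len(dataset), size=len(dataset))
        members.append(train_model([dataset[i] for i in indices]))
    return members

def ensemble_predict(members, x):
    preds = np.array([m.predict(x) for m in members])
    return preds.mean(), preds.std()
```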

Mean Variance Estimation (MVE).

The base model architecture is augmented to predict not only the target value $\hat{y}$ but also an uncertainty term $\sigma^2$ by adding additional fully connected layers to the final prediction network [28]. The training loss

$$\mathcal{L}_{\mathrm{MVE}} = \frac{1}{N}\sum_{i}^{N} \frac{\texttt{sg}(\sigma_i^{2\beta})}{2} \cdot \left(\frac{(y_i - \hat{y}_i)^2}{\sigma_i^2} + \log(\sigma_i^2)\right) \qquad (13)$$

optimizes both terms jointly. We specifically integrate the modification proposed by Seitzer et al. [34], which scales the loss by an additional factor of $\sigma^{2\beta}$ that does not contribute to the gradient (the stop-gradient operator $\texttt{sg}$ in Eq. (13)). Furthermore, during training, we follow best practices described by Sluijterman et al. [36] by using gradient clipping and including an MSE warm-up period before switching to the MVE loss. By combining these measures, we substantially mitigate the performance degradation otherwise reported in the literature.
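A minimal PyTorch sketch of this loss, where `var` denotes the predicted $\sigma^2$ (e.g. produced via a softplus activation to keep it positive); the MSE warm-up and gradient clipping are assumed to be handled in the surrounding training loop:

```python
# Sketch of the beta-NLL loss of Eq. (13); the detach() realizes the
# stop-gradient weighting sg(sigma^(2*beta)) of Seitzer et al.
import torch

def beta_nll_loss(y: torch.Tensor, y_hat: torch.Tensor,
                  var: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    weight = var.detach() ** beta  # scales the loss without gradient contribution
    nll = 0.5 * ((y - y_hat) ** 2 / var + torch.log(var))
    return (weight * nll).mean()
```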

Ensemble of mean-variance estimators (DE+MVE).

We combine deep ensembles and mean-variance estimation by constructing an ensemble of 3 independent MVE models, each of which predicts an individual mean and standard deviation, as proposed by Busk et al. [6]. The total uncertainty

$$\sigma^2 = \frac{1}{2}\left(\sigma^2_{\mathrm{DE}} + \bar{\sigma}^2_{\mathrm{MVE}}\right) \qquad (14)$$

is calculated as the average of the ensemble uncertainty and the mean MVE uncertainty.
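A minimal sketch of this combination for a single input, given the member means and their predicted variances:

```python
# Sketch of the combined DE+MVE uncertainty of Eq. (14).
import numpy as np

def de_mve_uncertainty(member_means, member_variances):
    means = np.asarray(member_means)
    variances = np.asarray(member_variances)
    var_de = means.var()        # spread of the member predictions
    var_mve = variances.mean()  # average predicted (aleatoric) variance
    return means.mean(), 0.5 * (var_de + var_mve)
```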

Table 1: Test set results of 5 independent repetitions of computational experiments on the ClogP dataset for different model architectures and uncertainty quantification methods. Values report the mean ± standard deviation over repetitions. For each combination of model and metric, the best result is marked with an asterisk (*) and the second-best with a dagger (†).
Model | UQ Method   | R² ↑        | ρ ↑          | UER-AUC_mean ↑ | UER-AUC_max ↑ | RLL ↑
–     | Random      | 1.00 ± 0.00 | 0.01 ± 0.03  | 0.01 ± 0.04    | 0.10 ± 0.10   | –
GCN   | DE          | 1.00 ± 0.00 | 0.41 ± 0.17  | 0.21 ± 0.06    | 0.36 ± 0.18   | 0.75 ± 0.02
GCN   | MVE         | 0.99 ± 0.01 | 0.45 ± 0.08  | 0.20 ± 0.04    | *0.63 ± 0.18  | †0.77 ± 0.03
GCN   | DE+MVE      | 1.00 ± 0.00 | *0.55 ± 0.10 | *0.25 ± 0.02   | †0.58 ± 0.20  | *0.78 ± 0.00
GCN   | SWAG        | 1.00 ± 0.00 | †0.50 ± 0.08 | †0.21 ± 0.03   | 0.50 ± 0.23   | 0.56 ± 0.11
GCN   | TS (eucl.)  | 0.99 ± 0.01 | 0.15 ± 0.17  | 0.15 ± 0.09    | 0.43 ± 0.26   | 0.69 ± 0.10
GCN   | TS (tanim.) | 1.00 ± 0.00 | 0.15 ± 0.05  | 0.11 ± 0.04    | 0.12 ± 0.10   | 0.39 ± 0.33
GATv2 | DE          | 1.00 ± 0.00 | †0.51 ± 0.11 | 0.22 ± 0.04    | 0.63 ± 0.28   | †0.73 ± 0.05
GATv2 | MVE         | 0.98 ± 0.03 | 0.48 ± 0.08  | †0.28 ± 0.06   | †0.72 ± 0.15  | 0.72 ± 0.08
GATv2 | DE+MVE      | 1.00 ± 0.00 | *0.64 ± 0.15 | *0.34 ± 0.02   | *0.75 ± 0.09  | *0.82 ± 0.03
GATv2 | SWAG        | 0.99 ± 0.00 | 0.49 ± 0.16  | 0.21 ± 0.02    | 0.61 ± 0.21   | -0.06 ± 0.47
GATv2 | TS (eucl.)  | 1.00 ± 0.00 | 0.07 ± 0.04  | 0.17 ± 0.07    | 0.59 ± 0.00   | 0.64 ± 0.01
GATv2 | TS (tanim.) | 1.00 ± 0.00 | 0.20 ± 0.04  | 0.13 ± 0.03    | 0.10 ± 0.07   | 0.59 ± 0.06
GIN   | DE          | 0.99 ± 0.01 | †0.62 ± 0.17 | †0.27 ± 0.06   | †0.70 ± 0.11  | *0.80 ± 0.04
GIN   | MVE         | 0.99 ± 0.01 | 0.48 ± 0.11  | 0.22 ± 0.05    | 0.56 ± 0.22   | 0.75 ± 0.05
GIN   | DE+MVE      | 1.00 ± 0.00 | *0.63 ± 0.05 | *0.29 ± 0.03   | *0.70 ± 0.15  | †0.78 ± 0.01
GIN   | SWAG        | 0.98 ± 0.02 | 0.58 ± 0.20  | 0.23 ± 0.07    | 0.58 ± 0.08   | 0.02 ± 0.43
GIN   | TS (eucl.)  | 0.99 ± 0.00 | 0.15 ± 0.12  | 0.17 ± 0.05    | 0.45 ± 0.22   | 0.64 ± 0.03
GIN   | TS (tanim.) | 0.99 ± 0.00 | 0.17 ± 0.08  | 0.13 ± 0.05    | 0.11 ± 0.11   | 0.52 ± 0.12
Stochastic Weight Averaging Gaussian (SWAG).

The training process is augmented to store snapshots of the model weights during the last 25 epochs. This history of model weights is then used to calculate a mean weight vector $\mu_{\theta}$ and a covariance matrix $\Sigma_{\theta}$ such that a new set of model weights can approximately be obtained by drawing from a Gaussian distribution $\theta \sim \mathcal{N}(\mu_{\theta}, \Sigma_{\theta})$. During inference, we sample 50 distinct sets of model weights from this distribution and obtain the target value prediction as the mean of the individual predictions and the uncertainty estimate as their standard deviation.
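A compact sketch of this procedure; for brevity, it uses a diagonal covariance approximation over the weight snapshots instead of the full covariance matrix:

```python
# Sketch of (diagonal) SWAG: collect weight snapshots late in training, fit a
# Gaussian over the weights, and sample new weight vectors at inference time.
import torch

class DiagonalSWAG:
    def __init__(self):
        self.snapshots = []

    def collect(self, model: torch.nn.Module):
        # called once per epoch during the final epochs of training
        flat = torch.nn.utils.parameters_to_vector(model.parameters())
        self.snapshots.append(flat.detach().clone())

    def sample_into(self, model: torch.nn.Module):
        stack = torch.stack(self.snapshots)      # (n_snapshots, n_params)
        mu, sigma = stack.mean(dim=0), stack.std(dim=0)
        theta = torch.normal(mu, sigma)          # theta ~ N(mu_theta, Sigma_theta)
        torch.nn.utils.vector_to_parameters(theta, model.parameters())
```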

Trust Scores (TS).

Unlike the previously described UQ methods, trust scores are independent of the predictive model and provide an uncertainty estimate based directly on the training data [8]. Originally introduced for classification problems, the trust score for a given input element $x$ is calculated as the ratio

$$T = \frac{\mathrm{dist}(x, x_s)}{\mathrm{dist}(x, x_o)} \qquad (15)$$

between the distance to the closest training element $x_s$ of the same class and the distance to the closest training element $x_o$ of a different class. We adapt this approach for regression tasks by using the distance to the closest training element. This definition relies on the existence of a suitable distance metric $\mathrm{dist}(x_i, x_j)$ between two input elements. In this study, we examine two such distance metrics. The first is the Tanimoto distance, calculated as the Jaccard distance between the Morgan fingerprint representations of the two molecules. The second is the Euclidean distance, measured between the graph embeddings generated by an intermediate layer of the graph neural network models.
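A minimal sketch of this regression adaptation using the embedding-based Euclidean variant; `train_embeddings` are assumed to be graph embeddings extracted from an intermediate model layer:

```python
# Sketch of the regression trust score: the distance to the nearest training
# element serves as the uncertainty proxy (larger distance = less trust).
from sklearn.neighbors import NearestNeighbors

def fit_trust_index(train_embeddings):
    return NearestNeighbors(n_neighbors=1).fit(train_embeddings)

def trust_uncertainty(index, test_embeddings):
    distances, _ = index.kneighbors(test_embeddings)
    return distances[:, 0]
```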

Uncertainty Calibration.

After training, we apply uncertainty calibration to each UQ method to align the predicted uncertainties with the scale of the actual prediction errors. For this purpose, we use a held-out validation set containing 10% of the data, on which we fit an isotonic regression model.
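A minimal sketch of this calibration step with scikit-learn, assuming validation-set uncertainties and absolute errors as inputs:

```python
# Sketch of uncertainty calibration: fit a monotone map from predicted
# uncertainties to observed absolute errors on a held-out validation set.
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(val_uncertainties, val_abs_errors) -> IsotonicRegression:
    return IsotonicRegression(out_of_bounds="clip").fit(
        val_uncertainties, val_abs_errors)

# calibrated_test_unc = fit_calibrator(sigma_val, err_val).predict(sigma_test)
```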

4.1.2 Uncertainty Quantification Metrics.

We evaluate the aforementioned UQ methods with the following metrics.

Uncertainty-Error Correlation ρ.

The Pearson correlation coefficient between the absolute prediction errors $|\hat{y} - y|$ and the predicted uncertainties $\sigma^2$ on the elements of the test set.

Error Reduction Potential UER-AUC.

As described in Section 3, the UER-AUC is the area under the curve that maps relative uncertainty thresholds to the resulting relative error reduction. For each uncertainty threshold, all elements with a higher predicted uncertainty are omitted from the test set. The relative error reduction describes the reduction of the cumulative error of the remaining elements relative to the full set.

Relative Log Likelihood RLL.

Following the work of Kellner and Ceriotti [14], we use the Relative Log Likelihood

$$\mathrm{RLL} = \frac{\sum_i \left[ \mathrm{NLL}(\hat{y}_i - y_i, \sigma^2_i) - \mathrm{NLL}(\hat{y}_i - y_i, \mathrm{RMSE}) \right]}{\sum_i \left[ \mathrm{NLL}(\hat{y}_i - y_i, |\hat{y}_i - y_i|) - \mathrm{NLL}(\hat{y}_i - y_i, \mathrm{RMSE}) \right]} \qquad (16)$$

which standardizes the arbitrary range of the Negative Log Likelihood

$$\mathrm{NLL}(\Delta y, \sigma^2) = \frac{1}{2}\left(\frac{\Delta y^2}{\sigma^2} + \log 2\pi\sigma^2\right) \qquad (17)$$

into the more interpretable range $(-\infty, 1]$.
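A minimal sketch of this metric, reading the second argument of the NLL as a variance, so that the RMSE baseline enters as the squared RMSE and the oracle reference as the squared residuals:

```python
# Sketch of the RLL of Eqs. (16)-(17): 1 means oracle-like uncertainties,
# 0 means no more informative than a constant RMSE-sized uncertainty.
import numpy as np

def gaussian_nll(residuals, var):
    return 0.5 * (residuals**2 / var + np.log(2 * np.pi * var))

def rll(y, y_hat, var):
    res = np.asarray(y_hat) - np.asarray(y)
    var = np.asarray(var)
    rmse_var = np.mean(res**2)                    # squared-RMSE baseline variance
    nll_model = gaussian_nll(res, var).sum()
    nll_base = gaussian_nll(res, rmse_var).sum()
    nll_oracle = gaussian_nll(res, res**2).sum()  # "perfect" per-sample variance
    return (nll_model - nll_base) / (nll_oracle - nll_base)
```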

4.2 Experiments on Error Reduction Potential

4.2.1 Impact of GNN Model and UQ Method on Error Reduction.

In this first experiment, we evaluate the impact of the model architecture and uncertainty quantification method on uncertainty-based error reduction. The experiment is based on the ClogP dataset, which consists of roughly 11k small molecules annotated with values of Crippen’s logP [47] calculated by RDKit [17]. This logP value is an algorithmically calculated and deterministic property—making it possible to near-perfectly regress it with machine learning models.

In terms of model choice, we compare three standard GNN architectures based on the GCN [15], GATv2 [5], and GIN [50] layer types, respectively. For each repetition of the experiment, we randomly choose 10% of the dataset as the test set and 10% as the calibration set, and train the model on the remainder. The test set can therefore be considered IID w.r.t. the training distribution.

Table 1 shows the results of the first experiment. A "Random" baseline, generating random uncertainty values, was included as a control. As expected, this baseline demonstrates negligible error reduction, reflecting the absence of correlation between assigned uncertainty and prediction error. In contrast, the remaining uncertainty quantification methods exhibit varying degrees of error reduction.

Using trust scores with the input-based Tanimoto distance yields substantially worse results than the embedding-based Euclidean distance. Contrary to the encouraging results of Delaney et al. [8], we believe trust scores underperform in this particular application due to the challenge of defining suitable distance metrics on graph-structured data [48].

Overall, we find deep ensembles, mean-variance estimation, and their combination to work best in terms of error reduction potential as well as relative log likelihood. Among these methods, we observe a slight advantage in mean error reduction for the combined ensemble and mean-variance estimation approach.

Moreover, regarding the different model architectures (GCN, GATv2, and GIN), we observe comparable results, both in terms of predictive performance ($R^2 \geq 0.99$) and in terms of the uncertainty quantification methods. Based on these initial observations, model architecture appears to have a limited effect on the relative performance of the uncertainty quantification methods. Consequently, subsequent experiments were conducted using the GATv2 architecture, which exhibited the highest mean error reduction potential in this experiment.

Figure 3: Results for 5 independent repetitions of a GATv2 trained on the ClogP dataset, with uncertainties estimated by the combination of ensembles and mean variance estimation in the OOD-Value scenario. Panels from left to right illustrate the correlation between predicted uncertainty and model error, the mean error reduction potential, and the max error reduction potential through filtering by uncertainty thresholds. Faint lines represent the results of individual runs; bold lines represent the overall average.

4.2.2 Out-of-distribution Effect on Error Reduction.

The previous experiment examined the error reduction potential on a randomly sampled IID test set of the ClogP dataset. However, a critical aspect of counterfactual analysis involves identifying input perturbations that yield out-of-distribution (OOD) samples. To address this, we established two OOD scenarios for the ClogP dataset. The first, designated OOD-Struct, employs a scaffold split, where the test set comprises molecules with structural scaffolds absent from the training set. The second, OOD-Value, involves a split where the test set contains approximately the 10% most extreme target values, which are not represented in the training set. Based on the results of the previous experiment, we restrict all scenarios to the GATv2 model architecture and compare uncertainty estimation based on ensembles, mean variance estimation, and the combination thereof.

Table 2: Test set results of 5 independent repetitions of computational experiments on the ClogP dataset for different out-of-distribution scenarios and uncertainty quantification methods. The best result for each scenario and metric is marked with an asterisk (*), the second-best with a dagger (†). Results were obtained with a GATv2 model architecture.
Scenario   | UQ Method | R² ↑        | ρ ↑          | UER-AUC_mean ↑ | UER-AUC_max ↑ | RLL ↑
OOD-Struct | DE        | 1.00 ± 0.00 | †0.45 ± 0.05 | *0.23 ± 0.04   | *0.66 ± 0.07  | †0.34 ± 0.06
OOD-Struct | MVE       | 0.99 ± 0.00 | 0.32 ± 0.15  | 0.14 ± 0.06    | 0.20 ± 0.11   | 0.41 ± 0.03
OOD-Struct | DE+MVE    | 1.00 ± 0.00 | *0.46 ± 0.05 | †0.21 ± 0.04   | †0.42 ± 0.17  | *0.55 ± 0.00
OOD-Value  | DE        | 0.97 ± 0.01 | †0.62 ± 0.07 | *0.71 ± 0.10   | *0.82 ± 0.07  | †-3.79 ± 2.92
OOD-Value  | MVE       | 0.99 ± 0.00 | 0.50 ± 0.08  | 0.51 ± 0.11    | 0.36 ± 0.27   | -8.04 ± 9.72
OOD-Value  | DE+MVE    | 0.98 ± 0.00 | *0.66 ± 0.04 | †0.67 ± 0.05   | †0.77 ± 0.01  | *-1.49 ± 1.09

Table 2 reports the results of the second experiment. For the OOD-Struct scenario, we observe slightly worse results than for the IID case. All three methods show lower correlation, error reduction potential, and relative log likelihood.

Conversely, the OOD-Value scenario exhibits substantial performance gains relative to the IID case: both the mean and max error reduction potential decisively exceed the UER-AUC = 0.5 threshold. Only the negative RLL values indicate poorly calibrated uncertainty estimates with respect to the actual prediction error. This is to be expected, since the calibration set was sampled IID while the test set contains previously unseen target values, likely resulting in vastly different error scales.

When comparing the different UQ methods, the ensembles by themselves and the combination of ensembles and MVE seem to perform equally well. For both scenarios, OOD-Struct and OOD-Value, the ensembles seem to offer higher error reduction potential, while the combination seems to offer better-calibrated uncertainty estimates, as indicated by the higher RLL values.

In summary, uncertainty-based filtering demonstrates a moderate error reduction effect on in-distribution data and structural outliers. Notably, the error reduction potential increases substantially under a distributional shift of the target values (see Figure 3). These results provide a foundation for filtering counterfactuals, where perturbations can be expected to create outliers with respect to both structure and target value.
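For reference, the error-reduction curve underlying the UER-AUC can be sketched as follows; this is a simplified reading of the metric defined in Section 3, and the step count and normalization shown here are assumptions rather than our exact implementation:

```python
import numpy as np

def uer_auc(errors, uncertainties, reduce=np.mean, steps=20):
    """Sketch of the uncertainty-error-reduction AUC: successively remove the
    highest-uncertainty samples and integrate the relative reduction of the
    mean (or max) error of the remaining set over the removed fraction."""
    errors = np.asarray(errors, dtype=float)
    order = np.argsort(uncertainties)             # ascending uncertainty
    base = reduce(errors)
    fractions = np.linspace(0.0, 0.9, steps)      # fraction of samples removed
    curve = [1.0 - reduce(errors[order[: len(errors) - int(f * len(errors))]]) / base
             for f in fractions]
    return np.trapz(curve, fractions) / (fractions[-1] - fractions[0])
```

Here, `uer_auc(errors, uncertainties)` and `uer_auc(errors, uncertainties, reduce=np.max)` would correspond to the mean and max variants reported in the tables.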

Table 3: Test set results of 5 independent repetitions of computational experiments to evaluate uncertainty-based error reduction on various molecular property prediction datasets. The first row represents the previously introduced deterministic ClogP graph regression task, and the following rows represent various real-world molecular property regression datasets. Results are obtained by a GATv2 graph neural network, and uncertainties are estimated by a method combining deep ensembles and mean-variance estimation.

| Dataset | Property | R² ↑ | ρ ↑ | UER-AUC_mean ↑ | UER-AUC_max ↑ | RLL ↑ |
|---|---|---|---|---|---|---|
| ClogP | logP | 1.00 ± 0.00 | 0.58 ± 0.17 | 0.27 ± 0.05 | 0.66 ± 0.22 | 0.76 ± 0.02 |
| AqSolDB [37] | logS | 0.88 ± 0.02 | 0.35 ± 0.05 | 0.24 ± 0.02 | 0.26 ± 0.17 | 0.45 ± 0.03 |
| Lipop [49] | logD | 0.74 ± 0.03 | 0.15 ± 0.06 | 0.10 ± 0.02 | 0.22 ± 0.12 | 0.32 ± 0.02 |
| COMPAS [45] | rel. Ener. | 0.90 ± 0.05 | 0.65 ± 0.04 | 0.37 ± 0.03 | 0.45 ± 0.11 | 0.66 ± 0.03 |
| COMPAS [45] | GAP | 0.97 ± 0.01 | 0.44 ± 0.05 | 0.27 ± 0.05 | 0.59 ± 0.20 | 0.71 ± 0.01 |
| QM9 [31] | Dip. Mom. | 0.78 ± 0.00 | 0.57 ± 0.01 | 0.45 ± 0.01 | 0.76 ± 0.03 | 0.53 ± 0.00 |
| QM9 [31] | HOMO | 0.93 ± 0.00 | 0.54 ± 0.02 | 0.23 ± 0.01 | 0.61 ± 0.10 | 0.63 ± 0.01 |
| QM9 [31] | LUMO | 0.99 ± 0.00 | 0.48 ± 0.02 | 0.23 ± 0.01 | 0.67 ± 0.02 | 0.73 ± 0.00 |
| QM9 [31] | GAP | 0.97 ± 0.00 | 0.52 ± 0.02 | 0.25 ± 0.01 | 0.76 ± 0.04 | 0.68 ± 0.00 |
Figure 4: Qualitative results of uncertainty-based counterfactual filtering for two example molecules. Predictions are made by a GATv2 graph neural network, and uncertainties are estimated by a combination of ensembling and mean-variance estimation. The uncertainty threshold ξ₂₀ was chosen on the test set such that the 20% of elements with the lowest uncertainty remain.

4.2.3 Error Reduction on Real-World Datasets.

Previous experiments were based on the ClogP dataset, whose target property is deterministically computable and therefore relatively easy to regress. To assess the generalizability of these findings to more complex scenarios, computational experiments were conducted on multiple properties derived from the AqSolDB [37], Lipop [49], COMPAS [45], and QM9 [31] datasets. Based on the results of previous experiments, we use the GATv2 model to predict each property and a combination of ensembles and mean-variance estimation for uncertainty quantification.
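As an illustration of how the two techniques are combined, the ensemble-of-MVE uncertainty can be computed via the law of total variance. This is a generic sketch rather than our exact implementation; `models` is a hypothetical list of networks whose `predict` method returns a predicted mean and variance:

```python
import numpy as np

def ensemble_mve_predict(models, x):
    """Combine an ensemble of mean-variance estimators via the law of total
    variance: total = variance of member means (epistemic, disagreement)
                    + mean of predicted variances (aleatoric, noise)."""
    means, variances = zip(*(m.predict(x) for m in models))
    means, variances = np.asarray(means), np.asarray(variances)
    return means.mean(axis=0), means.var(axis=0) + variances.mean(axis=0)
```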

Table 3 presents the results for the real-world property regression datasets. Despite varying levels of predictivity (R² ∈ [0.74, 0.99]) across datasets, some degree of error reduction can be reported for each one (UER-AUC_mean ∈ [0.10, 0.45]). Notably, the highest error reduction is found for the prediction of the dipole moment in the QM9 dataset, with a mean error reduction of UER-AUC_mean = 0.45 and a max error reduction of UER-AUC_max = 0.76. In contrast, the lowest error reduction is observed for the prediction of lipophilicity, with a mean error reduction of only UER-AUC_mean = 0.10.

The extent of the error reduction potential does not appear to correlate strongly with the predictive performance of the model, as both the highest and lowest error reductions were associated with models demonstrating lower predictivity (R² ≈ 0.7). In addition, even models with high predictivity, such as the prediction of the LUMO value (R² = 0.99), show moderate amounts of error reduction potential (UER-AUC_mean = 0.23, UER-AUC_max = 0.67). We hypothesize that the error reduction may be connected to the complexity of the underlying data distribution and the presence of labeling noise. The Lipophilicity dataset, for example, consists of inherently noisy experimental measurements, while values for the dipole moment in the QM9 dataset were obtained by more precise DFT simulations.

Overall, the results of this experiment indicate that uncertainty threshold-based filtering can be used as an effective tool to decrease the overall prediction error even for complex properties, which may have been obtained through noisy measurements.

4.3 Experiments on Counterfactual Truthfulness

In the second part of the computational experiments, we investigate the potential of uncertainty-based filtering to improve the overall truthfulness of counterfactual explanations.

4.3.1 Improving Counterfactual Truthfulness.

For this experiment, we use the ClogP dataset, as the underlying property is deterministically calculable for any valid molecular graph. This availability of a ground-truth oracle is necessary for the computation of the relative truthfulness as defined in Section 3.3. As before, we use the GATv2 model architecture and investigate the effectiveness of ensembles, mean-variance estimation, and the combination thereof. We split the dataset into a test set (10%), a calibration set (20%), and a train set (70%). All models are fitted on the train set, and uncertainty estimates are subsequently calibrated on the calibration set. On the test set, we determine a single uncertainty threshold ξ₂₀ such that exactly the 20% of elements with the lowest predicted uncertainties remain.
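For illustration, the ground-truth oracle and the threshold ξ₂₀ might look as follows; the Wildman-Crippen logP from RDKit [17, 47] serves as the oracle, while the quantile-based threshold selection is a plausible reading of the procedure described above rather than a verbatim reproduction of it:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Crippen

def clogp_oracle(smiles: str) -> float:
    """Deterministic Wildman-Crippen logP for any valid molecule."""
    return Crippen.MolLogP(Chem.MolFromSmiles(smiles))

def uncertainty_threshold(test_uncertainties, keep_fraction=0.2):
    """xi_20: the threshold below which the 20% lowest-uncertainty elements fall."""
    return np.quantile(test_uncertainties, keep_fraction)
```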

As described in Section 3.2, we generate counterfactual samples by ranking all graphs in a 1-edit neighborhood according to their prediction divergence and choosing the top 10 candidates. This set of counterfactuals is then filtered using the threshold ξ₂₀ and examined regarding its relative truthfulness.
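A compact sketch of this generation-and-filtering step is given below; `one_edit_neighborhood` is a hypothetical helper that enumerates all valid single-edit modifications of a molecular graph, and `model.predict` is assumed to return a prediction together with its estimated uncertainty:

```python
def generate_counterfactuals(model, smiles, top_k=10):
    """Rank all graphs in the 1-edit neighborhood by prediction divergence
    and keep the top_k candidates."""
    y_orig, _ = model.predict(smiles)
    candidates = one_edit_neighborhood(smiles)  # hypothetical enumeration helper
    ranked = sorted(candidates,
                    key=lambda c: abs(model.predict(c)[0] - y_orig),
                    reverse=True)
    return ranked[:top_k]

def filter_counterfactuals(model, counterfactuals, xi):
    """Keep only candidates whose predicted uncertainty stays below xi."""
    return [c for c in counterfactuals if model.predict(c)[1] <= xi]
```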

The results of this experiment are reported in Table 4. A "Random" baseline was included as a control; as expected, its randomly generated uncertainty values result neither in test-set error reduction nor in an increase in counterfactual truthfulness. All other uncertainty quantification (UQ) methods demonstrate a moderate potential for error reduction on both the test set and the set of counterfactuals (UER-AUC ≥ 0.2). Furthermore, all UQ methods exhibit some capability to increase relative truthfulness when filtering with the uncertainty threshold ξ₂₀. However, the initial truthfulness of the unfiltered set of counterfactuals is already rather high (up to 95%), leaving little room for further improvement. Notably, the mean-variance estimation model displays a substantially lower initial truthfulness (0.77), most likely due to its slightly lower predictivity.

In addition to the results for the fixed uncertainty threshold ξ₂₀, Figure 5 visualizes the progression of mean error reduction and truthfulness across a range of possible uncertainty thresholds. The plots show that, for increasingly strict uncertainty thresholds, the relative counterfactual truthfulness increases near-monotonically, reaching 100% when only a small subset of about 5% of the counterfactuals remains.
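The curves in Figure 5 correspond to sweeping the threshold over the uncertainty quantiles of the counterfactual set, roughly as sketched below; `truthful` denotes a hypothetical per-candidate indicator obtained from the ground-truth oracle as defined in Section 3.3:

```python
import numpy as np

def threshold_sweep(uncertainties, errors, truthful, n_steps=20):
    """Trace mean error reduction, truthfulness, and remaining fraction of the
    counterfactual set while tightening the uncertainty threshold."""
    uncertainties = np.asarray(uncertainties)
    errors = np.asarray(errors, dtype=float)
    truthful = np.asarray(truthful, dtype=bool)
    base_error = errors.mean()
    rows = []
    for q in np.linspace(0.05, 1.0, n_steps):
        keep = uncertainties <= np.quantile(uncertainties, q)
        rows.append((1.0 - errors[keep].mean() / base_error,  # rel. error reduction
                     truthful[keep].mean(),                   # truthfulness
                     keep.mean()))                            # remaining fraction
    return rows
```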

In summary, we find that all UQ methods exhibit some capacity to improve the relative counterfactual truthfulness through uncertainty-based filtering. However, the results may be influenced by the already elevated values observed for the unfiltered set. Future work should explore more complex property prediction tasks with lower predictive performance and, consequently, lower starting points of counterfactual truthfulness.

Table 4: Results of 5 independent repetitions of computational experiments on the ClogP dataset to evaluate counterfactual truthfulness using a fixed uncertainty threshold ξ₂₀ determined on the test set. Results are obtained using a GATv2 graph neural network with the listed uncertainty quantification methods. Tr. Init.† denotes the initial percentage of truthful counterfactuals in the unfiltered set of all counterfactuals. Tr. Gain‡ denotes the increase in the relative percentage of truthful counterfactuals after filtering according to the uncertainty threshold ξ₂₀.
| Method | Test set: R² ↑ | Test set: UER-AUC_mean ↑ | Counterfactuals: ρ ↑ | Counterfactuals: UER-AUC_mean ↑ | Tr. Init.† (%) ↑ | Tr. Gain‡ (%) ↑ |
|---|---|---|---|---|---|---|
| Random | 1.00 ± 0.01 | −0.00 ± 0.06 | −0.01 ± 0.03 | −0.02 ± 0.03 | 0.95 ± 0.03 | −0.00 ± 0.05 |
| MVE | 0.98 ± 0.02 | 0.26 ± 0.05 | 0.53 ± 0.15 | 0.23 ± 0.05 | 0.77 ± 0.17 | 0.09 ± 0.05 |
| DE | 1.00 ± 0.00 | 0.27 ± 0.04 | 0.45 ± 0.12 | 0.20 ± 0.05 | 0.94 ± 0.03 | 0.04 ± 0.05 |
| DE+MVE | 1.00 ± 0.00 | 0.28 ± 0.03 | 0.44 ± 0.16 | 0.23 ± 0.05 | 0.95 ± 0.02 | 0.05 ± 0.02 |
Figure 5: Results of 5 independent repetitions of computational experiments on the ClogP dataset to evaluate counterfactual truthfulness. Individual results are plotted transparently in the background, and average curves are indicated with bold lines. All plots are based on the set of counterfactuals and show, from left to right, the relative mean error reduction, the truthfulness, and the percentage of remaining counterfactuals for different uncertainty thresholds. Results are obtained by a GATv2 graph neural network, and uncertainties are estimated by an ensemble of mean-variance estimators.

4.3.2 Qualitative Filtering Results.

Besides the quantitative evaluation of counterfactual truthfulness, qualitative results of uncertainty-based filtering are illustrated in Figure 4 for two example molecules. Uncertainty estimates are obtained by an ensemble of mean-variance estimators. As before, the uncertainty threshold ξ₂₀ is chosen such that only the 20% of test-set elements with the lowest uncertainty values remain.

For the first molecule, benzoic acid, only the highest-ranked counterfactual candidate A is filtered out for exceeding the uncertainty threshold. This exclusion is intuitive, since the added bond between oxygen and nitrogen is an uncommon configuration that is not represented in the underlying dataset and can be considered an out-of-distribution input. In contrast to this expected behavior, counterfactual candidate B is not filtered, although it represents an equally uncommon configuration in which one carbon is connected to two single-bonded oxygen atoms at the same time. Notably, the model also predicts a substantially incorrect value for this counterfactual candidate.

For the second molecule, aspirin, the four highest-ranked counterfactuals are filtered based on their predicted uncertainty. The exclusion of counterfactual candidate E is likewise intuitive, since it includes the same uncommon bond between nitrogen and oxygen. The excluded counterfactual candidate D also contains a rather uncommon substructure but, more importantly, is predicted highly inaccurately by the model. In contrast to these cases, the highest-ranked counterfactual candidate C is excluded even though the model's prediction is highly accurate, serving as an example of overly conservative filtering.

Overall, the qualitative examples illustrate that uncertainty-based filtering can be effective in identifying and removing out-of-distribution inputs and generally inaccurate predictions. However, there are also failure cases, in which the method either misses OOD samples or is too conservative and filters perfectly accurate predictions.

5 Discussion

Previous work has investigated the intersection of uncertainty estimation and counterfactual explainability predominantly in the context of image classification [8, 33, 2]. Schut et al. [33], for example, include an ensemble-based uncertainty estimate as a direct objective in the optimization of counterfactual explanations. The authors find this intervention to reduce the likelihood of generating uninformative out-of-distribution samples, or, in the words of Freiesleben [10], to steer the generation toward true counterfactual explanations rather than mere adversarial examples.

In our work, we present a perspective distinct from the existing literature in two important aspects: we focus on (1) regression tasks in the (2) graph processing domain. In image processing, good counterfactual explanations require the modification of many pixel values in a semantically meaningful way, which is framed as a non-trivial optimization objective requiring substantial computational effort. In the graph processing domain, however, the limited number of possible graph modifications makes it computationally feasible to search for counterfactual candidates within a full enumeration of all possible perturbations. Consequently, uncertainty quantification does not have to be included in the generation process itself but may instead serve as a simple filter over this set of possible perturbations. Nevertheless, the objective is the same: to use uncertainty quantification methods to present higher-quality counterfactual explanations to the user.

Another key factor is the difference between classification and regression tasks. While classification enables binary assessments of correct and incorrect predictions, regression operates on a continuous error scale, requiring different metrics to assess the impact of uncertainty estimation—motivating our definitions of the UER-AUC and the counterfactual truthfulness.

Consistent with existing literature, our work demonstrates that incorporating uncertainty estimation improves the quality of counterfactual explanations. We specifically find that filtering high-uncertainty elements decreases the average error of the remaining set and increases overall truthfulness—meaning the explanation’s alignment with the underlying ground truth data distribution.

In our experiments, we find no substantial differences in the relative effectiveness of UQ interventions among three common graph neural network architectures (GCN, GATv2, GIN). Regarding the choice of the uncertainty estimation method, we find trust scores [8, 33] to be ill-suited to graph processing applications, most likely due to the unavailability of suitable distance metrics. We furthermore come to conclusions similar to those of previous authors [32, 13, 6], in that the simple application of model ensembles already proves relatively effective. While we find the combination of ensembles and mean-variance estimation to be slightly beneficial on IID data, there seems to be no substantial difference in OOD test scenarios.

However, it is important to note the remaining limitations of this approach, which are grounded in the imperfect correlation between prediction errors and estimated uncertainties. While the quantitative results show that uncertainty-based filtering is comparatively more likely to remove truly high-error samples, the qualitative results indicate that it still occasionally fails to detect some high-error samples and mistakenly filters valid elements. Whether the increased truthfulness reasonably justifies the loss of some valid explanations will therefore largely depend on the concrete application and the severity of these failure cases.

6 Conclusion

Counterfactual explanations can deepen the understanding of a complex model's predictive behavior by illustrating the kinds of local perturbations to which a model is especially sensitive. In the scientific domains of chemistry and materials science, these explanations are often desirable not only for understanding the model's behavior but, by extension, for understanding the structure-property relationships of the underlying data itself. To use counterfactuals to gain insights about the underlying data, the explanations must truthfully reflect the properties thereof.

In this work, we explore the potential of uncertainty estimation to increase the overall truthfulness of a set of counterfactuals by filtering those elements with particularly high predicted uncertainty. We conduct extensive computational experiments to investigate the error-reduction potential of various uncertainty estimation methods in different settings. We find that model ensembles provide strong uncertainty estimates in out-of-distribution test scenarios, while a combination of ensembles and mean-variance estimation provides the highest error reduction potential in in-distribution settings.

Based on these initial results, we conclude that uncertainty estimation presents a promising opportunity to increase the truthfulness of explanations—to make sure explanations not only represent the model’s behavior but the properties of the underlying data as well. An interesting direction for future research will be to see if uncertainty estimation can be employed equally beneficially to different explanation modalities, such as local attributional and global concept-based explanations.


6.0.1 Acknowledgements

This work was supported by funding from the pilot program Core-Informatics of the Helmholtz Association (HGF).

6.0.2 Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, December 2021.
  • [2] Javier Antorán, Umang Bhatt, T. Adel, Adrian Weller, and José Miguel Hernández-Lobato. Getting a CLUE: A Method for Explaining Uncertainty Estimates. International Conference on Learning Representations, 2020.
  • [3] Mohit Bajaj, Lingyang Chu, Zihui Xue, J. Pei, Lanjun Wang, P. C. Lam, and Yong Zhang. Robust Counterfactual Explanations on Graph Neural Networks. ArXiv, July 2021.
  • [4] Christopher M. Bishop. Bayesian Neural Networks. Journal of the Brazilian Computer Society, 4:61–68, July 1997.
  • [5] Shaked Brody, Uri Alon, and Eran Yahav. How Attentive are Graph Attention Networks? In International Conference on Learning Representations, February 2022.
  • [6] Jonas Busk, Peter Bjørn Jørgensen, Arghya Bhowmik, Mikkel N Schmidt, Ole Winther, and Tejs Vegge. Calibrated uncertainty for molecular property prediction using ensembles of message passing neural networks. Machine Learning: Science and Technology, 3(1):015012, December 2021.
  • [7] Francesco D'Angelo and Vincent Fortuin. Repulsive Deep Ensembles are Bayesian. In Advances in Neural Information Processing Systems, volume 34, pages 3451–3465. Curran Associates, Inc., 2021.
  • [8] Eoin Delaney, Derek Greene, and Mark T. Keane. Uncertainty Estimation and Out-of-Distribution Detection for Counterfactual Explanations: Pitfalls and Solutions, July 2021.
  • [9] Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [cs, stat], March 2017.
  • [10] Timo Freiesleben. The Intriguing Relation Between Counterfactual Explanations and Adversarial Examples. Minds and Machines, 32(1):77–109, March 2022.
  • [11] Jakob Gawlikowski, Cedrique Rovile Njieutcheu Tassi, Mohsin Ali, Jongseok Lee, Matthias Humt, Jianxiang Feng, Anna Kruspe, Rudolph Triebel, Peter Jung, Ribana Roscher, Muhammad Shahzad, Wen Yang, Richard Bamler, and Xiao Xiang Zhu. A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(1):1513–1589, October 2023.
  • [12] Ethan Goan and Clinton Fookes. Bayesian Neural Networks: An Introduction and Survey. In Kerrie L. Mengersen, Pierre Pudlo, and Christian P. Robert, editors, Case Studies in Applied Bayesian Data Science: CIRM Jean-Morlet Chair, Fall 2018, pages 45–87. Springer International Publishing, Cham, 2020.
  • [13] Lior Hirschfeld, Kyle Swanson, Kevin Yang, Regina Barzilay, and Connor W. Coley. Uncertainty Quantification Using Neural Networks for Molecular Property Prediction. Journal of Chemical Information and Modeling, 60(8):3770–3780, August 2020.
  • [14] Matthias Kellner and Michele Ceriotti. Uncertainty quantification by direct propagation of shallow ensembles. Machine Learning: Science and Technology, 5(3):035006, July 2024.
  • [15] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations. arXiv, 2017.
  • [16] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [17] G Landrum. RDKit: Open-source cheminformatics, 2010.
  • [18] D. Ley, Umang Bhatt, and Adrian Weller. δ-CLUE: Diverse Sets of Explanations for Uncertainty Estimates. ArXiv, April 2021.
  • [19] Dan Ley, Umang Bhatt, and Adrian Weller. Diverse, Global and Amortised Counterfactual Explanations for Uncertainty Estimates. Proceedings of the AAAI Conference on Artificial Intelligence, 36(7):7390–7398, June 2022.
  • [20] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, and Jun Zhu. Understanding and Accelerating Particle-Based Variational Inference. In International Conference on Machine Learning, July 2018.
  • [21] Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • [22] Luca Longo, Mario Brcic, Federico Cabitza, Jaesik Choi, Roberto Confalonieri, Javier Del Ser, Riccardo Guidotti, Yoichi Hayashi, Francisco Herrera, Andreas Holzinger, Richard Jiang, Hassan Khosravi, Freddy Lecue, Gianclaudio Malgieri, Andrés Páez, Wojciech Samek, Johannes Schneider, Timo Speith, and Simone Stumpf. Explainable Artificial Intelligence (XAI) 2.0: A manifesto of open challenges and interdisciplinary research directions. Information Fusion, 106:102301, June 2024.
  • [23] Ana Lucic, Maartje ter Hoeve, Gabriele Tolomei, M. de Rijke, and F. Silvestri. CF-GNNExplainer: Counterfactual Explanations for Graph Neural Networks. In International Conference on Artificial Intelligence and Statistics, February 2021.
  • [24] Phillip M. Maffettone, Pascal Friederich, Sterling G. Baird, Ben Blaiszik, Keith A. Brown, Stuart I. Campbell, Orion A. Cohen, Rebecca L. Davis, Ian T. Foster, Navid Haghmoradi, Mark Hereld, Howie Joress, Nicole Jung, Ha-Kyung Kwon, Gabriella Pizzuto, Jacob Rintamaki, Casper Steinmann, Luca Torresi, and Shijing Sun. What is missing in autonomous discovery: Open challenges for the community. Digital Discovery, 2(6):1644–1659, 2023.
  • [25] Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A Simple Baseline for Bayesian Uncertainty in Deep Learning. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [26] Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38, February 2019.
  • [27] Tri Minh Nguyen, Thomas P Quinn, Thin Nguyen, and Truyen Tran. Explaining Black Box Drug Target Prediction Through Model Agnostic Counterfactual Samples. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(2):1020–1029, March 2023.
  • [28] D.A. Nix and A.S. Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), volume 1, pages 55–60 vol.1, June 1994.
  • [29] Danilo Numeroso and Davide Bacciu. MEG: Generating Molecular Counterfactual Explanations for Deep Graph Networks. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8, July 2021.
  • [30] Mario Alfonso Prado-Romero, Bardh Prenkaj, Giovanni Stilo, and Fosca Giannotti. A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation, and Research Challenges. ACM Comput. Surv., 56(7):171:1–171:37, April 2024.
  • [31] Raghunathan Ramakrishnan, Pavlo O. Dral, Matthias Rupp, and O. Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):140022, August 2014.
  • [32] Gabriele Scalia, Colin A. Grambow, Barbara Pernici, Yi-Pei Li, and William H. Green. Evaluating Scalable Uncertainty Estimation Methods for Deep Learning-Based Molecular Property Prediction. Journal of Chemical Information and Modeling, 60(6):2697–2717, June 2020.
  • [33] L. Schut, Oscar Key, R. McGrath, Luca Costabello, Bogdan Sacaleanu, Medb Corcoran, and Y. Gal. Generating Interpretable Counterfactual Explanations By Implicit Minimisation of Epistemic and Aleatoric Uncertainties. In International Conference on Artificial Intelligence and Statistics, March 2021.
  • [34] Maximilian Seitzer, Arash Tavakoli, Dimitrije Antic, and Georg Martius. On the Pitfalls of Heteroscedastic Uncertainty Estimation with Probabilistic Neural Networks. In International Conference on Learning Representations. arXiv, 2022.
  • [35] Dominik Seuss. Bridging the Gap Between Explainable AI and Uncertainty Quantification to Enhance Trustability. ArXiv, May 2021.
  • [36] Laurens Sluijterman, Eric Cator, and Tom Heskes. Optimal training of Mean Variance Estimation neural networks. Neurocomputing, 597:127929, September 2024.
  • [37] Murat Cihan Sorkun, Abhishek Khetan, and Süleyman Er. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Scientific Data, 6(1):143, August 2019.
  • [38] Hunter Sturm, Jonas Teufel, Kaitlin A. Isfeld, Pascal Friederich, and Rebecca L. Davis. Mitigating Molecular Aggregation in Drug Discovery with Predictive Insights from Explainable AI. 2023.
  • [39] Juntao Tan, Shijie Geng, Zuohui Fu, Yingqiang Ge, Shuyuan Xu, Yunqi Li, and Yongfeng Zhang. Learning and Evaluating Graph Neural Network Explanations based on Counterfactual and Factual Reasoning. In Proceedings of the ACM Web Conference 2022, pages 1018–1027, April 2022.
  • [40] Jonas Teufel and Pascal Friederich. Global Concept Explanations for Graphs by Contrastive Learning. In Luca Longo, Sebastian Lapuschkin, and Christin Seifert, editors, Explainable Artificial Intelligence, pages 184–208, Cham, 2024. Springer Nature Switzerland.
  • [41] Jonas Teufel, Luca Torresi, Patrick Reiser, and Pascal Friederich. MEGAN: Multi-explanation Graph Attention Network. In Luca Longo, editor, Explainable Artificial Intelligence, pages 338–360, Cham, 2023. Springer Nature Switzerland.
  • [42] N. Tishby, E. Levin, and S. A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalizations. In International 1989 Joint Conference on Neural Networks, pages 403–409 vol.2, June 1989.
  • [43] Trung Trinh, Markus Heinonen, Luigi Acerbi, and Samuel Kaski. Input-gradient space particle inference for neural network ensembles. In The Twelfth International Conference on Learning Representations, October 2023.
  • [44] Sahil Verma, Varich Boonsanong, Minh Hoang, Keegan Hines, John Dickerson, and Chirag Shah. Counterfactual Explanations and Algorithmic Recourses for Machine Learning: A Review. ACM Comput. Surv., 56(12):312:1–312:42, October 2024.
  • [45] Alexandra Wahab, Lara Pfuderer, Eno Paenurk, and Renana Gershoni-Poranne. The COMPAS Project: A Computational Database of Polycyclic Aromatic Systems. Phase 1: Cata-Condensed Polybenzenoid Hydrocarbons. Journal of Chemical Information and Modeling, 62(16):3704–3713, August 2022.
  • [46] Geemi P. Wellawatte, Aditi Seshadri, and Andrew D. White. Model agnostic generation of counterfactual explanations for molecules. Chemical Science, 13(13):3697–3705, 2022.
  • [47] Scott A. Wildman and Gordon M. Crippen. Prediction of Physicochemical Parameters by Atomic Contributions. Journal of Chemical Information and Computer Sciences, 39(5):868–873, September 1999.
  • [48] Peter Wills and François G. Meyer. Metrics for graph comparison: A practitioner’s guide. PLoS ONE, 15(2):e0228728, February 2020.
  • [49] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. MoleculeNet: A benchmark for molecular machine learning. Chemical Science, 9(2):513–530, January 2018.
  • [50] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations, September 2018.
  • [51] Carlos Zednik and Hannes Boelsen. Scientific Exploration and Explainable Artificial Intelligence. Minds and Machines, 32(1):219–239, March 2022.