SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks

Xinyu Luo, Kecheng Chen, Pao-Sheng Vincent Sun, Chris XING TIAN,
Arindam Basu, Haoliang Li
Department of Electrical Engineering, City University of Hong Kong

Abstract

Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in terms of energy efficiency, temporal processing, and biological plausibility. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets, including CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and DVS Gesture-C. Furthermore, SPACE demonstrates strong generalization across different model architectures, achieving consistent performance improvements on both VGG9 and ResNet11. Experimental results show that SPACE outperforms state-of-the-art methods, highlighting its effectiveness and robustness in real-world settings.

1 Introduction

Recent advancements in neuroscience-inspired computing have placed Spiking Neural Networks (SNNs) at the forefront of research as a biologically plausible alternative to traditional Artificial Neural Networks (ANNs). While ANNs have achieved remarkable success in various domains, their reliance on dense, black-box architectures often limits their interpretability and energy efficiency. In contrast, SNNs emulate the sparse, event-driven dynamics of biological neurons, providing several advantages in terms of computational efficiency, temporal processing, and explainability [4, 12, 35]. However, the unique features that make SNNs advantageous—such as their reliance on temporal information and sparse spiking—also pose significant challenges in dynamic real-world environments. SNNs encode information dynamically across time, making them highly sensitive to changes in input data distribution [2]. This sensitivity can result in drastic changes to spiking behavior across the network, ultimately degrading performance when faced with distribution shifts. Understanding and enhancing the robustness of SNNs under such conditions is therefore a critical research direction.

In real-world applications, the distribution of the training data (the source domain) often differs from that of the test data (the target domain) because of variations in lighting, background, sensor noise, or environmental conditions. This phenomenon, known as domain shift, can significantly degrade the performance of machine learning models [15, 23]. SNNs are particularly vulnerable to domain shifts, as even minor changes in input data can disrupt the temporal dynamics and spiking behavior in intermediate layers, leading to poor generalization. Traditional test-time adaptation (TTA) solutions, such as retraining or fine-tuning on target-domain data with the help of a source-data-related proxy task [38], statistics of the source data [32], or a batch of test data [39], are often impractical in real-world scenarios because source data or additional test samples are unavailable, particularly in privacy-sensitive or resource-constrained environments. Source-free and single-instance TTA methods [9, 21, 43] offer a promising alternative by enabling the model to adapt to the target domain during inference without access to source data. Single-instance adaptation is typically defined, and also in this work, as one adaptation iteration performed with one test sample.

Among existing source-free and single-instance TTA methods, MEMO [43] and SITA [21] are two representative approaches, but both face fundamental limitations when applied to SNNs. MEMO optimizes the entropy of the averaged prediction across multiple augmented views of a test sample, enforcing output probability consistency. However, this strategy may not be optimal for SNNs. In SNNs, output probability distributions reflect only high-level semantic information and fail to capture the fine-grained spiking dynamics that encode essential features. This prevents MEMO from effectively mitigating domain shifts within SNNs’ internal representations. SITA, on the other hand, adapts batch normalization (BN) statistics by shifting them toward those computed from test-time augmentations, but this approach is also unsuitable for SNNs. Many SNN models lack BN layers [26, 42], which limits the applicability of BN-based adaptation methods such as SITA. Additionally, due to the sparsity of SNN activations, modifying BN statistics does not significantly alter neuronal spiking patterns, further reducing its ability to facilitate effective adaptation.

In this work, we propose a novel method named SPike-Aware Consistency Enhancement (SPACE), which is the first source-free and single-instance TTA approach specifically designed for SNNs. SPACE performs adaptation using only a single test point without access to source data, making it particularly suitable for real-world scenarios where source data are unavailable or privacy-sensitive. By leveraging the inherent spike dynamics of SNNs, SPACE maximizes the consistency in spike-behavior-based local feature maps across augmented samples of the same input, ensuring robust and efficient adaptation in deep SNNs. Unlike existing TTA methods, which often overlook the unique characteristics of SNNs, SPACE directly exploits spike-based representations for adaptation. Our contributions can be summarized as follows:

1. To the best of our knowledge, SPACE is the first TTA method tailored for SNNs, and it operates effectively using only a single test sample. This addresses the unique challenges of adapting SNNs in scenarios where source data are unavailable, while maintaining efficiency and scalability.

2. Our method introduces a consistency-driven optimization framework to enhance the similarity of spike-behavior-based local feature maps across augmented samples. By leveraging spike dynamics, this approach ensures robust adaptation in deep SNNs.

3. To assess the effectiveness and robustness of our proposed method, we perform extensive experiments across multiple benchmarks, including CIFAR-10, CIFAR-100 [24], Tiny-ImageNet (a modified subset of the ImageNet dataset) [6], and the neuromorphic dataset DVS Gesture [1]. Furthermore, we validate the adaptability of our method across different model architectures, specifically testing on VGG9 [37] and ResNet11 [14]. The results demonstrate that our approach not only achieves consistent performance improvements across datasets but also generalizes well to different network structures, showcasing its broad applicability and robustness.

2 Related Work

2.1 SNN-based Deep Learning

Over the years, significant progress has been made in SNN-based deep learning. Many approaches relied primarily on converting pre-trained ANN models into SNNs, enabling SNNs to inherit the representational power of ANNs while benefiting from spike-based computations [3, 8, 17, 20]. However, such conversion methods often suffer from performance degradation due to differences in activation dynamics. To overcome this, direct training of SNNs using surrogate gradient methods has gained traction, enabling end-to-end optimization of spiking models [7, 11, 29, 31, 42]. Rathi et al. [33] proposed a hybrid method that combines conversion with surrogate-gradient-based training, achieving state-of-the-art performance. These advancements have not only addressed the challenges posed by the non-differentiable nature of spike functions and the inherent complexity of temporal dynamics, but also led to the development of sophisticated SNN architectures, such as spiking convolutional networks [26, 40] and spiking recurrent networks [41, 44], which have demonstrated promising results in tasks like image classification, speech recognition, and event-driven sensor data analysis.

A critical limitation of current SNN-based deep learning approaches is their poor robustness to testing-time variations. Unlike ANNs, which operate on continuous activations and can leverage well-established testing-time adaptation techniques, SNNs are highly sensitive to input perturbations due to their dependence on precise spike timing and temporal dynamics. According to our experimental results presented in Sec. 4, existing SNN models, which are typically trained offline, lack mechanisms to dynamically adapt to changing environments or incoming data distributions at testing time. Addressing these shortcomings is essential for deploying SNNs in practical applications, particularly those requiring real-time adaptability.

2.2 Test-time Adaptation

Test-time adaptation (TTA) is a rapidly growing area in machine learning that tackles the challenge of adapting pre-trained models to unseen or shifted distributions during inference [19, 30, 38, 39]. Test-time training (TTT) [38] performs online updates to model parameters using a self-supervised proxy task on the test data. However, because the proxy task is tied to the source data, such methods are impractical in scenarios where the source domain is unavailable during deployment due to privacy or storage constraints. To overcome source dependency, source-free TTA methods have been proposed, which perform adaptation using only test-time inputs. TENT [39] utilizes a straightforward yet effective entropy minimization approach to optimize batch normalization parameters during test time, without relying on any proxy task during training. Lee et al. [27] employ pseudo-labeling to adjust model predictions.

Although these approaches eliminate the need for source data, they often assume batch-level adaptation, relying on test-time statistics computed over multiple samples. This assumption limits their applicability in real-world settings where only single test points are available at a time. MEMO [43] addresses this scenario by minimizing the entropy of the marginal output distribution over augmentations of the single test point. The objective in MEMO encourages the model to produce consistent predictions across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. Another TTA method, SITA [21], adapts batch normalization statistics at inference by shifting them toward those computed from a batch of augmented versions of the single test instance.

MEMO has shown that enforcing consistency among augmented samples improves TTA performance. However, in deep SNNs, where information is transmitted through discrete spikes, backpropagation tied to output objectives suffers from temporal information loss, making effective TTA challenging. Moreover, MEMO fails to capture the domain shift-induced variations in the internal behavior of the SNN, such as changes in neuron activations and spike patterns, which are critical for robust adaptation in SNNs. Although SITA avoids backpropagation, it also struggles to accurately model the intrinsic temporal dynamics of SNNs, limiting its ability to align augmented samples effectively. To overcome these issues, we enforce consistency by directly aligning the spike patterns and behaviors of augmented samples within the SNN framework. Unlike BN-dependent methods like SITA, our approach works regardless of the presence of BN layers, ensuring broader compatibility with various SNN architectures.

3 Proposed Method

3.1 Preliminary

SNNs emulate biological neurons by transmitting and encoding information through discrete spikes. Among various neuron models, the Leaky Integrate-and-Fire (LIF) neuron [5] is widely adopted for its balance between biological realism and computational efficiency. Its membrane potential dynamics follow

\tau_{m}\frac{dU(t)}{dt}=-U(t)+RI(t),

(1)

where $U(t)$ represents the membrane potential of the neuron at time $t$ , $\tau_{m}$ is the membrane time constant, $R$ denotes the input resistance and $I(t)$ is the input current received from pre-synaptic neurons or inputs. When $U(t)$ exceeds a predefined threshold $U_{th}$ , the neuron emits a spike and its membrane potential is reset to a resting value, typically 0 or $U(t)-U_{th}$ . Following [11], Eq. 1 is discretized as

u^{t}_{i}=\left(1-\frac{1}{\tau_{m}}\right)u_{i}^{t-1}+\frac{1}{\tau_{m}}\sum_{j}w_{ij}o_{j}^{t}.

(2)

Here, $j$ is the index of pre-synaptic neurons, $o_{j}$ is the binary spike activation, and $w_{ij}$ stands for weight connections between pre- and post-neurons. Fig. 1 illustrates the LIF neuron model, depicting membrane potential evolution and spike generation in response to input spikes [10].
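As a rough illustration of Eq. 2, the following sketch simulates a single discrete-time LIF neuron in plain Python. The function name `lif_forward`, the default values of `tau_m` and `u_th`, and the soft reset by threshold subtraction (as depicted in Fig. 1) are assumptions made for this example, not details taken from the authors' implementation.

```python
def lif_forward(input_spikes, weights, tau_m=2.0, u_th=1.0):
    """Simulate one LIF neuron following Eq. 2.

    input_spikes: list over time steps; each entry is a list of binary
                  pre-synaptic spikes o_j^t.
    weights:      list of synaptic weights w_ij for this neuron.
    Returns (spikes, potentials) recorded at every time step.
    """
    u = 0.0
    spikes, potentials = [], []
    for o_t in input_spikes:
        current = sum(w * o for w, o in zip(weights, o_t))
        # Eq. 2: leaky integration of the weighted input spikes.
        u = (1.0 - 1.0 / tau_m) * u + (1.0 / tau_m) * current
        potentials.append(u)
        fired = 1 if u >= u_th else 0
        spikes.append(fired)
        if fired:
            u -= u_th  # assumed soft reset: subtract the threshold
    return spikes, potentials


# Two pre-synaptic neurons that both fire at every step.
spikes, potentials = lif_forward([[1, 1]] * 4, weights=[0.8, 0.7])
print(spikes)  # [0, 1, 0, 1]
```

With these illustrative values the neuron needs two steps of integration to cross the threshold, which is why it fires on every second step.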

The spiking mechanism in SNNs introduces non-differentiability, making it difficult to apply standard gradient-based optimization methods for training or test-time adaptation. To address this, surrogate gradient methods approximate the non-differentiable spike function with a smooth surrogate during the backward pass. A simple approach is the shifted Heaviside step function, where the gradient of $o_{j}$ , the spiking activation function of neuron $j$ , with respect to $U$ is defined as

\frac{\partial o_{j}}{\partial U}\triangleq\begin{cases}1,&\text{if }U\geq U_{th},\\ 0,&\text{if }U<U_{th}.\end{cases}

(3)

While this is not an exact solution, it is valid because a reset occurs after a spike is generated when $U\geq U_{th}$ .
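The split between the forward spike function and the surrogate used in the backward pass (Eq. 3) can be sketched as follows. The helper names and the single-step chain rule in `grad_wrt_weight`, which ignores the recurrent dependence on $u_{i}^{t-1}$, are simplifying assumptions for illustration only.

```python
def spike(u, u_th=1.0):
    """Forward pass: non-differentiable Heaviside spike function."""
    return 1 if u >= u_th else 0


def spike_surrogate_grad(u, u_th=1.0):
    """Backward pass: surrogate for d(spike)/dU as defined in Eq. 3."""
    return 1.0 if u >= u_th else 0.0


def grad_wrt_weight(u, pre_spike, tau_m=2.0, u_th=1.0):
    """Single-step chain rule: d o_i / d w_ij = (d o_i / d u_i) * (d u_i / d w_ij).

    From Eq. 2, d u_i^t / d w_ij = o_j^t / tau_m when the dependence on
    u_i^{t-1} is ignored (a simplification made for this example).
    """
    return spike_surrogate_grad(u, u_th) * pre_spike / tau_m


print(grad_wrt_weight(1.2, 1))  # 0.5
print(grad_wrt_weight(0.4, 1))  # 0.0
```

Under this surrogate, only neurons at or above the threshold pass a gradient back to their incoming weights.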

These foundational techniques for SNN training, including LIF dynamics and surrogate gradient approximations, form the basis for our proposed method, which focuses on enhancing TTA by ensuring consistency in spiking behavior across augmented samples.

Figure 1: Illustration of spike and membrane potential dynamics in LIF neurons. When the membrane potential reaches the threshold, it is reset by subtracting the threshold value, triggering the neuron to fire a spike.

3.2 SPike-Aware Consistency Enhancement

Algorithm Design Overview. SPACE (spike-aware consistency enhancement) is designed to improve the test-time robustness of SNNs, as shown in Fig. 2. We first generate an augmented batch from a single test sample using various augmentation techniques, introducing diversity while retaining the core characteristics of the original sample. This augmented batch is passed through the model to obtain local feature maps, represented by spike counts over time. Next, the model is adapted by maximizing the similarity across the feature maps of the augmented samples, promoting consistency in feature representations. Finally, the adapted model predicts the label of the original test sample, ensuring robust performance under test-time conditions. This approach enhances the model’s ability to generalize to unseen test samples, particularly in shifted domains. The overall method is presented in Algorithm 1, and the details are introduced below. The code will be made available after acceptance.

Algorithm 1 SPACE for Test-Time Adaptation in SNNs

1: Input: Test sample $\mathbf{x}$, augmentation functions $\mathcal{A}=\{a_{1},\dots,a_{N}\}$, SNN model with parameters $\theta$, and learning rate $\eta$
2: Output: Prediction $y^{*}$ for $\mathbf{x}$ using the adapted model
3: Step 1: Generate augmented batch
4: Initialize augmented batch $\mathcal{B}=\emptyset$
5: for each $a_{k}$ in subset $\{a_{1},\dots,a_{M}\}\subseteq\mathcal{A}$
6:   Apply augmentation: $\mathbf{x}_{k}=a_{k}(\mathbf{x})$
7:   Add $\mathbf{x}_{k}$ to $\mathcal{B}$
8: end for
9: Step 2: Extract Feature Maps
10: Pass $\mathcal{B}$ through the SNN feature extractor
11: Obtain feature maps:
12:   $\mathcal{F}=\{\mathbf{F}(\mathbf{x}_{1}),\mathbf{F}(\mathbf{x}_{2}),\dots,\mathbf{F}(\mathbf{x}_{M})\}$
13: Step 3: Compute Similarity
14: for each pair $(\mathbf{F}(\mathbf{x}_{i}),\mathbf{F}(\mathbf{x}_{j}))$ in $\mathcal{F}$
15:   Compute similarity $\bar{\mathcal{S}}(i,j|\mathbf{x})$ using Eq. 4 to Eq. 6
16: end for
17: Step 4: Compute loss and update model
18: Compute loss $\mathcal{L}(\theta;\mathbf{x})$ using Eq. 7
19: Update parameters using Eq. 8
20: Step 5: Predict via adapted model
21: Predict via: $y^{*}=\arg\max_{y}p_{\theta^{*}}(y|\mathbf{x})$
22: Return: Prediction result $y^{*}$

Spike-Aware Feature Maps. SPACE aims to adapt pre-trained SNN-based models $M_{\theta}$ (with parameters $\theta\in\Theta$ ) during inference to improve performance under distribution shifts, without requiring ground truth labels or access to source data. No specific training process is required, nor do we impose particular constraints on the model. The only assumptions are that $\theta$ can be adjusted and that the model produces intermediate feature maps $\mathbf{F}(\mathbf{x})$ , which are differentiable with respect to $\theta$ and can be utilized for further analysis and adaptation. Here, $\mathbf{x}\in\mathcal{X}$ is a single test point that is presented to $M_{\theta}$ . It is worth noting that this is a reasonable assumption, supported by a recent advance [34] that suggests gradient back-propagation can be implemented on spiking neuromorphic hardware. Aligned with previous work [21, 43], we achieve test-time robustness by leveraging data augmentations and a self-supervised adaptation objective. In our work, a set of augmentation functions with different intensities selected from $\mathcal{A}\triangleq\{a_{1},\dots,a_{N}\}$ is applied to a single test input $\mathbf{x}$ , forming an augmented batch of samples $\mathcal{B}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{M}\},M\leq N$ . Using this batch, we align the spike dynamics extracted from augmentations, ensuring that the model produces reliable and robust predictions in the presence of distribution shifts.
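The construction of the augmented batch $\mathcal{B}$ from one test input can be sketched as below. The three augmentation functions (`hflip`, `brighten`, `identity`) are simple stand-ins chosen for this example and are not the augmentations used in the experiments; the function name `make_augmented_batch` and the fixed `seed` are likewise hypothetical.

```python
import random


def hflip(img):
    """Mirror each row of a 2-D image given as a list of rows."""
    return [row[::-1] for row in img]


def brighten(img, delta=0.1):
    """Add a constant offset and clip pixel values to [0, 1]."""
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]


def identity(img):
    """Return an unchanged copy of the image."""
    return [row[:] for row in img]


def make_augmented_batch(x, augmentations, m, seed=0):
    """Apply m augmentations drawn without replacement (M <= N) to one sample."""
    if m > len(augmentations):
        raise ValueError("m must not exceed the number of augmentations")
    rng = random.Random(seed)
    chosen = rng.sample(augmentations, m)
    return [aug(x) for aug in chosen]


x = [[0.0, 0.5], [1.0, 0.25]]
batch = make_augmented_batch(x, [hflip, brighten, identity], m=2)
print(len(batch))  # 2
```

Every element of `batch` is a view of the same input, so any disagreement between their feature maps comes from the augmentations alone.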

Different from MEMO [43], which adapts the model during test time via the conditional output distribution $p_{\theta}(y|\mathbf{x})$ of the model’s last layer, SPACE leverages the dynamic behavior of intermediate layers in the SNN, which plays a crucial role in determining the temporal information and semantic representation of the final output. SNNs encode information through temporal dynamics and spike sparsity. These behavioral characteristics are particularly prominent in the intermediate layers and can be represented by feature maps $\mathbf{F}(\mathbf{x})$ . Unlike in ANNs, these characteristics in SNNs cannot be fully captured by the output probability distribution alone. Another advantage of intermediate feature maps is their high sensitivity to changes in the input distribution, which stems from the event-driven nature of SNNs. When the input data undergoes a distribution shift, the spiking patterns in the intermediate layers may change significantly, thereby impacting the final output. Optimizing the consistency of spiking behavior in intermediate layers can effectively mitigate feature variations caused by distribution shifts, thereby enhancing the robustness of models.

To leverage the feature maps from the intermediate layers of the SNN, we focus on the spike activity over the entire time window. As described in Sec. 3.1, each neuron $j$ in an SNN can either fire a spike (outputting 1) or remain silent (outputting 0) at each time step $t$ , represented by the binary function $o_{j}^{t}$ . The temporal spike patterns of the network are captured in the feature map $\mathbf{O}(\mathbf{x}_{i})\in\mathbb{R}^{T\times C\times H\times W}$ , where $T$ denotes the number of time steps, $C$ represents the number of channels, and $H$ and $W$ are the spatial dimensions of the local feature maps. $\mathbf{x}_{i}$ refers to the $i$ -th sample in the augmented batch. Our primary objective is to use the total spike counts of the feature maps from either the last layer or the penultimate layer of the feature extraction network (typically a CNN [25]), depending on whether the final feature map is pooled into a $1\times 1$ matrix. Let $\mathcal{F}=\{\mathbf{F}(\mathbf{x}_{1}),\mathbf{F}(\mathbf{x}_{2}),\dots,\mathbf{F}(\mathbf{x}_{M})\}$ denote the collection of feature maps, where each $\mathbf{F}(\mathbf{x}_{i})\in\mathbb{R}^{C\times D}$ represents the total spike counts of neurons across different channels, with $D=H\times W$ indicating the spatial dimensionality.

There are several reasons for choosing the total spike counts as the observation metric. First, SNNs typically operate over a time window ranging from 25 to thousands of time steps [4, 13, 26, 33, 36], meaning that spike patterns across many time steps may be identical due to the repetitive nature of the spike dynamics. By comparing the cumulative spike counts instead of individual time-step patterns, we can effectively capture the temporal accumulation of spike activity while reducing redundancy. Second, the total spike counts reflect differences between augmented samples based on their overall spike dynamics, which is essential for our goal of aligning feature maps across augmentations. Finally, this representation significantly reduces the computational cost, as it condenses the temporal information into an aggregate value without requiring a step-by-step comparison of spike patterns.
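The reduction from the spike tensor $\mathbf{O}(\mathbf{x}_{i})$ to the spike-count map $\mathbf{F}(\mathbf{x}_{i})$ amounts to a sum over the time axis followed by flattening the spatial axes. The sketch below shows this with nested Python lists; the function name `spike_count_features` and the tiny example tensor are made up for illustration.

```python
def spike_count_features(o):
    """Collapse a binary spike tensor O[t][c][h][w] into F[c][d].

    Each entry of F is the total number of spikes emitted by one neuron
    over all T time steps; spatial positions are flattened so D = H * W.
    """
    T = len(o)
    C, H, W = len(o[0]), len(o[0][0]), len(o[0][0][0])
    return [
        [sum(o[t][c][h][w] for t in range(T)) for h in range(H) for w in range(W)]
        for c in range(C)
    ]


# T = 3 time steps, C = 1 channel, 2 x 2 spatial map.
o = [
    [[[1, 0], [0, 1]]],
    [[[1, 0], [1, 1]]],
    [[[0, 0], [1, 1]]],
]
print(spike_count_features(o))  # [[2, 0, 2, 3]]
```

The result has $C\times D$ entries regardless of $T$, which is the source of the computational saving described above.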

Feature Maps Alignment. The primary objective is to encourage the model to focus on invariant features that are crucial for classification, rather than being influenced by noise or minor transformations introduced by augmentations or by distribution shifts relative to the training data. When the feature maps of augmented samples are highly similar, the model becomes less sensitive to variations at test time, thereby improving its ability to generalize to new, unseen, and unlabeled data. To measure the similarity between feature representations of the $i$ -th and $j$ -th samples, we employ channel-wise similarity between corresponding local feature vectors, $\mathbf{F}_{c}(\mathbf{x}_{i})$ and $\mathbf{F}_{c}(\mathbf{x}_{j})$ . Specifically, we first normalize local feature vectors using a softmax function along the spatial dimensions, ensuring that the features are transformed into a probability distribution:

\mathbf{P}_{c}(\mathbf{x}_{i})=\textit{softmax}(\mathbf{F}_{c}(\mathbf{x}_{i})).

(4)

Here, $\mathbf{F}_{c}(\mathbf{x}_{i})$ represents the local feature vector corresponding to channel $c$ of the augmented sample $\mathbf{x}_{i}$ . This normalization highlights important spatial locations by amplifying prominent feature values and suppressing noise, enhancing feature distribution stability while enabling the subsequent similarity computation to capture both spatial and temporal characteristics effectively. To quantify the similarity between $\mathbf{F}(\mathbf{x}_{i})$ and $\mathbf{F}(\mathbf{x}_{j})$ , we compute the average of the channel-wise weighted inner products, defined as

\mathcal{S}_{c}(i,j|\mathbf{x})=\sum_{d=1}^{D}\mathbf{P}_{c,d}(\mathbf{x}_{i})\cdot\mathbf{P}_{c,d}(\mathbf{x}_{j}),

(5)

\bar{\mathcal{S}}(i,j|\mathbf{x})=\frac{1}{C}\sum_{c=1}^{C}\mathcal{S}_{c}(i,j|\mathbf{x}).

(6)

The similarity measures the degree of overlap between the feature distributions, with higher values indicating greater consistency in spike-based representations across augmentations. Based on this similarity measure, we define the loss function for model adaptation given as

\mathcal{L}(\theta;\mathbf{x})\triangleq\sum_{1\leq j<i\leq M}\left(1-\bar{\mathcal{S}}(i,j|\mathbf{x})\right).

(7)
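Eq. 4 to Eq. 7 can be put together in a few lines of plain Python. This is a minimal sketch operating on spike-count maps given as nested lists (one list of $D$ counts per channel); the function names and the toy `views` are assumptions for this example, and a practical implementation would use a tensor library with automatic differentiation.

```python
import math
from itertools import combinations


def softmax(v):
    """Numerically stable softmax over one channel's D spatial positions (Eq. 4)."""
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def mean_channel_similarity(f_i, f_j):
    """Eq. 5 and Eq. 6: inner product per channel, averaged over C channels."""
    sims = []
    for fc_i, fc_j in zip(f_i, f_j):
        p_i, p_j = softmax(fc_i), softmax(fc_j)
        sims.append(sum(a * b for a, b in zip(p_i, p_j)))
    return sum(sims) / len(sims)


def space_loss(feature_maps):
    """Eq. 7: sum of (1 - similarity) over all unordered pairs of views."""
    return sum(
        1.0 - mean_channel_similarity(f_i, f_j)
        for f_i, f_j in combinations(feature_maps, 2)
    )


# Three augmented views, each with C = 2 channels and D = 3 positions.
views = [
    [[4, 0, 0], [0, 3, 1]],
    [[4, 0, 0], [0, 3, 1]],
    [[0, 0, 4], [3, 0, 1]],
]
aligned = space_loss(views[:2])
mixed = space_loss(views)
print(aligned < mixed)  # True
```

The third view concentrates its spikes at different spatial positions, so adding it to the batch raises the loss.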

Then we adapt the model via gradient descent for only one iteration given as

\theta^{*}\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}(\theta;\mathbf{x}),

(8)

where $\eta$ is the learning rate. By minimizing this loss function, we enforce consistency across the feature maps of different augmentations. This reduces the variability induced by augmentations, ensuring that the model learns stable and robust feature representations. Furthermore, by promoting feature consistency, this approach serves as a regularization mechanism, preventing over-fitting to specific augmentations and encouraging the model to focus on more fundamental and generalizable features.
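To make the update in Eq. 8 more concrete, the gradient of the channel-wise similarity with respect to the spike counts of one view can be worked out from Eq. 4 and Eq. 5. The derivation below is a sketch that treats the spike counts as continuous quantities and holds the other view $\mathbf{x}_{j}$ fixed; $\delta_{dk}$ (the Kronecker delta) is notation introduced here. How these gradients reach $\theta$ through the spiking non-linearity depends on the surrogate gradient of Eq. 3 and is not shown.

```latex
\frac{\partial \mathbf{P}_{c,d}(\mathbf{x}_{i})}{\partial \mathbf{F}_{c,k}(\mathbf{x}_{i})}
  = \mathbf{P}_{c,d}(\mathbf{x}_{i})\left(\delta_{dk}-\mathbf{P}_{c,k}(\mathbf{x}_{i})\right),

\frac{\partial \mathcal{S}_{c}(i,j|\mathbf{x})}{\partial \mathbf{F}_{c,k}(\mathbf{x}_{i})}
  = \sum_{d=1}^{D}\mathbf{P}_{c,d}(\mathbf{x}_{j})\,\mathbf{P}_{c,d}(\mathbf{x}_{i})\left(\delta_{dk}-\mathbf{P}_{c,k}(\mathbf{x}_{i})\right)
  = \mathbf{P}_{c,k}(\mathbf{x}_{i})\left(\mathbf{P}_{c,k}(\mathbf{x}_{j})-\mathcal{S}_{c}(i,j|\mathbf{x})\right).
```

Under these assumptions, a descent step on Eq. 7 pushes the spike count up at positions $k$ where the other view assigns more probability than the current overlap $\mathcal{S}_{c}(i,j|\mathbf{x})$, and pushes it down elsewhere.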

While the proposed similarity measure effectively captures feature distribution overlap, we further investigated whether incorporating higher-dimensional feature relationships could enhance the consistency measure. Specifically, we experimented with a kernel embedding approach combined with Maximum Mean Discrepancy (MMD) (see Sec. 6 in Supplementary Material). However, this method did not yield noticeable performance improvements and introduced additional computational overhead, suggesting that the original similarity measure is already sufficient for our test-time adaptation.

Finally, we predict the label of the original test sample using the updated model with parameters $\theta^{*}$ .

Discussion. In SNNs, each local feature map captures distinct aspects of the input data and exhibits different spike patterns, both spatially and temporally. Since channels may have different active neurons, directly comparing the entire feature map fails to account for this variability. Instead, computing the similarity per channel and averaging the results focuses on the stability of feature representations within each channel. This isolates the contributions of active neurons, ensuring the similarity measure reflects meaningful changes rather than noise. It also minimizes the impact of irregular spike behavior or augmentation-induced noise in individual channels. Our experiments in Sec. 4.2 show that directly comparing entire feature maps leads to ineffective TTA, reinforcing the importance of the per-channel similarity approach. Given that each channel learns distinct features, averaging the similarity within each channel provides a more reliable measure of feature map consistency, improving TTA performance.

4 Experimental Results

Datasets. The experiments were conducted on three benchmark datasets: CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C [15], each containing a variety of common image corruptions. CIFAR-10-C includes 18 corruption types, while CIFAR-100-C features 19 types, covering weather, noise, and digital distortions. CIFAR-10-C and CIFAR-100-C consist of 10 and 100 classes of 32 $\times$ 32 color images, respectively, matching CIFAR-10 and CIFAR-100 [24]. Tiny-ImageNet-C, derived from the Tiny-ImageNet dataset, contains 15 corruption types and 200 classes of 64 $\times$ 64 color images. In addition, we also considered a neuromorphic dataset, DVS Gesture-C [18], and the related experiments are provided in the Supplementary Material Sec. 7.

Experimental Setup. All experiments were conducted under the highest corruption level (level=5). Following the experimental setup of MEMO [43], we applied AugMix [16] augmentation with a batch size of 32 to each test sample to generate a diverse set of augmented samples. The backbone model for most experiments is an SNN-based VGG architecture with Batch-Normalization Through Time (BNTT) [22], featuring BN layers at each time step, with 25 inference steps. To demonstrate the general applicability of the proposed SPACE method, additional experiments were conducted using an SNN-based ResNet-11 model [26] on the CIFAR-10-C dataset, with inference steps set to 30.

Baseline TTA Methods. We compared our approach with models pre-trained on clean datasets without any test-time adaptation (referred to as w.o. TTA). Moreover, since few TTA methods are specifically designed for SNNs, we selected the two most representative traditional TTA methods, MEMO [43] and SITA [21], as our baselines.

4.1 Main Results

Shifted Domain		w.o. TTA	SITA	MEMO	SPACE
noise	gaussian	72.38%	73.06%	77.73%	77.98%
	shot	74.70%	74.15%	79.50%	79.34%
	speckle	71.75%	71.67%	77.28%	77.38%
	impulse	58.57%	58.41%	65.74%	69.41%
blur	defocus	63.05%	62.94%	65.61%	71.59%
	gaussian	57.03%	57.36%	59.93%	68.31%
	motion	64.44%	64.36%	66.24%	72.14%
	zoom	71.33%	70.72%	72.61%	74.67%
weather	snow	76.32%	76.67%	77.38%	78.43%
	fog	43.57%	43.40%	45.47%	52.80%
	frost	75.72%	75.24%	78.64%	79.59%
digital	brightness	82.44%	82.51%	83.05%	83.22%
	contrast	22.54%	22.02%	23.83%	23.85%
	elastic_transform	75.01%	74.74%	76.24%	75.49%
	pixelate	70.25%	69.62%	73.25%	76.24%
	jpeg_compression	84.28%	84.16%	86.05%	82.88%
	spatter	78.42%	78.61%	79.75%	77.52%
	saturate	80.85%	80.97%	82.03%	82.54%
Average		67.93%	67.81%	70.53%	72.41%

Table 1: Performance comparison of different methods on CIFAR-10-C dataset at level 5 corruption, using the VGG9 architecture.

In this section, we compare the performance of the proposed SPACE method with two existing approaches: MEMO [43] and SITA [21], which are current state-of-the-art methods for source-free and single-instance TTA. While MEMO improves TTA performance by enforcing consistency among augmented samples, it faces significant challenges in SNNs due to the loss of temporal information when relying on backpropagation. Similarly, SITA, despite not using backpropagation, fails to capture the intrinsic temporal dynamics of SNNs, rendering it ineffective. To demonstrate the broad applicability of the proposed method, we evaluate all methods on CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C using the VGG architecture, and on CIFAR-10-C with the ResNet-11 architecture.

4.1.1 Performance Comparison on VGG Model

Shifted Domain		w.o. TTA	SITA	MEMO	SPACE
noise	gaussian	42.51%	43.01%	43.46%	44.71%
	shot	45.40%	44.83%	45.41%	46.73%
	speckle	43.35%	43.14%	43.53%	44.84%
	impulse	25.50%	25.48%	26.07%	27.99%
blur	defocus	42.15%	42.23%	42.47%	43.35%
	gaussian	37.76%	37.88%	38.29%	39.19%
	motion	41.87%	42.11%	42.55%	43.46%
	zoom	46.77%	46.93%	47.21%	48.42%
	glass	42.84%	43.29%	43.90%	44.47%
weather	snow	44.88%	44.49%	44.74%	45.70%
	fog	16.51%	16.65%	16.76%	17.62%
	frost	44.12%	43.84%	44.34%	44.98%
digital	brightness	49.36%	49.53%	49.95%	50.41%
	contrast	5.42%	5.56%	5.62%	5.69%
	elastic_transform	51.33%	50.99%	51.24%	51.45%
	pixelate	52.35%	52.48%	52.82%	53.81%
	jpeg_compression	58.47%	58.71%	59.29%	58.96%
	spatter	47.76%	47.82%	48.63%	48.95%
	saturate	46.91%	47.17%	47.17%	48.22%
Average		41.33%	41.38%	41.76%	42.58%

Table 2: Performance comparison of different methods on CIFAR-100-C dataset at level 5 corruption, using the VGG11 architecture.

Shifted Domain		w.o. TTA	SITA	MEMO	SPACE
noise	gaussian	12.43%	12.71%	13.64%	16.71%
	shot	15.20%	15.17%	16.45%	19.44%
	impulse	8.74%	8.87%	9.51%	11.72%
blur	defocus	7.58%	7.56%	7.51%	9.62%
	motion	14.48%	14.57%	14.38%	16.57%
	zoom	13.59%	13.94%	13.92%	15.61%
	glass	6.50%	6.57%	6.60%	7.31%
weather	snow	15.58%	15.72%	15.97%	18.66%
	fog	5.27%	5.32%	4.91%	6.01%
	frost	18.06%	18.05%	17.83%	21.02%
digital	brightness	15.01%	14.82%	14.74%	17.53%
	contrast	1.32%	1.38%	1.44%	1.51%
	elastic_transform	20.53%	21.45%	21.00%	21.96%
	pixelate	31.62%	32.20%	31.39%	31.89%
	jpeg_compression	31.15%	31.51%	30.82%	31.63%
Average		14.47%	14.66%	14.67%	16.48%

Table 3: Performance comparison of different methods on Tiny-ImageNet-C dataset at level 5 corruption, using the VGG11 architecture.

Shifted Domain		w.o. TTA	MEMO	SPACE
noise	gaussian	72.88%	74.21%	75.90%
	shot	74.49%	75.50%	77.81%
	speckle	73.12%	74.32%	76.75%
	impulse	49.59%	51.35%	56.85%
blur	defocus	67.77%	68.11%	68.66%
	gaussian	63.57%	64.11%	64.15%
	motion	65.14%	65.37%	66.02%
	zoom	73.06%	73.96%	73.82%
weather	snow	78.11%	78.32%	78.74%
	fog	42.84%	43.47%	45.47%
	frost	73.12%	73.39%	74.04%
digital	brightness	81.71%	81.82%	81.28%
	contrast	12.72%	13.63%	14.94%
	elastic_transform	74.88%	75.54%	76.79%
	pixelate	77.08%	77.54%	80.32%
	jpeg_compression	82.47%	83.24%	83.96%
	spatter	78.75%	79.46%	80.58%
	saturate	79.40%	79.40%	78.70%
Average		67.82%	68.49%	69.71%

Table 4: Performance comparison of different methods on CIFAR-10-C dataset at level 5 corruption, using the ResNet11 architecture.

The results on CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C at level 5 corruption, shown in Tab. 1, Tab. 2, and Tab. 3, demonstrate the effectiveness of SPACE. SPACE achieves the highest average accuracy on all three datasets, ahead of w.o. TTA, SITA, and MEMO. On CIFAR-10-C, SPACE achieves 72.41%, an improvement of 4.48 percentage points over w.o. TTA (67.93%), 4.60 points over SITA (67.81%), and 1.88 points over MEMO (70.53%). On CIFAR-100-C, SPACE reaches 42.58%, surpassing w.o. TTA (41.33%), SITA (41.38%), and MEMO (41.76%). On Tiny-ImageNet-C, SPACE achieves 16.48%, compared to 14.47% for w.o. TTA, 14.66% for SITA, and 14.67% for MEMO. Notably, SPACE improves over the w.o. TTA baseline on every corruption type of Tiny-ImageNet-C, showing its robustness on complex datasets.

Across individual corruption types, SPACE outperforms w.o. TTA, MEMO, and SITA in most cases. For instance, on CIFAR-10-C, SPACE shows notable improvements on impulse noise (69.41% for SPACE vs. 58.57% for w.o. TTA, 58.41% for SITA, and 65.74% for MEMO) and fog (52.80% for SPACE vs. 43.57% for w.o. TTA, 43.40% for SITA, and 45.47% for MEMO). On CIFAR-100-C, SPACE similarly excels, for example on motion blur (43.46% for SPACE vs. 41.87% for w.o. TTA, 42.11% for SITA, and 42.55% for MEMO).

A particularly striking result is SPACE’s performance on Tiny-ImageNet-C, where it achieves the best accuracy on 14 of the 15 corruption types; the only exception is pixelate, where SITA is slightly higher (32.20% vs. 31.89%). For example, on gaussian noise (16.71% vs. 12.43% for w.o. TTA) and impulse noise (11.72% vs. 8.74%), SPACE leads by a clear margin. Even on corruptions such as brightness and pixelate, SPACE outperforms w.o. TTA, demonstrating its robustness on complex datasets like Tiny-ImageNet-C.

Although SPACE generally outperforms MEMO and SITA, MEMO does better in a few cases, such as jpeg_compression on CIFAR-10-C (86.05% for MEMO vs. 82.88% for SPACE) and on CIFAR-100-C (59.29% for MEMO vs. 58.96% for SPACE). These isolated cases suggest MEMO has an edge on specific corruptions, but they do not change SPACE’s advantage in average accuracy.

4.1.2 Performance Comparison on ResNet Model

We evaluated SPACE on the CIFAR-10-C dataset using the ResNet11 architecture, which lacks BN layers, making SITA inapplicable. The results, shown in Tab. 4, demonstrate SPACE’s effectiveness across different model structures. SPACE achieves an average accuracy of 69.71%, an improvement of 1.22 percentage points over MEMO (68.49%) and 1.89 points over w.o. TTA (67.82%).

In terms of individual corruption types, SPACE outperforms MEMO and w.o. TTA in most cases. Notable improvements include challenging corruptions such as impulse noise (56.85% vs. 49.59% for w.o. TTA and 51.35% for MEMO) and contrast (14.94% vs. 12.72% for w.o. TTA and 13.63% for MEMO). SPACE also achieves 80.32% on pixelate, outperforming MEMO (77.54%) and w.o. TTA (77.08%). These trends align with those observed in the VGG-based experiments, reaffirming SPACE’s consistency and generalizability.

While MEMO marginally outperforms SPACE on zoom (73.96% vs. 73.82%), brightness (81.82% vs. 81.28%), and saturate (79.40% vs. 78.70%), these differences are small (at most 0.70 percentage points) and do not overshadow SPACE’s overall advantages. The results further confirm that SPACE is not dependent on a specific architecture and can adapt effectively to various backbone models.

4.2 Ablation Study

Local vs. Global Feature Map Similarity. We compared the performance of local versus global feature map similarity for our proposed method. The results in Fig. 3 clearly illustrate the advantage of using similarity of local feature maps over global feature maps. While the global approach does not account for variability in neuron activations across channels, the local method isolates the contributions of active neurons within each channel, offering a more stable and meaningful measure of feature map consistency. As shown in Fig. 3, the local method yields a substantial accuracy improvement, whereas the global approach negatively impacts performance across all three datasets. This demonstrates that emphasizing individual channel stability leads to better generalization and adaptation. These findings underscore the distinct roles of each channel in SNNs and highlight the limitations of comparing entire feature maps.
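
As a rough illustration of the distinction drawn above, the sketch below contrasts a channel-wise (local) similarity with a whole-map (global) one on spike-count feature maps. It does not reproduce Eq. 4 to Eq. 6: the use of cosine similarity, the per-channel normalisation, and all function names are assumptions made for illustration only.

```python
import numpy as np

def spike_counts(spikes):
    """Sum binary spikes over time: (T, C, H, W) -> (C, H, W)."""
    return spikes.sum(axis=0).astype(np.float64)

def local_similarity(counts_a, counts_b, eps=1e-8):
    """Mean over channels of the cosine similarity between per-channel
    spike-count distributions (assumed form, not the paper's equations)."""
    C = counts_a.shape[0]
    a = counts_a.reshape(C, -1)
    b = counts_b.reshape(C, -1)
    # Normalise each channel separately so every channel carries equal weight.
    a = a / (a.sum(axis=1, keepdims=True) + eps)
    b = b / (b.sum(axis=1, keepdims=True) + eps)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float((num / den).mean())

def global_similarity(counts_a, counts_b, eps=1e-8):
    """Cosine similarity of the flattened maps; busy channels dominate."""
    a = counts_a.ravel()
    b = counts_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Toy example: channel 0 fires densely and identically in both views,
# channel 1 fires once but at different locations in the two views.
T, C, H, W = 10, 2, 4, 4
view_a = np.zeros((T, C, H, W), dtype=np.uint8)
view_b = np.zeros((T, C, H, W), dtype=np.uint8)
view_a[:, 0] = 1
view_b[:, 0] = 1
view_a[0, 1, 0, 0] = 1
view_b[0, 1, 3, 3] = 1

loc = local_similarity(spike_counts(view_a), spike_counts(view_b))
glo = global_similarity(spike_counts(view_a), spike_counts(view_b))
print(f"local = {loc:.3f}, global = {glo:.3f}")
```

In this toy case the global score stays close to 1 because the dense channel dominates the flattened vector, while the local score drops to about 0.5 and exposes the disagreement in the sparse channel.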

Effect of SPACE on Feature Map Similarity. To further investigate the impact of SPACE adaptation on feature map similarity, we analyzed the distribution of similarity values (computed using Eq. 4 to Eq. 6) across the CIFAR-10-C dataset. As shown in Fig. 4, we observe a noticeable increase in the similarity between the original samples and their augmented counterparts after applying SPACE adaptation, particularly in the Gaussian Blur and Impulse Noise domains. The similarity distributions of the spike-count-based feature maps in these domains indicate that feature map consistency improves following the adaptation process. Specifically, the “After SPACE Adaptation” distributions shift towards higher similarity values, reflecting the enhanced alignment of feature maps across augmented samples. This demonstrates the effectiveness of SPACE adaptation in increasing the stability of feature representations and improving generalization under various augmentation conditions.

5 Conclusion

In this work, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method tailored for SNNs. SPACE leverages SNNs’ unique spike dynamics to optimize spike-behavior-based feature map consistency across augmented samples, addressing the limitations of ANN-based TTA methods, which often ignore the temporal and sparse nature of SNN computation. Extensive experiments on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and DVS Gesture-C show SPACE achieves consistent performance improvements under severe corruptions, while generalizing effectively to various architectures like VGG9 and ResNet11. These results demonstrate SPACE’s robustness and adaptability to domain shifts, paving the way for its potential use in neuromorphic hardware and energy-efficient AI systems.

References

Amir et al. [2017] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017.
Bellec et al. [2020] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications, 11(1):3625, 2020.
Bu et al. [2022] Tong Bu, Wei Fang, Jianhao Ding, Penglin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. In International Conference on Learning Representations, 2022.
Cao et al. [2015] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113:54–66, 2015.
Dayan and Abbott [2005] Peter Dayan and Laurence F Abbott. Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
Deng et al. [2023] Shikuang Deng, Hao Lin, Yuhang Li, and Shi Gu. Surrogate module learning: Reduce the gradient error accumulation in training spiking neural networks. In International Conference on Machine Learning, pages 7645–7657. PMLR, 2023.
Ding et al. [2021] Jianhao Ding, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Optimal ann-snn conversion for fast and accurate inference in deep spiking neural networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2328–2336. International Joint Conferences on Artificial Intelligence Organization, 2021. Main Track.
Dong et al. [2024] Haoyu Dong, Nicholas Konz, Hanxue Gu, and Maciej A Mazurowski. Medical image segmentation with intent: Integrated entropy weighting for single image test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5046–5055, 2024.
Eshraghian et al. [2023] Jason K Eshraghian, Max Ward, Emre O Neftci, Xinxin Wang, Gregor Lenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D Lu. Training spiking neural networks using lessons from deep learning. Proceedings of the IEEE, 2023.
Fang et al. [2021] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2661–2671, 2021.
Ghosh-Dastidar and Adeli [2009] Samanwoy Ghosh-Dastidar and Hojjat Adeli. Spiking neural networks. International journal of neural systems, 19(04):295–308, 2009.
Han et al. [2020] Bing Han, Gopalakrishnan Srinivasan, and Kaushik Roy. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13558–13567, 2020.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Hendrycks and Dietterich [2018] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
Hendrycks et al. [2021] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021.
Hu et al. [2023] Yangfan Hu, Qian Zheng, Xudong Jiang, and Gang Pan. Fast-snn: fast spiking neural network by converting quantized ann. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Huang et al. [2024] Yifan Huang, Wei Fang, Zhengyu Ma, Guoqi Li, and Yonghong Tian. Flexible and scalable deep dendritic spiking neural networks with multiple nonlinear branching. arXiv preprint arXiv:2412.06355, 2024.
Iwasawa and Matsuo [2021] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. Advances in Neural Information Processing Systems, 34:2427–2440, 2021.
Jiang et al. [2023] Haiyan Jiang, Srinivas Anumasa, Giulia De Masi, Huan Xiong, and Bin Gu. A unified optimization framework of ANN-SNN conversion: Towards optimal mapping from activation values to firing rates. In Proceedings of the 40th International Conference on Machine Learning, pages 14945–14974. PMLR, 2023.
Khurana et al. [2021] Ansh Khurana, Sujoy Paul, Piyush Rai, Soma Biswas, and Gaurav Aggarwal. Sita: Single image test-time adaptation. arXiv preprint arXiv:2112.02355, 2021.
Kim and Panda [2021] Youngeun Kim and Priyadarshini Panda. Revisiting batch normalization for training low-latency deep spiking neural networks from scratch. Frontiers in neuroscience, 15:773954, 2021.
Koh et al. [2021] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pages 5637–5664. PMLR, 2021.
Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Lee et al. [2020] Chankyu Lee, Syed Shakib Sarwar, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in neuroscience, 14:497482, 2020.
Lee et al. [2013] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896. Atlanta, 2013.
Lenz et al. [2021] Gregor Lenz, Kenneth Chaney, Sumit Bam Shrestha, Omar Oubari, Serge Picaud, and Guido Zarrella. Tonic: event-based datasets and transformations, 2021. Documentation available under https://tonic.readthedocs.io.
Lian et al. [2023] Shuang Lian, Jiangrong Shen, Qianhui Liu, Ziming Wang, Rui Yan, and Huajin Tang. Learnable surrogate gradient for direct training spiking neural networks. In IJCAI, pages 3002–3010, 2023.
Liang et al. [2020] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International conference on machine learning, pages 6028–6039. PMLR, 2020.
Neftci et al. [2019] Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.
Niu et al. [2024] Shuaicheng Niu, Chunyan Miao, Guohao Chen, Pengcheng Wu, and Peilin Zhao. Test-time model adaptation with only forward passes. arXiv preprint arXiv:2404.01650, 2024.
Rathi et al. [2020] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations, 2020.
Renner et al. [2024] Alpha Renner, Forrest Sheldon, Anatoly Zlotnik, Louis Tao, and Andrew Sornborger. The backpropagation algorithm implemented on spiking neuromorphic hardware. Nature Communications, 15(1):9691, 2024.
Roy et al. [2019] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784):607–617, 2019.
Sengupta et al. [2019] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Sun et al. [2020] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–9248. PMLR, 2020.
Wang et al. [2020] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
Wu et al. [2018] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
Yin et al. [2021] Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
Zenke and Ganguli [2018] Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.
Zhang et al. [2022] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. Advances in neural information processing systems, 35:38629–38642, 2022.
Zheng et al. [2024] Hanle Zheng, Zhong Zheng, Rui Hu, Bo Xiao, Yujie Wu, Fangwen Yu, Xue Liu, Guoqi Li, and Lei Deng. Temporal dendritic heterogeneity incorporated with spiking neural networks for learning multi-timescale dynamics. Nature Communications, 15(1):277, 2024.

Supplementary Material

6 Kernel-Based Consistency Regularization

To investigate whether higher-dimensional feature relationships can enhance the consistency measure, we integrate kernel embeddings into the loss function. The channel-wise feature probability distribution $\mathbf{P}_{c}$ obtained from Eq. 4 can be mapped into a higher-dimensional reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, using a Gaussian kernel defined as

$$k(\mathbf{z},\mathbf{z}^{\prime})=\exp\left(-\frac{\|\mathbf{z}-\mathbf{z}^{\prime}\|^{2}}{2\sigma^{2}}\right), \tag{9}$$

where $\sigma$ is the kernel bandwidth. The kernel function $k(\mathbf{z},\mathbf{z}^{\prime})$ implicitly defines a mapping $\phi:\mathcal{Z}\rightarrow\mathcal{H}$, such that the inner product in $\mathcal{H}$ is given by the kernel:

$$k(\mathbf{z},\mathbf{z}^{\prime})=\langle\phi(\mathbf{z}),\phi(\mathbf{z}^{\prime})\rangle_{\mathcal{H}}. \tag{10}$$

This allows us to compute relationships between feature vectors in a high-dimensional space without explicitly constructing $\phi(\cdot)$. To align feature distributions from augmented samples, we use the mean embedding of distributions, which maps $\mathbf{P}_{c}(\mathbf{x}_{i})$ to a single point $\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}$ in the RKHS. Specifically, the mean embedding is defined as

$$\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}=\mathbb{E}_{\mathbf{z}_{i}\sim\mathbf{P}_{c}(\mathbf{x}_{i})}[\phi(\mathbf{z}_{i})]. \tag{11}$$

Given samples $\mathbf{z}_{i}\sim\mathbf{P}_{c}(\mathbf{x}_{i})$, the embedding can be approximated as

$$\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}\triangleq\frac{1}{D}\sum_{d=1}^{D}\phi(\mathbf{z}_{i_{d}}). \tag{12}$$

To measure the discrepancy between distributions $\mathbf{P}_{c}(\mathbf{x}_{i})$ and $\mathbf{P}_{c}(\mathbf{x}_{j})$ in RKHS, we employ the Maximum Mean Discrepancy (MMD):

$$\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j}))=\left\|\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}-\mu_{\mathbf{P}_{c}(\mathbf{x}_{j})}\right\|^{2}_{\mathcal{H}}. \tag{13}$$

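Substituting the empirical embeddings of Eq. 12 gives the estimate below. Its squared norm expands into inner products that Eq. 10 replaces with kernel evaluations, so it can be computed without explicitly constructing $\phi(\cdot)$: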

$$\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j}))=\frac{1}{D^{2}}\left\|\sum_{d=1}^{D}\phi(\mathbf{z}_{i_{d}})-\sum_{d=1}^{D}\phi(\mathbf{z}_{j_{d}})\right\|^{2}_{\mathcal{H}}. \tag{14}$$
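
For reference, expanding the squared norm in Eq. 14 and replacing each inner product with the kernel of Eq. 10 gives the standard (biased) empirical estimator, written purely in terms of kernel evaluations. This is a textbook identity rather than a formula stated in the text; the second index $d^{\prime}$ is introduced here only to write the double sums.

$$\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j}))=\frac{1}{D^{2}}\sum_{d=1}^{D}\sum_{d^{\prime}=1}^{D}\Big[k(\mathbf{z}_{i_{d}},\mathbf{z}_{i_{d^{\prime}}})-2\,k(\mathbf{z}_{i_{d}},\mathbf{z}_{j_{d^{\prime}}})+k(\mathbf{z}_{j_{d}},\mathbf{z}_{j_{d^{\prime}}})\Big]$$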

Then we average the MMD values across all channels:

$$\text{MMD}^{2}(i,j|\mathbf{x})=\frac{1}{C}\sum_{c=1}^{C}\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j})). \tag{15}$$

Finally, we integrate the average MMD values into the original loss function $\mathcal{L}$ :

$$\mathcal{L}^{*}(\theta;\mathbf{x})\triangleq\mathcal{L}+\lambda_{\text{MMD}}\sum_{1\leq j<i\leq M}\text{MMD}^{2}(i,j|\mathbf{x}). \tag{16}$$
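
The construction in Eq. 9 to Eq. 16 can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: it assumes each augmented view is given as an array of shape `(C, D, F)` holding `D` sample vectors of dimension `F` per channel, and the function names and the default values of `sigma` and `lam` are hypothetical.

```python
import itertools
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel Gram matrix (Eq. 9) between rows of a (D, F) and b (D', F)."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd2(z_i, z_j, sigma=1.0):
    """Biased empirical MMD^2 between two sample sets, using kernel values only."""
    k_ii = gaussian_kernel(z_i, z_i, sigma).mean()
    k_jj = gaussian_kernel(z_j, z_j, sigma).mean()
    k_ij = gaussian_kernel(z_i, z_j, sigma).mean()
    return float(k_ii - 2.0 * k_ij + k_jj)

def channel_avg_mmd2(view_i, view_j, sigma=1.0):
    """Average MMD^2 over channels (Eq. 15); each view has shape (C, D, F)."""
    return float(np.mean([mmd2(zi, zj, sigma) for zi, zj in zip(view_i, view_j)]))

def kernel_regularizer(views, lam=0.1, sigma=1.0):
    """lam times the sum of channel-averaged MMD^2 over all pairs j < i (Eq. 16, second term)."""
    total = sum(
        channel_avg_mmd2(views[i], views[j], sigma)
        for j, i in itertools.combinations(range(len(views)), 2)
    )
    return lam * total

# Toy data: one reference view, a lightly perturbed view, and a strongly shifted view.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 8, 5))
views = [base, base + 0.05 * rng.normal(size=base.shape), base + 2.0]

same = channel_avg_mmd2(views[0], views[0])
near = channel_avg_mmd2(views[0], views[1])
far = channel_avg_mmd2(views[0], views[2])
penalty = kernel_regularizer(views)
print(f"same={same:.4f} near={near:.4f} far={far:.4f} penalty={penalty:.4f}")
```

The penalty grows with the discrepancy between views, so lightly perturbed views contribute far less to it than strongly shifted ones.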

Shifted Domain		w.o. TTA	SPACE	SPACE + Kernel Embedding
noise	gaussian	72.38%	77.98%	77.85%
	shot	74.70%	79.34%	79.34%
	speckle	71.75%	77.38%	77.33%
	impulse	58.57%	69.41%	69.25%
blur	defocus	63.05%	71.59%	71.42%
	gaussian	57.03%	68.31%	67.98%
	motion	64.44%	72.14%	72.21%
	zoom	71.33%	74.67%	74.61%
weather	snow	76.32%	78.43%	78.48%
	fog	43.57%	52.80%	52.46%
	frost	75.72%	79.59%	79.66%
digital	brightness	82.44%	83.22%	83.25%
	contrast	22.54%	23.85%	23.40%
	elastic_transform	75.01%	75.49%	75.42%
	pixelate	70.25%	76.24%	76.13%
	jpeg_compression	84.28%	82.88%	82.98%
	spatter	78.42%	77.52%	77.47%
	saturate	80.85%	82.54%	82.64%
Average		67.93%	72.41%	72.41%

Table 5: Performance comparison of SPACE without and with kernel embedding on CIFAR-10-C dataset at level 5 corruption, using the VGG9 architecture.

Shifted Domain	w.o. TTA	MEMO	SPACE
DropPixel	44.53%	43.94%	46.97%
DropEvent	50.00%	51.52%	53.41%
RefractoryPeriod	35.16%	34.85%	39.39%
TimeJitter	42.19%	40.91%	42.42%
SpatialJitter	83.98%	82.95%	85.23%
UniformNoise	75.39%	74.24%	75.39%
Average	55.21%	54.74%	57.14%

Table 6: Performance comparison of different methods on DVS Gesture-C dataset at level 5 corruption.

To evaluate the impact of incorporating kernel embedding into the loss function, we tested its performance on the CIFAR-10-C dataset, using the BNTT model with the VGG9 architecture, as shown in Tab. 5. The results indicate that adding kernel embedding provides little to no improvement over the original SPACE method: the reported average accuracy is unchanged at 72.41%, and per-corruption differences are at most 0.45 percentage points. This limited effectiveness may be attributed to several factors. First, the distribution differences introduced by augmentations, such as noise, blur, and weather effects, are relatively small, reducing the need for advanced alignment mechanisms like kernel embedding. Additionally, the existing similarity measures in SPACE already effectively capture the relationships between augmented samples, leaving little room for further optimization. Moreover, the added computational complexity of kernel embedding, especially the cost of kernel calculations, may even slightly hinder performance in scenarios where the original approach is already sufficient. These factors combined suggest that kernel embedding is not particularly beneficial in this experimental setting.

7 Performance on DVS Gesture-C Dataset

SNNs are inherently suited for processing event-based neuromorphic data due to their temporal dynamics and energy efficiency. Static image datasets like CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C are commonly used to evaluate robustness to corruption and distribution shifts, but they lack the temporal and asynchronous characteristics of real-world event-based data. DVS-Gesture [1] allows us to evaluate the proposed method’s performance on neuromorphic datasets, emphasizing its compatibility with dynamic vision tasks and data with temporal structure.

Following the state-of-the-art methodology [18], we evaluate the robustness improvements achieved by SPACE using DVS-Gesture-C, a corrupted variant of the standard DVS-Gesture dataset. This variant introduces six distinct corruption types: DropPixel, DropEvent, RefractoryPeriod, TimeJitter, SpatialJitter, and UniformNoise. These corruptions, implemented via the Tonic API [28], effectively simulate real-world imperfections in event-based data, including sensor noise and timing inaccuracies. To ensure a comprehensive and stringent assessment, we consistently employ the highest severity level across all experimental evaluations. The pre-trained model used in this study follows a 2CONV-2FC architecture, which was selected for its balance between computational efficiency and representational capacity, ensuring a fair and reliable comparison.
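
To make the corruption types more concrete, the sketch below re-implements two of them, DropEvent and TimeJitter, in plain NumPy on a structured event array. It illustrates the idea only and is not the Tonic implementation: the field names (`x`, `y`, `t`, `p`), the function names, the 128x128 sensor size, and the parameter values are assumptions, and the severity settings used in the experiments are not reproduced here.

```python
import numpy as np

# Assumed event layout: pixel coordinates, timestamp in microseconds, polarity.
EVENT_DTYPE = np.dtype([("x", np.int16), ("y", np.int16), ("t", np.int64), ("p", np.int8)])

def drop_event(events, drop_prob, rng):
    """Discard each event independently with probability drop_prob."""
    keep = rng.random(len(events)) >= drop_prob
    return events[keep]

def time_jitter(events, std_us, rng):
    """Add zero-mean Gaussian noise to timestamps, clip at 0, and re-sort by time."""
    out = events.copy()
    noise = np.rint(rng.normal(0.0, std_us, size=len(out))).astype(np.int64)
    out["t"] = np.maximum(out["t"] + noise, 0)
    return out[np.argsort(out["t"], kind="stable")]

# Synthetic event stream (random values, for demonstration only).
rng = np.random.default_rng(0)
n = 1000
events = np.zeros(n, dtype=EVENT_DTYPE)
events["x"] = rng.integers(0, 128, n)
events["y"] = rng.integers(0, 128, n)
events["t"] = np.sort(rng.integers(0, 1_000_000, n))
events["p"] = rng.integers(0, 2, n)

dropped = drop_event(events, drop_prob=0.5, rng=rng)
jittered = time_jitter(events, std_us=1000.0, rng=rng)
print(len(events), len(dropped), len(jittered))
```

Both functions return a new array and leave the input untouched, so several corruptions can be applied to the same clean recording.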

The experimental results show that SPACE achieves higher accuracy on the DVS-Gesture-C dataset than both w.o. TTA and MEMO. As shown in Tab. 6, SPACE outperforms w.o. TTA on five of the six corruption types and ties on UniformNoise (75.39% for both), with the largest improvements observed for RefractoryPeriod (+4.23 percentage points) and DropEvent (+3.41). Compared to MEMO, SPACE achieves higher accuracy on all six corruption types, with notable gains for RefractoryPeriod (+4.54) and DropPixel (+3.03). On average, SPACE achieves an accuracy of 57.14%, surpassing both w.o. TTA (55.21%) and MEMO (54.74%).
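
The gains quoted above follow directly from Tab. 6. The short script below recomputes them from the accuracies copied out of the table; differences are in percentage points, and the variable names are illustrative.

```python
# Accuracies (%) from Tab. 6, in the order (w.o. TTA, MEMO, SPACE).
table6 = {
    "DropPixel": (44.53, 43.94, 46.97),
    "DropEvent": (50.00, 51.52, 53.41),
    "RefractoryPeriod": (35.16, 34.85, 39.39),
    "TimeJitter": (42.19, 40.91, 42.42),
    "SpatialJitter": (83.98, 82.95, 85.23),
    "UniformNoise": (75.39, 74.24, 75.39),
}

# Per-corruption gain of SPACE over each reference, in percentage points.
gain_vs_base = {k: round(v[2] - v[0], 2) for k, v in table6.items()}
gain_vs_memo = {k: round(v[2] - v[1], 2) for k, v in table6.items()}

# Column-wise mean accuracy for (w.o. TTA, MEMO, SPACE).
averages = tuple(sum(col) / len(table6) for col in zip(*table6.values()))

for name in table6:
    print(f"{name:17s} vs w.o. TTA: {gain_vs_base[name]:+.2f}  vs MEMO: {gain_vs_memo[name]:+.2f}")
print("averages:", ", ".join(f"{a:.2f}" for a in averages))
```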

While MEMO improves TTA performance by enforcing consistency among augmented samples, it faces significant challenges in SNNs due to the loss of temporal information when relying on backpropagation. This limitation is evident in the results, where MEMO struggles to handle certain corruption types effectively, such as RefractoryPeriod and TimeJitter. In contrast, SPACE addresses these challenges by fully leveraging the temporal dynamics and event-based nature of SNNs, resulting in more robust and reliable performance across diverse corruption types. These results highlight the robustness and effectiveness of SPACE in improving model performance under challenging conditions, making it a highly suitable method for event-based data with various noise and corruption types.