SPACE: SPike-Aware Consistency Enhancement for Test-Time Adaptation in Spiking Neural Networks

Xinyu Luo, Kecheng Chen, Pao-Sheng Vincent Sun, Chris XING TIAN,
Arindam Basu, Haoliang Li
Department of Electrical Engineering, City University of Hong Kong
Abstract

Spiking Neural Networks (SNNs), as a biologically plausible alternative to Artificial Neural Networks (ANNs), have demonstrated advantages in energy efficiency and temporal processing. However, SNNs are highly sensitive to distribution shifts, which can significantly degrade their performance in real-world scenarios. Traditional test-time adaptation (TTA) methods designed for ANNs often fail to address the unique computational dynamics of SNNs, such as sparsity and temporal spiking behavior. To address these challenges, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method specifically designed for SNNs. SPACE leverages the inherent spike dynamics of SNNs to maximize the consistency of spike-behavior-based local feature maps across augmented versions of a single test sample, enabling robust adaptation without requiring source data. We evaluate SPACE on multiple datasets, including CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and DVS Gesture-C. SPACE also generalizes across model architectures, achieving consistent performance improvements on both VGG9 and ResNet11. Experimental results show that SPACE outperforms state-of-the-art methods, highlighting its effectiveness and robustness in real-world settings.

1 Introduction

Recent advancements in neuroscience-inspired computing have placed Spiking Neural Networks (SNNs) at the forefront of research as a biologically plausible alternative to traditional Artificial Neural Networks (ANNs). While ANNs have achieved remarkable success in various domains, their reliance on dense, black-box architectures often limits their interpretability and energy efficiency. In contrast, SNNs emulate the sparse, event-driven dynamics of biological neurons, providing several advantages in terms of computational efficiency, temporal processing, and explainability [4, 12, 35]. However, the unique features that make SNNs advantageous—such as their reliance on temporal information and sparse spiking—also pose significant challenges in dynamic real-world environments. SNNs encode information dynamically across time, making them highly sensitive to changes in input data distribution [2]. This sensitivity can result in drastic changes to spiking behavior across the network, ultimately degrading performance when faced with distribution shifts. Understanding and enhancing the robustness of SNNs under such conditions is therefore a critical research direction.

In real-world applications, the distributions of the training data (the source domain) and the test data (the target domain) often differ due to variations in lighting, background, sensor noise, or environmental conditions. This phenomenon, known as domain shift, can significantly degrade the performance of machine learning models [15, 23]. SNNs are particularly vulnerable to domain shifts, as even minor changes in input data can disrupt the temporal dynamics and spiking behavior in intermediate layers, leading to poor generalization. Traditional test-time adaptation (TTA) solutions, such as retraining or fine-tuning on target-domain data with the help of a source-data-related proxy task [38], statistics of the source data [32], or a batch of test data [39], are often impractical in real-world scenarios because source data or additional test samples are unavailable, particularly in privacy-sensitive or resource-constrained environments. Source-free and single-instance TTA methods [9, 21, 43] offer a promising alternative: they adapt the model to the target domain during inference without access to source data. Single-instance adaptation is typically defined, as it is in this work, as one adaptation iteration using one test sample.

Among existing source-free and single-instance TTA methods, MEMO [43] and SITA [21] are two representative approaches, but both face fundamental limitations when applied to SNNs. MEMO minimizes the entropy of the averaged prediction across multiple augmented views of a test sample, enforcing output probability consistency. However, this strategy may not be optimal for SNNs, whose output probability distributions reflect only high-level semantic information and fail to capture the fine-grained spiking dynamics that encode essential features. This prevents MEMO from effectively mitigating domain shifts within SNNs’ internal representations. SITA, on the other hand, adapts batch normalization (BN) statistics by shifting them toward those computed from test-time augmentations, but this approach is also unsuitable for SNNs. Many SNN models lack BN layers [26, 42], which limits the applicability of BN-based adaptation methods such as SITA. Additionally, due to the sparsity of SNN activations, modifying BN statistics does not significantly alter neuronal spiking patterns, further reducing the ability of such methods to facilitate effective adaptation.

In this work, we propose a novel method named SPike-Aware Consistency Enhancement (SPACE), which is the first source-free and single-instance TTA approach specifically designed for SNNs. SPACE performs adaptation using only a single test point without access to source data, making it particularly suitable for real-world scenarios where source data are unavailable or privacy-sensitive. By leveraging the inherent spike dynamics of SNNs, SPACE maximizes the consistency of spike-behavior-based local feature maps across augmented samples of the same input, ensuring robust and efficient adaptation in deep SNNs. Unlike existing TTA methods, which often overlook the unique characteristics of SNNs, SPACE directly exploits spike-based representations for adaptation. Our contributions can be summarized as follows:

  1. To the best of our knowledge, we propose the first TTA method tailored for SNNs, which operates effectively using only a single test sample. This addresses the unique challenges of adapting SNNs in scenarios where source data are unavailable, while maintaining efficiency and scalability.

  2. Our method introduces a consistency-driven optimization framework to enhance the similarity of spike-behavior-based local feature maps across augmented samples. By leveraging spike dynamics, this approach ensures robust adaptation in deep SNNs.

  3. To assess the effectiveness and robustness of our proposed method, we perform extensive experiments across multiple benchmarks, including CIFAR-10, CIFAR-100 [24], Tiny-ImageNet (a modified subset of the ImageNet dataset) [6], and the neuromorphic dataset DVS Gesture [1]. Furthermore, we validate the adaptability of our method across different model architectures, specifically testing on VGG9 [37] and ResNet11 [14]. The results demonstrate that our approach not only achieves consistent performance improvements across datasets but also generalizes well to different network structures, showcasing its broad applicability and robustness.

2 Related Work

2.1 SNN-based Deep Learning

Over the years, significant progress has been made in SNN-based deep learning. Many approaches primarily relied on converting pre-trained ANN models into SNNs, enabling SNNs to inherit the representational power of ANNs while benefiting from spike-based computations [3, 8, 17, 20]. However, such conversion methods often suffer from performance degradation due to differences in activation dynamics. To overcome this, direct training of SNNs using surrogate gradient methods has gained traction, enabling end-to-end optimization of spiking models [7, 11, 29, 31, 42]. Rathi et al. [33] proposed a hybrid method that combines conversion with surrogate-gradient-based training, achieving state-of-the-art performance. These advancements have not only addressed the challenges posed by the non-differentiable nature of spike functions and the inherent complexity of temporal dynamics, but also led to the development of sophisticated SNN architectures, such as spiking convolutional networks [26, 40] and spiking recurrent networks [41, 44], which have demonstrated promising results in tasks like image classification, speech recognition, and event-driven sensor data analysis.

A critical limitation of current SNN-based deep learning approaches is their poor robustness to testing-time variations. Unlike ANNs, which operate on continuous activations and can leverage well-established testing-time adaptation techniques, SNNs are highly sensitive to input perturbations due to their dependence on precise spike timing and temporal dynamics. According to our experimental results presented in Sec. 4, existing SNN models, which are typically trained offline, lack mechanisms to dynamically adapt to changing environments or incoming data distributions at testing time. Addressing these shortcomings is essential for deploying SNNs in practical applications, particularly those requiring real-time adaptability.

2.2 Test-time Adaptation

Test-time adaptation (TTA) is a rapidly growing area in machine learning that tackles the challenge of adapting pre-trained models to unseen or shifted distributions during inference [19, 30, 38, 39]. Test-time training (TTT) [38] performs online updates to model parameters using a self-supervised proxy task on the test data. However, because the proxy task must also be trained on the source data, this dependence makes such methods impractical in scenarios where the source domain is unavailable during deployment due to privacy or storage constraints. To overcome source dependency, source-free TTA methods have been proposed, which perform adaptation using only test-time inputs. TENT [39] utilizes a straightforward yet effective entropy minimization approach to optimize batch normalization parameters during test time, without relying on any proxy task during training. Lee et al. [27] employ pseudo-labeling to adjust model predictions.

Although these approaches eliminate the need for source data, they often assume batch-level adaptation, relying on test-time statistics computed over multiple samples. This assumption limits their applicability in real-world settings where only a single test point is available at a time. MEMO [43] addresses this scenario by minimizing the entropy of the marginal output distribution over augmentations of the single test point. The objective in MEMO encourages the model to produce consistent predictions across different augmentations, thus enforcing the invariances encoded in these augmentations, while also maintaining confidence in its predictions. Another TTA method, SITA [21], adapts batch normalization statistics at inference by shifting them toward those computed from a batch of augmented versions of the single test instance.
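As a rough illustration of the MEMO-style objective described above, the following Python sketch averages the class probabilities predicted for several augmented views and computes the entropy of that average. The function name `marginal_entropy` and the example probabilities are illustrative assumptions and are not taken from [43].

```python
import math


def marginal_entropy(view_probs):
    """Entropy of the class distribution averaged over augmented views.

    view_probs: list of per-view probability vectors (each sums to 1).
    """
    num_views = len(view_probs)
    num_classes = len(view_probs[0])
    # Marginal (averaged) prediction across the augmented views.
    marginal = [
        sum(p[c] for p in view_probs) / num_views for c in range(num_classes)
    ]
    # Shannon entropy; zero-probability classes contribute nothing.
    return -sum(q * math.log(q) for q in marginal if q > 0.0)


# Hypothetical predictions for three augmented views of one test sample.
consistent = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
conflicting = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
```

Views that agree confidently yield a low marginal entropy, whereas conflicting views yield a high one, which is why minimizing this quantity rewards both consistency and confidence at the output level.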

MEMO has shown that enforcing consistency among augmented samples improves TTA performance. However, in deep SNNs, where information is transmitted through discrete spikes, backpropagation tied to output objectives suffers from temporal information loss, making effective TTA challenging. Moreover, MEMO fails to capture the domain shift-induced variations in the internal behavior of the SNN, such as changes in neuron activations and spike patterns, which are critical for robust adaptation in SNNs. Although SITA avoids backpropagation, it also struggles to accurately model the intrinsic temporal dynamics of SNNs, limiting its ability to align augmented samples effectively. To overcome these issues, we enforce consistency by directly aligning the spike patterns and behaviors of augmented samples within the SNN framework. Unlike BN-dependent methods like SITA, our approach works regardless of the presence of BN layers, ensuring broader compatibility with various SNN architectures.

3 Proposed Method

3.1 Preliminary

SNNs emulate biological neurons by transmitting and encoding information through discrete spikes. Among various neuron models, the Leaky Integrate-and-Fire (LIF) neuron [5] is widely adopted for its balance between biological realism and computational efficiency. Its membrane potential dynamics follow

$$\tau_{m}\frac{dU(t)}{dt}=-U(t)+RI(t), \tag{1}$$

where $U(t)$ represents the membrane potential of the neuron at time $t$, $\tau_{m}$ is the membrane time constant, $R$ denotes the input resistance, and $I(t)$ is the input current received from pre-synaptic neurons or inputs. When $U(t)$ exceeds a predefined threshold $U_{th}$, the neuron emits a spike and its membrane potential is reset, typically to 0 or to $U(t)-U_{th}$. Following [11], Eq. 1 is discretized as

$$u^{t}_{i}=\left(1-\frac{1}{\tau_{m}}\right)u_{i}^{t-1}+\frac{1}{\tau_{m}}\sum_{j}w_{ij}o_{j}^{t}. \tag{2}$$

Here, $j$ is the index of pre-synaptic neurons, $o_{j}$ is the binary spike activation, and $w_{ij}$ stands for the weight connections between pre- and post-synaptic neurons. Fig. 1 illustrates the LIF neuron model, depicting membrane potential evolution and spike generation in response to input spikes [10].
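To make the discretized update in Eq. 2 concrete, the following minimal Python sketch simulates a single LIF neuron. The function name `simulate_lif`, the values $\tau_m = 2$ and $U_{th} = 1$, the input sequence, and the choice of reset by subtraction are assumptions made for illustration only.

```python
def simulate_lif(input_currents, tau_m=2.0, u_th=1.0):
    """Simulate one discretized LIF neuron following Eq. 2.

    input_currents: per-step weighted input, i.e. sum_j w_ij * o_j^t.
    Returns (spikes, potentials); potentials are recorded before any reset.
    """
    u = 0.0
    spikes, potentials = [], []
    for current in input_currents:
        # Leaky integration: u^t = (1 - 1/tau_m) * u^{t-1} + (1/tau_m) * input.
        u = (1.0 - 1.0 / tau_m) * u + current / tau_m
        potentials.append(u)
        if u >= u_th:
            spikes.append(1)
            u -= u_th  # assumed reset by subtraction, as depicted in Fig. 1
        else:
            spikes.append(0)
    return spikes, potentials


spikes, potentials = simulate_lif([1.5, 1.5, 1.5, 0.0, 0.0])
```

With these assumed values, the potential accumulates over two steps before crossing the threshold, and it then decays once the input stops.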

The spiking mechanism in SNNs introduces non-differentiability, making it difficult to apply standard gradient-based optimization methods for training or test-time adaptation. To address this, surrogate gradient methods approximate the non-differentiable spike function with a smooth surrogate during the backward pass. A simple approach is the shifted Heaviside step function, where the gradient of $o_{j}$, the spiking activation function of neuron $j$, with respect to $U$ is defined as

$$\frac{\partial o_{j}}{\partial U}\triangleq\begin{cases}1,&\text{if }U\geq U_{th},\\ 0,&\text{if }U<U_{th}.\end{cases} \tag{3}$$

While this is not an exact solution, it is valid because a reset occurs after a spike is generated when $U\geq U_{th}$.
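The surrogate in Eq. 3 can be sketched in a few lines of Python. The function names, the threshold of 1.0, and the single-weight example are hypothetical; a practical implementation would register this pair as a custom forward/backward rule in an automatic differentiation framework.

```python
def spike(u, u_th=1.0):
    """Forward pass: binary spike (Heaviside step shifted to the threshold)."""
    return 1 if u >= u_th else 0


def surrogate_grad(u, u_th=1.0):
    """Backward pass: gradient of the spike output w.r.t. u as defined in Eq. 3."""
    return 1.0 if u >= u_th else 0.0


# Chain rule for one neuron with u = w * x: d(spike)/dw = surrogate_grad(u) * x.
w, x = 0.8, 1.5
u = w * x
grad_w = surrogate_grad(u) * x
```

Under this definition, a weight receives a gradient only through neurons whose potential reached the threshold at that time step.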

These foundational techniques for SNN training, including LIF dynamics and surrogate gradient approximations, form the basis for our proposed method, which focuses on enhancing TTA by ensuring consistency in spiking behavior across augmented samples.

Figure 1: Illustration of spike and membrane potential dynamics in LIF neurons. When the membrane potential reaches the threshold, the neuron fires a spike and the potential is reset by subtracting the threshold value.

3.2 SPike-Aware Consistency Enhancement

Figure 2: Overview of the SPike-Aware Consistency Enhancement (SPACE) framework for TTA in SNNs. The test sample is selected from CIFAR-10-C dataset [15] with Gaussian Noise corruption at level 5. The model here follows the VGG9 [37] architecture. The process involves four main steps: 1) Generate an augmented batch from the single test sample. 2) Pass the augmented batch through the model to obtain local feature maps, represented by the spike counts over the processing time. 3) Adapt the model by maximizing the similarity across the local feature maps of the augmented samples. 4) Use the adapted model to predict the label of the original test sample.

Algorithm Design Overview. SPike-Aware Consistency Enhancement (SPACE) is designed to improve the test-time robustness of SNNs, as shown in Fig. 2. We first generate an augmented batch from a single test sample using various augmentation techniques, introducing diversity while retaining the core characteristics of the original sample. This augmented batch is passed through the model to obtain local feature maps, represented by spike counts over time. Next, the model is adapted by maximizing the similarity across the feature maps of the augmented samples, promoting consistency in feature representations. Finally, the adapted model predicts the label of the original test sample. This approach enhances the model’s ability to generalize to unseen test samples, particularly in shifted domains. The overall method is presented in Algorithm 1, and the details are introduced below. The code will be made available after acceptance.

Algorithm 1 SPACE for Test-Time Adaptation in SNNs
Input: Test sample $\mathbf{x}$, augmentation functions $\mathcal{A}=\{a_{1},\dots,a_{N}\}$, SNN model with parameters $\theta$, and learning rate $\eta$
Output: Prediction $y^{*}$ of $\mathbf{x}$ using the adapted model
Step 1: Generate augmented batch
  Initialize augmented batch $\mathcal{B}=\emptyset$
  for each $a_{k}$ in subset $\{a_{1},\dots,a_{M}\}$ of $\mathcal{A}$ do
    Apply augmentation: $\mathbf{x}_{k}=a_{k}(\mathbf{x})$
    Add $\mathbf{x}_{k}$ to $\mathcal{B}$
  end for
Step 2: Extract feature maps
  Pass $\mathcal{B}$ through the SNN feature extractor
  Obtain feature maps: $\mathcal{F}=\{\mathbf{F}(\mathbf{x}_{1}),\mathbf{F}(\mathbf{x}_{2}),\dots,\mathbf{F}(\mathbf{x}_{M})\}$
Step 3: Compute similarity
  for each pair $(\mathbf{F}(\mathbf{x}_{i}),\mathbf{F}(\mathbf{x}_{j}))$ in $\mathcal{F}$ do
    Compute similarity $\bar{\mathcal{S}}(i,j|\mathbf{x})$ using Eqs. 4 to 6
  end for
Step 4: Compute loss and update model
  Compute loss $\mathcal{L}(\theta;\mathbf{x})$ using Eq. 7
  Update parameters using Eq. 8
Step 5: Predict via adapted model
  Predict via: $y^{*}=\arg\max_{y}p_{\theta^{*}}(y|\mathbf{x})$
  Return: Prediction result $y^{*}$

Spike-Aware Feature Maps. SPACE aims to adapt a pre-trained SNN-based model $M_{\theta}$ (with parameters $\theta\in\Theta$) during inference to improve performance under distribution shifts, without requiring ground-truth labels or access to source data. No specific training process is required, nor do we impose particular constraints on the model. The only assumptions are that $\theta$ can be adjusted and that the model produces intermediate feature maps $\mathbf{F}(\mathbf{x})$ that are differentiable with respect to $\theta$ and can be used for further analysis and adaptation. Here, $\mathbf{x}\in\mathcal{X}$ is a single test point presented to $M_{\theta}$. This is a reasonable assumption, supported by a recent advance [34] suggesting that gradient back-propagation can be implemented on spiking neuromorphic hardware. In line with previous work [21, 43], we achieve test-time robustness by leveraging data augmentations and a self-supervised adaptation objective. In our work, a set of augmentation functions with different intensities, selected from $\mathcal{A}\triangleq\{a_{1},\dots,a_{N}\}$, is applied to a single test input $\mathbf{x}$, forming an augmented batch of samples $\mathcal{B}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{M}\}$ with $M\leq N$. Using this batch, we align the spike dynamics extracted from the augmentations, so that the model produces reliable and robust predictions in the presence of distribution shifts.
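The construction of the augmented batch can be sketched as follows. The three augmentation functions, the random selection of $M$ of them, and the toy 2 x 2 input are hypothetical stand-ins; the text above does not specify which augmentations or selection rule are used.

```python
import random


def hflip(image):
    """Mirror each row of a 2D image (list of rows)."""
    return [row[::-1] for row in image]


def brighten(image, factor=1.2):
    """Scale pixel intensities and clip them to [0, 1]."""
    return [[min(1.0, p * factor) for p in row] for row in image]


def shift_right(image):
    """Shift every row one pixel to the right, padding with zeros."""
    return [[0.0] + row[:-1] for row in image]


def augmented_batch(x, augmentations, m, seed=0):
    """Apply m augmentations drawn from the pool A to x, with m <= len(A)."""
    rng = random.Random(seed)
    chosen = rng.sample(augmentations, m)
    return [aug(x) for aug in chosen]


A = [hflip, brighten, shift_right]  # hypothetical augmentation pool, N = 3
x = [[0.1, 0.5], [0.9, 0.3]]        # toy 2 x 2 test image
B = augmented_batch(x, A, m=2)      # augmented batch with M = 2
```

Each augmentation returns a new image, so the original test sample remains available for the final prediction step.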

Unlike MEMO [43], which adapts the model at test time via the conditional output distribution $p_{\theta}(y|\mathbf{x})$ of the model’s last layer, SPACE leverages the dynamic behavior of the intermediate layers of the SNN, which plays a crucial role in determining the temporal information and semantic representation of the final output. SNNs encode information through temporal dynamics and spike sparsity. These behavioral characteristics are particularly prominent in the intermediate layers and can be represented by feature maps $\mathbf{F}(\mathbf{x})$. Unlike in ANNs, these characteristics in SNNs cannot be fully captured by the output probability distribution alone. Another advantage of intermediate feature maps is their high sensitivity to changes in the input distribution, which stems from the event-driven nature of SNNs. When the input data undergoes a distribution shift, the spiking patterns in the intermediate layers may change significantly, thereby impacting the final output. Optimizing the consistency of spiking behavior in intermediate layers can effectively mitigate feature variations caused by distribution shifts, thereby enhancing the robustness of models.

To leverage the feature maps from the intermediate layers of the SNN, we focus on the spike activity over the entire time window. As described in Sec. 3.1, each neuron $j$ in an SNN can either fire a spike (outputting 1) or remain silent (outputting 0) at each time step $t$, represented by the binary function $o_{j}^{t}$. The temporal spike patterns of the network are captured in the feature map $\mathbf{O}(\mathbf{x}_{i})\in\mathbb{R}^{T\times C\times H\times W}$, where $T$ denotes the number of time steps, $C$ represents the number of channels, and $H$ and $W$ are the spatial dimensions of the local feature maps. $\mathbf{x}_{i}$ refers to the $i$-th sample in the augmented batch. Our primary objective is to use the total spike counts of the feature maps from either the last layer or the penultimate layer of the feature extraction network (typically a CNN [25]), depending on whether the final feature map is pooled into a $1\times 1$ matrix. Let $\mathcal{F}=\{\mathbf{F}(\mathbf{x}_{1}),\mathbf{F}(\mathbf{x}_{2}),\dots,\mathbf{F}(\mathbf{x}_{M})\}$ denote the collection of feature maps, where each $\mathbf{F}(\mathbf{x}_{i})\in\mathbb{R}^{C\times D}$ represents the total spike counts of neurons across different channels, with $D=H\times W$ indicating the spatial dimensionality.
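The reduction from the spike train $\mathbf{O}(\mathbf{x}_i)$ to the spike-count map $\mathbf{F}(\mathbf{x}_i)$ can be illustrated with a small Python sketch. The function name `spike_count_feature_map` and the toy spike train are assumptions; nested lists are used only to keep the example dependency-free.

```python
def spike_count_feature_map(spike_train):
    """Collapse a binary spike train O of shape [T][C][H][W] into F of shape [C][D].

    Each entry of F is the total number of spikes a neuron emitted over the
    T time steps, with the H x W grid flattened to D = H * W locations.
    """
    num_steps = len(spike_train)
    channels = len(spike_train[0])
    height = len(spike_train[0][0])
    width = len(spike_train[0][0][0])
    counts = [[0] * (height * width) for _ in range(channels)]
    for t in range(num_steps):
        for c in range(channels):
            for h in range(height):
                for w in range(width):
                    counts[c][h * width + w] += spike_train[t][c][h][w]
    return counts


# Toy spike train: T = 3 time steps, C = 2 channels, H = 1, W = 2.
O = [
    [[[1, 0]], [[0, 0]]],
    [[[1, 1]], [[0, 1]]],
    [[[0, 1]], [[0, 1]]],
]
F = spike_count_feature_map(O)
```

The two locations of the first channel fire at different time steps but end up with the same count, which illustrates that the representation keeps how often a neuron fired and discards when.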

There are several reasons for choosing the total spike counts as the observation metric. First, SNNs typically operate over a time window ranging from 25 to thousands of time steps [4, 13, 26, 33, 36], meaning that spike patterns across many time steps may be identical due to the repetitive nature of the spike dynamics. By comparing the cumulative spike counts instead of individual time-step patterns, we can effectively capture the temporal accumulation of spike activity while reducing redundancy. Second, the total spike counts reflect differences between augmented samples based on their overall spike dynamics, which is essential for our goal of aligning feature maps across augmentations. Finally, this representation significantly reduces the computational cost, as it condenses the temporal information into an aggregate value without requiring a step-by-step comparison of spike patterns.

Feature Maps Alignment. The primary objective is to encourage the model to focus on invariant features that are crucial for classification, rather than being influenced by noise or minor transformations introduced by augmentations, or by distribution shifts relative to the training data. When the feature maps of augmented samples are highly similar, the model becomes less sensitive to variations at test time, thereby improving its ability to generalize to new, unseen, and unlabeled data. To measure the similarity between feature representations of the $i$-th and $j$-th samples, we employ channel-wise similarity between corresponding local feature vectors, $\mathbf{F}_{c}(\mathbf{x}_{i})$ and $\mathbf{F}_{c}(\mathbf{x}_{j})$. Specifically, we first normalize local feature vectors using a softmax function along the spatial dimensions, ensuring that the features are transformed into a probability distribution:

$$\mathbf{P}_{c}(\mathbf{x}_{i})=\mathrm{softmax}(\mathbf{F}_{c}(\mathbf{x}_{i})). \tag{4}$$

Here, $\mathbf{F}_{c}(\mathbf{x}_{i})$ represents the local feature vector corresponding to channel $c$ of the augmented sample $\mathbf{x}_{i}$. This normalization highlights important spatial locations by amplifying prominent feature values and suppressing noise, enhancing feature distribution stability while enabling the subsequent similarity computation to capture both spatial and temporal characteristics effectively. To quantify the similarity between $\mathbf{F}(\mathbf{x}_{i})$ and $\mathbf{F}(\mathbf{x}_{j})$, we compute the average of the channel-wise inner products, defined as

$$\mathcal{S}_{c}(i,j|\mathbf{x})=\sum_{d=1}^{D}\mathbf{P}_{c,d}(\mathbf{x}_{i})\cdot\mathbf{P}_{c,d}(\mathbf{x}_{j}), \tag{5}$$
$$\bar{\mathcal{S}}(i,j|\mathbf{x})=\frac{1}{C}\sum_{c=1}^{C}\mathcal{S}_{c}(i,j|\mathbf{x}). \tag{6}$$
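Eqs. 4 to 6 can be sketched in plain Python as follows. The function names and the small spike-count maps are hypothetical, chosen only to contrast views whose activity peaks at the same locations with views whose activity peaks elsewhere.

```python
import math


def softmax(values):
    """Numerically stable softmax over one channel's D spatial locations (Eq. 4)."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_similarity(f_i, f_j):
    """Inner product of the softmax-normalized vectors of one channel (Eq. 5)."""
    p_i, p_j = softmax(f_i), softmax(f_j)
    return sum(a * b for a, b in zip(p_i, p_j))


def mean_similarity(F_i, F_j):
    """Average of the channel-wise similarities over all C channels (Eq. 6)."""
    per_channel = [channel_similarity(a, b) for a, b in zip(F_i, F_j)]
    return sum(per_channel) / len(per_channel)


# Hypothetical spike-count maps (C = 2 channels, D = 3 locations).
F_a = [[5, 0, 0], [0, 4, 1]]
F_b = [[4, 1, 0], [0, 5, 0]]  # peaks at the same locations as F_a
F_c = [[0, 0, 5], [3, 0, 0]]  # peaks at different locations
```

Because each normalized vector sums to one, the inner product lies strictly between 0 and 1 and approaches 1 as both vectors concentrate on the same location. In this toy case, `mean_similarity(F_a, F_b)` is therefore much larger than `mean_similarity(F_a, F_c)`.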

The similarity measures the degree of overlap between the feature distributions, with higher values indicating greater consistency in spike-based representations across augmentations. Based on this similarity measure, we define the loss function for model adaptation as

$$\mathcal{L}(\theta;\mathbf{x})\triangleq\sum_{1\leq j<i\leq M}\left(1-\bar{\mathcal{S}}(i,j|\mathbf{x})\right). \tag{7}$$

We then adapt the model via gradient descent for only one iteration:

$$\theta^{*}\leftarrow\theta-\eta\nabla_{\theta}\mathcal{L}(\theta;\mathbf{x}), \tag{8}$$

where $\eta$ is the learning rate. By minimizing this loss function, we enforce consistency across the feature maps of different augmentations. This reduces the variability induced by augmentations, so that the model learns stable and robust feature representations. Furthermore, by promoting feature consistency, this approach serves as a regularization mechanism, preventing over-fitting to specific augmentations and encouraging the model to focus on more fundamental and generalizable features.
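The similarity and loss of Eqs. (5) to (7) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes each augmentation is already represented by a C x D nested list of normalized values $\mathbf{P}_{c,d}$ (the output of Eq. 4, which is not reproduced here), and the function names and toy data are hypothetical.

```python
from itertools import combinations

def channel_similarity(p_i, p_j):
    """Eq. (5): inner product of two length-D normalized vectors for one channel."""
    return sum(a * b for a, b in zip(p_i, p_j))

def mean_similarity(P_i, P_j):
    """Eq. (6): average the channel-wise similarities over the C channels."""
    return sum(channel_similarity(a, b) for a, b in zip(P_i, P_j)) / len(P_i)

def consistency_loss(P_all):
    """Eq. (7): sum of (1 - mean similarity) over all pairs with j < i."""
    return sum(1.0 - mean_similarity(P_all[i], P_all[j])
               for j, i in combinations(range(len(P_all)), 2))

# Toy input (assumed, for illustration): M = 3 augmentations,
# C = 2 channels, D = 3 spatial positions, each row already normalized.
P_all = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
]
loss = consistency_loss(P_all)
```

In this toy case the first two augmentations agree in both channels and contribute nothing to the loss, while the third disagrees in channel 0 only, so each pair involving it contributes 0.5.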

While the proposed similarity measure effectively captures feature distribution overlap, we further investigated whether incorporating higher-dimensional feature relationships could enhance the consistency measure. Specifically, we experimented with a kernel embedding approach combined with Maximum Mean Discrepancy (MMD) (see Sec. 6 in Supplementary Material). However, this method did not yield noticeable performance improvements and introduced additional computational overhead, suggesting that the original similarity measure is already sufficient for our test-time adaptation.

Finally, we predict the class label with the updated model.
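The single update of Eq. (8) that precedes prediction can be illustrated with a toy example. The sketch below is only a stand-in: the actual method differentiates the consistency loss through the SNN, whereas here a hypothetical quadratic `toy_loss` plays the role of $\mathcal{L}(\theta;\mathbf{x})$ and the gradient is estimated by central finite differences, a swapped-in technique chosen so the example runs without any deep-learning library.

```python
def numerical_grad(loss_fn, theta, eps=1e-6):
    """Central finite-difference estimate of the gradient of loss_fn at theta."""
    grad = []
    for k in range(len(theta)):
        up = theta[:k] + [theta[k] + eps] + theta[k + 1:]
        down = theta[:k] + [theta[k] - eps] + theta[k + 1:]
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

def adapt_once(loss_fn, theta, lr):
    """Eq. (8): theta* <- theta - lr * grad L(theta; x), applied exactly once."""
    grad = numerical_grad(loss_fn, theta)
    return [t - lr * g for t, g in zip(theta, grad)]

# Hypothetical stand-in for L(theta; x): a quadratic with its minimum at (1, -2).
def toy_loss(theta):
    return (theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2

theta = [0.0, 0.0]
theta_star = adapt_once(toy_loss, theta, lr=0.1)
```

A single step does not reach the minimum; it only moves the parameters in the descent direction, which mirrors the one-iteration adaptation described above.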

Discussion. In SNNs, each local feature map captures distinct aspects of the input data and exhibits different spike patterns, both spatially and temporally. Since channels may have different active neurons, directly comparing the entire feature map fails to account for this variability. Instead, computing the similarity per channel and averaging the results focuses on the stability of feature representations within each channel. This isolates the contributions of active neurons, ensuring the similarity measure reflects meaningful changes rather than noise. It also minimizes the impact of irregular spike behavior or augmentation-induced noise in individual channels. Our experiments in Sec. 4.2 show that directly comparing entire feature maps leads to ineffective TTA, reinforcing the importance of the per-channel similarity approach. Given that each channel learns distinct features, averaging the similarity within each channel provides a more reliable measure of feature map consistency, improving TTA performance.

4 Experimental Results

Datasets. The experiments were conducted on three benchmark datasets: CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C [15], each containing a variety of common image corruptions. CIFAR-10-C includes 18 corruption types and CIFAR-100-C includes 19, covering noise, blur, weather, and digital distortions. CIFAR-10-C and CIFAR-100-C consist of 32×32 color images in 10 and 100 classes, respectively, mirroring CIFAR-10 and CIFAR-100 [24]. Tiny-ImageNet-C, derived from the Tiny-ImageNet dataset, contains 15 corruption types and 200 classes of 64×64 color images. We also considered a neuromorphic dataset, DVS Gesture-C [18]; the related experiments are provided in Sec. 7 of the Supplementary Material.

Experimental Setup. All experiments were conducted under the highest corruption level (level=5). Following the experimental setup of MEMO [43], we applied AugMix [16] augmentation with a batch size of 32 to each test sample to generate a diverse set of augmented samples. The backbone model for most experiments is an SNN-based VGG architecture with Batch-Normalization Through Time (BNTT) [22], featuring BN layers at each time step, with 25 inference steps. To demonstrate the general applicability of the proposed SPACE method, additional experiments were conducted using an SNN-based ResNet-11 model [26] on the CIFAR-10-C dataset, with inference steps set to 30.

Baseline TTA Methods. We compared our approach with models pre-trained on clean datasets without any test-time adaptation (referred to as w.o. TTA). Moreover, since few TTA methods are specifically designed for SNNs, we selected the two most representative traditional TTA methods, MEMO [43] and SITA [21], as our baselines.

4.1 Main Results

| Category | Shifted Domain | w.o. TTA | SITA | MEMO | SPACE |
| --- | --- | --- | --- | --- | --- |
| noise | gaussian | 72.38% | 73.06% | 77.73% | 77.98% |
| noise | shot | 74.70% | 74.15% | 79.50% | 79.34% |
| noise | speckle | 71.75% | 71.67% | 77.28% | 77.38% |
| noise | impulse | 58.57% | 58.41% | 65.74% | 69.41% |
| blur | defocus | 63.05% | 62.94% | 65.61% | 71.59% |
| blur | gaussian | 57.03% | 57.36% | 59.93% | 68.31% |
| blur | motion | 64.44% | 64.36% | 66.24% | 72.14% |
| blur | zoom | 71.33% | 70.72% | 72.61% | 74.67% |
| weather | snow | 76.32% | 76.67% | 77.38% | 78.43% |
| weather | fog | 43.57% | 43.40% | 45.47% | 52.80% |
| weather | frost | 75.72% | 75.24% | 78.64% | 79.59% |
| digital | brightness | 82.44% | 82.51% | 83.05% | 83.22% |
| digital | contrast | 22.54% | 22.02% | 23.83% | 23.85% |
| digital | elastic_transform | 75.01% | 74.74% | 76.24% | 75.49% |
| digital | pixelate | 70.25% | 69.62% | 73.25% | 76.24% |
| digital | jpeg_compression | 84.28% | 84.16% | 86.05% | 82.88% |
| digital | spatter | 78.42% | 78.61% | 79.75% | 77.52% |
| digital | saturate | 80.85% | 80.97% | 82.03% | 82.54% |
| Average | | 67.93% | 67.81% | 70.53% | 72.41% |

Table 1: Performance comparison of different methods on CIFAR-10-C dataset at level 5 corruption, using the VGG9 architecture.

In this section, we compare the performance of the proposed SPACE method with two existing approaches: MEMO [43] and SITA [21], which are current state-of-the-art methods for source-free and single-instance TTA. While MEMO improves TTA performance by enforcing consistency among augmented samples, it faces significant challenges in SNNs due to the loss of temporal information when relying on backpropagation. Similarly, SITA, despite not using backpropagation, fails to capture the intrinsic temporal dynamics of SNNs, rendering it ineffective. To demonstrate the broad applicability of the proposed method, we evaluate all methods on the CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C datasets using the VGG architecture, and on CIFAR-10-C using the ResNet-11 architecture.

4.1.1 Performance Comparison on VGG Model

| Category | Shifted Domain | w.o. TTA | SITA | MEMO | SPACE |
| --- | --- | --- | --- | --- | --- |
| noise | gaussian | 42.51% | 43.01% | 43.46% | 44.71% |
| noise | shot | 45.40% | 44.83% | 45.41% | 46.73% |
| noise | speckle | 43.35% | 43.14% | 43.53% | 44.84% |
| noise | impulse | 25.50% | 25.48% | 26.07% | 27.99% |
| blur | defocus | 42.15% | 42.23% | 42.47% | 43.35% |
| blur | gaussian | 37.76% | 37.88% | 38.29% | 39.19% |
| blur | motion | 41.87% | 42.11% | 42.55% | 43.46% |
| blur | zoom | 46.77% | 46.93% | 47.21% | 48.42% |
| blur | glass | 42.84% | 43.29% | 43.90% | 44.47% |
| weather | snow | 44.88% | 44.49% | 44.74% | 45.70% |
| weather | fog | 16.51% | 16.65% | 16.76% | 17.62% |
| weather | frost | 44.12% | 43.84% | 44.34% | 44.98% |
| digital | brightness | 49.36% | 49.53% | 49.95% | 50.41% |
| digital | contrast | 5.42% | 5.56% | 5.62% | 5.69% |
| digital | elastic_transform | 51.33% | 50.99% | 51.24% | 51.45% |
| digital | pixelate | 52.35% | 52.48% | 52.82% | 53.81% |
| digital | jpeg_compression | 58.47% | 58.71% | 59.29% | 58.96% |
| digital | spatter | 47.76% | 47.82% | 48.63% | 48.95% |
| digital | saturate | 46.91% | 47.17% | 47.17% | 48.22% |
| Average | | 41.33% | 41.38% | 41.76% | 42.58% |

Table 2: Performance comparison of different methods on CIFAR-100-C dataset at level 5 corruption, using the VGG11 architecture.


| Category | Shifted Domain | w.o. TTA | SITA | MEMO | SPACE |
| --- | --- | --- | --- | --- | --- |
| noise | gaussian | 12.43% | 12.71% | 13.64% | 16.71% |
| noise | shot | 15.20% | 15.17% | 16.45% | 19.44% |
| noise | impulse | 8.74% | 8.87% | 9.51% | 11.72% |
| blur | defocus | 7.58% | 7.56% | 7.51% | 9.62% |
| blur | motion | 14.48% | 14.57% | 14.38% | 16.57% |
| blur | zoom | 13.59% | 13.94% | 13.92% | 15.61% |
| blur | glass | 6.50% | 6.57% | 6.60% | 7.31% |
| weather | snow | 15.58% | 15.72% | 15.97% | 18.66% |
| weather | fog | 5.27% | 5.32% | 4.91% | 6.01% |
| weather | frost | 18.06% | 18.05% | 17.83% | 21.02% |
| digital | brightness | 15.01% | 14.82% | 14.74% | 17.53% |
| digital | contrast | 1.32% | 1.38% | 1.44% | 1.51% |
| digital | elastic_transform | 20.53% | 21.45% | 21.00% | 21.96% |
| digital | pixelate | 31.62% | 32.20% | 31.39% | 31.89% |
| digital | jpeg_compression | 31.15% | 31.51% | 30.82% | 31.63% |
| Average | | 14.47% | 14.66% | 14.67% | 16.48% |

Table 3: Performance comparison of different methods on Tiny-ImageNet-C dataset at level 5 corruption, using the VGG11 architecture.

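
| Category | Shifted Domain | w.o. TTA | MEMO | SPACE |
| --- | --- | --- | --- | --- |
| noise | gaussian | 72.88% | 74.21% | 75.90% |
| noise | shot | 74.49% | 75.50% | 77.81% |
| noise | speckle | 73.12% | 74.32% | 76.75% |
| noise | impulse | 49.59% | 51.35% | 56.85% |
| blur | defocus | 67.77% | 68.11% | 68.66% |
| blur | gaussian | 63.57% | 64.11% | 64.15% |
| blur | motion | 65.14% | 65.37% | 66.02% |
| blur | zoom | 73.06% | 73.96% | 73.82% |
| weather | snow | 78.11% | 78.32% | 78.74% |
| weather | fog | 42.84% | 43.47% | 45.47% |
| weather | frost | 73.12% | 73.39% | 74.04% |
| digital | brightness | 81.71% | 81.82% | 81.28% |
| digital | contrast | 12.72% | 13.63% | 14.94% |
| digital | elastic_transform | 74.88% | 75.54% | 76.79% |
| digital | pixelate | 77.08% | 77.54% | 80.32% |
| digital | jpeg_compression | 82.47% | 83.24% | 83.96% |
| digital | spatter | 78.75% | 79.46% | 80.58% |
| digital | saturate | 79.40% | 79.40% | 78.70% |
| Average | | 67.82% | 68.49% | 69.71% |

Table 4: Performance comparison of different methods on CIFAR-10-C dataset at level 5 corruption, using the ResNet11 architecture.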

The results on CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C at level 5 corruption, shown in Tab. 1, Tab. 2, and Tab. 3, demonstrate the effectiveness of SPACE. SPACE outperforms w.o. TTA, MEMO, and SITA in average accuracy across all datasets. On CIFAR-10-C, SPACE achieves 72.41%, improving by 4.48 percentage points over w.o. TTA (67.93%), 4.60 points over SITA (67.81%), and 1.88 points over MEMO (70.53%). On CIFAR-100-C, SPACE reaches 42.58%, surpassing w.o. TTA (41.33%), SITA (41.38%), and MEMO (41.76%). On Tiny-ImageNet-C, SPACE achieves 16.48%, compared to 14.47% for w.o. TTA, 14.66% for SITA, and 14.67% for MEMO. Notably, on Tiny-ImageNet-C SPACE improves over w.o. TTA in every shifted domain, showing its robustness on complex datasets.
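The reported margins follow directly from the "Average" rows of Tab. 1 to Tab. 3. The short check below recomputes them in percentage points; the dictionary layout and function name are illustrative choices, and the numbers are copied from the tables.

```python
# Average accuracies (%) copied from the "Average" rows of Tables 1-3.
averages = {
    "CIFAR-10-C": {"w.o. TTA": 67.93, "SITA": 67.81, "MEMO": 70.53, "SPACE": 72.41},
    "CIFAR-100-C": {"w.o. TTA": 41.33, "SITA": 41.38, "MEMO": 41.76, "SPACE": 42.58},
    "Tiny-ImageNet-C": {"w.o. TTA": 14.47, "SITA": 14.66, "MEMO": 14.67, "SPACE": 16.48},
}

def gains_over_baselines(row):
    """Return SPACE's margin over each other method, in percentage points."""
    return {name: round(row["SPACE"] - acc, 2)
            for name, acc in row.items() if name != "SPACE"}

gains = {dataset: gains_over_baselines(row) for dataset, row in averages.items()}
```

These are absolute differences in accuracy, not relative improvements.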

Across individual corruption types, SPACE outperforms w.o. TTA, MEMO, and SITA in most cases. For instance, on CIFAR-10-C, SPACE shows notable improvements on impulse noise (69.41% for SPACE vs. 58.57% for w.o. TTA, 58.41% for SITA, and 65.74% for MEMO) and fog (52.80% for SPACE vs. 43.57% for w.o. TTA, 43.40% for SITA, and 45.47% for MEMO). On CIFAR-100-C, SPACE similarly excels, for example on motion blur (43.46% for SPACE vs. 41.87% for w.o. TTA, 42.11% for SITA, and 42.55% for MEMO).

A particularly striking result is SPACE’s performance on Tiny-ImageNet-C, where it achieves the best accuracy on every corruption type except pixelate, where SITA is slightly higher (32.20% vs. 31.89%). For example, on gaussian noise and impulse noise, SPACE leads the next-best method, MEMO, by a clear margin (16.71% vs. 13.64% and 11.72% vs. 9.51%, respectively). Even on milder corruptions such as brightness and pixelate, SPACE outperforms w.o. TTA, demonstrating its robustness on complex datasets like Tiny-ImageNet-C.

Although SPACE generally outperforms MEMO and SITA, MEMO does better in a few cases, such as jpeg_compression on CIFAR-10-C (86.05% for MEMO vs. 82.88% for SPACE) and on CIFAR-100-C (59.29% for MEMO vs. 58.96% for SPACE). On CIFAR-10-C jpeg_compression and spatter, SPACE also falls below the unadapted model (84.28% and 78.42% for w.o. TTA, respectively). These cases suggest MEMO has an edge on specific corruptions, but they do not change SPACE’s advantage in average accuracy.

4.1.2 Performance Comparison on ResNet Model

We evaluated the SPACE method on the CIFAR-10-C dataset using the ResNet11 architecture, which lacks BN layers, making SITA inapplicable. The results, shown in Tab. 4, demonstrate SPACE’s effectiveness across different model structures. SPACE achieves an average accuracy of 69.71%, improving by 1.22 percentage points over MEMO (68.49%) and 1.89 points over w.o. TTA (67.82%).

In terms of individual corruption types, SPACE outperforms MEMO and w.o. TTA in most cases. Notable improvements include challenging corruptions such as impulse noise (56.85% vs. 49.59% for w.o. TTA and 51.35% for MEMO) and contrast (14.94% vs. 12.72% for w.o. TTA and 13.63% for MEMO). SPACE also achieves 80.32% on pixelate, outperforming MEMO (77.54%) and w.o. TTA (77.08%). These trends align with those observed in the VGG-based experiments, reaffirming SPACE’s consistency and generalizability.

While MEMO marginally outperforms SPACE on zoom (73.96% vs. 73.82%), brightness (81.82% vs. 81.28%), and saturate (79.40% vs. 78.70%), these differences are small and do not overshadow SPACE’s overall advantages. The results further confirm that SPACE is not dependent on a specific architecture and can adapt effectively to various backbone models.

4.2 Ablation Study

Figure 3: Local Feature Maps Based SPACE Outperforms Global Feature Maps Based SPACE.

Local vs. Global Feature Map Similarity. We compared the performance of local versus global feature map similarity for our proposed method. The results in Fig. 3 clearly illustrate the advantage of using similarity of local feature maps over global feature maps. While the global approach does not account for variability in neuron activations across channels, the local method isolates the contributions of active neurons within each channel, offering a more stable and meaningful measure of feature map consistency. As shown in Fig. 3, the local method yields a substantial accuracy improvement, whereas the global approach negatively impacts performance across all three datasets. This demonstrates that emphasizing individual channel stability leads to better generalization and adaptation. These findings underscore the distinct roles of each channel in SNNs and highlight the limitations of comparing entire feature maps.

Figure 4: Distribution of similarity values between spike counts based feature maps before and after SPACE adaptation.

Effect of SPACE on Feature Map Similarity. To further investigate the impact of SPACE adaptation on the feature map similarity, we analyzed the distribution of similarity values (computed using Eqs. 4 to 6) across the CIFAR-10-C dataset. As shown in Fig. 4, we observe a noticeable increase in the similarity between the original samples and their augmented counterparts after applying SPACE adaptation, particularly in the Gaussian Blur and Impulse Noise domains. The similarity distributions between spike-count-based feature maps in these domains indicate that the feature map consistency improves following the adaptation process. Specifically, the “After SPACE Adaptation” distributions shift towards higher similarity values, reflecting the enhanced alignment of feature maps across augmented samples. This demonstrates the effectiveness of SPACE adaptation in increasing the stability of feature representations and improving generalization under various augmentation conditions.

5 Conclusion

In this work, we propose SPike-Aware Consistency Enhancement (SPACE), the first source-free and single-instance TTA method tailored for SNNs. SPACE leverages SNNs’ unique spike dynamics to optimize spike-behavior-based feature map consistency across augmented samples, addressing the limitations of ANN-based TTA methods, which often ignore the temporal and sparse nature of SNN computation. Extensive experiments on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and DVS Gesture-C show SPACE achieves consistent performance improvements under severe corruptions, while generalizing effectively to various architectures like VGG9 and ResNet11. These results demonstrate SPACE’s robustness and adaptability to domain shifts, paving the way for its potential use in neuromorphic hardware and energy-efficient AI systems.

References

  • Amir et al. [2017] Arnon Amir, Brian Taba, David Berg, Timothy Melano, Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak, Alexander Andreopoulos, Guillaume Garreau, Marcela Mendoza, et al. A low power, fully event-based gesture recognition system. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7243–7252, 2017.
  • Bellec et al. [2020] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications, 11(1):3625, 2020.
  • Bu et al. [2022] Tong Bu, Wei Fang, Jianhao Ding, Penglin Dai, Zhaofei Yu, and Tiejun Huang. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. In International Conference on Learning Representations, 2022.
  • Cao et al. [2015] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113:54–66, 2015.
  • Dayan and Abbott [2005] Peter Dayan and Laurence F Abbott. Theoretical neuroscience: computational and mathematical modeling of neural systems. MIT press, 2005.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Deng et al. [2023] Shikuang Deng, Hao Lin, Yuhang Li, and Shi Gu. Surrogate module learning: Reduce the gradient error accumulation in training spiking neural networks. In International Conference on Machine Learning, pages 7645–7657. PMLR, 2023.
  • Ding et al. [2021] Jianhao Ding, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Optimal ann-snn conversion for fast and accurate inference in deep spiking neural networks. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2328–2336. International Joint Conferences on Artificial Intelligence Organization, 2021. Main Track.
  • Dong et al. [2024] Haoyu Dong, Nicholas Konz, Hanxue Gu, and Maciej A Mazurowski. Medical image segmentation with intent: Integrated entropy weighting for single image test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5046–5055, 2024.
  • Eshraghian et al. [2023] Jason K Eshraghian, Max Ward, Emre O Neftci, Xinxin Wang, Gregor Lenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D Lu. Training spiking neural networks using lessons from deep learning. Proceedings of the IEEE, 2023.
  • Fang et al. [2021] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2661–2671, 2021.
  • Ghosh-Dastidar and Adeli [2009] Samanwoy Ghosh-Dastidar and Hojjat Adeli. Spiking neural networks. International journal of neural systems, 19(04):295–308, 2009.
  • Han et al. [2020] Bing Han, Gopalakrishnan Srinivasan, and Kaushik Roy. Rmp-snn: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13558–13567, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hendrycks and Dietterich [2018] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, 2018.
  • Hendrycks et al. [2021] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021.
  • Hu et al. [2023] Yangfan Hu, Qian Zheng, Xudong Jiang, and Gang Pan. Fast-snn: fast spiking neural network by converting quantized ann. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • Huang et al. [2024] Yifan Huang, Wei Fang, Zhengyu Ma, Guoqi Li, and Yonghong Tian. Flexible and scalable deep dendritic spiking neural networks with multiple nonlinear branching. arXiv preprint arXiv:2412.06355, 2024.
  • Iwasawa and Matsuo [2021] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. Advances in Neural Information Processing Systems, 34:2427–2440, 2021.
  • Jiang et al. [2023] Haiyan Jiang, Srinivas Anumasa, Giulia De Masi, Huan Xiong, and Bin Gu. A unified optimization framework of ANN-SNN conversion: Towards optimal mapping from activation values to firing rates. In Proceedings of the 40th International Conference on Machine Learning, pages 14945–14974. PMLR, 2023.
  • Khurana et al. [2021] Ansh Khurana, Sujoy Paul, Piyush Rai, Soma Biswas, and Gaurav Aggarwal. Sita: Single image test-time adaptation. arXiv preprint arXiv:2112.02355, 2021.
  • Kim and Panda [2021] Youngeun Kim and Priyadarshini Panda. Revisiting batch normalization for training low-latency deep spiking neural networks from scratch. Frontiers in neuroscience, 15:773954, 2021.
  • Koh et al. [2021] Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. Wilds: A benchmark of in-the-wild distribution shifts. In International conference on machine learning, pages 5637–5664. PMLR, 2021.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, page 1, 1998.
  • Lee et al. [2020] Chankyu Lee, Syed Shakib Sarwar, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in neuroscience, 14:497482, 2020.
  • Lee et al. [2013] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, page 896. Atlanta, 2013.
  • Lenz et al. [2021] Gregor Lenz, Kenneth Chaney, Sumit Bam Shrestha, Omar Oubari, Serge Picaud, and Guido Zarrella. Tonic: event-based datasets and transformations, 2021. Documentation available under https://tonic.readthedocs.io.
  • Lian et al. [2023] Shuang Lian, Jiangrong Shen, Qianhui Liu, Ziming Wang, Rui Yan, and Huajin Tang. Learnable surrogate gradient for direct training spiking neural networks. In IJCAI, pages 3002–3010, 2023.
  • Liang et al. [2020] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International conference on machine learning, pages 6028–6039. PMLR, 2020.
  • Neftci et al. [2019] Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.
  • Niu et al. [2024] Shuaicheng Niu, Chunyan Miao, Guohao Chen, Pengcheng Wu, and Peilin Zhao. Test-time model adaptation with only forward passes. arXiv preprint arXiv:2404.01650, 2024.
  • Rathi et al. [2020] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In International Conference on Learning Representations, 2020.
  • Renner et al. [2024] Alpha Renner, Forrest Sheldon, Anatoly Zlotnik, Louis Tao, and Andrew Sornborger. The backpropagation algorithm implemented on spiking neuromorphic hardware. Nature Communications, 15(1):9691, 2024.
  • Roy et al. [2019] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784):607–617, 2019.
  • Sengupta et al. [2019] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sun et al. [2020] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International conference on machine learning, pages 9229–9248. PMLR, 2020.
  • Wang et al. [2020] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
  • Wu et al. [2018] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
  • Yin et al. [2021] Bojian Yin, Federico Corradi, and Sander M Bohté. Accurate and efficient time-domain classification with adaptive spiking recurrent neural networks. Nature Machine Intelligence, 3(10):905–913, 2021.
  • Zenke and Ganguli [2018] Friedemann Zenke and Surya Ganguli. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation, 30(6):1514–1541, 2018.
  • Zhang et al. [2022] Marvin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. Advances in neural information processing systems, 35:38629–38642, 2022.
  • Zheng et al. [2024] Hanle Zheng, Zhong Zheng, Rui Hu, Bo Xiao, Yujie Wu, Fangwen Yu, Xue Liu, Guoqi Li, and Lei Deng. Temporal dendritic heterogeneity incorporated with spiking neural networks for learning multi-timescale dynamics. Nature Communications, 15(1):277, 2024.

Supplementary Material

6 Kernel-Based Consistency Regularization

To investigate whether higher-dimensional feature relationships can enhance the consistency measure, we integrate kernel embeddings into the loss function. The channel-wise feature probability distribution $\mathbf{P}_{c}$ obtained from Eq. 4 can be mapped into a higher-dimensional reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, using a Gaussian kernel defined as

$$k(\mathbf{z},\mathbf{z}^{\prime})=\exp\left(-\frac{\|\mathbf{z}-\mathbf{z}^{\prime}\|^{2}}{2\sigma^{2}}\right), \tag{9}$$

where $\sigma$ is the kernel bandwidth. The kernel function $k(\mathbf{z},\mathbf{z}^{\prime})$ implicitly defines a mapping $\phi:\mathcal{Z}\rightarrow\mathcal{H}$, such that the inner product in $\mathcal{H}$ is given by the kernel:

$$k(\mathbf{z},\mathbf{z}^{\prime})=\langle\phi(\mathbf{z}),\phi(\mathbf{z}^{\prime})\rangle_{\mathcal{H}}. \tag{10}$$

This allows us to compute relationships between feature vectors in a high-dimensional space without explicitly constructing $\phi(\cdot)$. To align feature distributions from augmented samples, we use the mean embedding of distributions, which maps $\mathbf{P}_{c}(\mathbf{x}_{i})$ into a single point $\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}$ in the RKHS. Specifically, the mean embedding is defined as

$$\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}=\mathbb{E}_{\mathbf{z}_{i}\sim\mathbf{P}_{c}(\mathbf{x}_{i})}[\phi(\mathbf{z}_{i})]. \tag{11}$$

Given samples $\mathbf{z}_{i}\sim\mathbf{P}_{c}(\mathbf{x}_{i})$, the embedding can be approximated as

$$\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}\triangleq\frac{1}{D}\sum_{d=1}^{D}\phi(\mathbf{z}_{i_{d}}). \tag{12}$$

To measure the discrepancy between distributions $\mathbf{P}_{c}(\mathbf{x}_{i})$ and $\mathbf{P}_{c}(\mathbf{x}_{j})$ in the RKHS, we employ the Maximum Mean Discrepancy (MMD):

$$\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j}))=\|\mu_{\mathbf{P}_{c}(\mathbf{x}_{i})}-\mu_{\mathbf{P}_{c}(\mathbf{x}_{j})}\|^{2}_{\mathcal{H}}. \tag{13}$$

Substituting the empirical embeddings of Eq. 12 and expanding the squared norm with Eq. 10, every inner product becomes a kernel evaluation, so this can be computed without explicitly constructing $\phi(\cdot)$:

$$\begin{split}\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j}))&=\frac{1}{D^{2}}\Big\|\sum_{d=1}^{D}\phi(\mathbf{z}_{i_{d}})-\sum_{d=1}^{D}\phi(\mathbf{z}_{j_{d}})\Big\|^{2}_{\mathcal{H}}\\&=\frac{1}{D^{2}}\sum_{d=1}^{D}\sum_{d^{\prime}=1}^{D}\Big[k(\mathbf{z}_{i_{d}},\mathbf{z}_{i_{d^{\prime}}})-2k(\mathbf{z}_{i_{d}},\mathbf{z}_{j_{d^{\prime}}})+k(\mathbf{z}_{j_{d}},\mathbf{z}_{j_{d^{\prime}}})\Big].\end{split} \tag{14}$$

Then we average the MMD values across all channels:

$$\text{MMD}^{2}(i,j|\mathbf{x})=\frac{1}{C}\sum_{c=1}^{C}\text{MMD}^{2}(\mathbf{P}_{c}(\mathbf{x}_{i}),\mathbf{P}_{c}(\mathbf{x}_{j})). \tag{15}$$

Finally, we integrate the average MMD values into the original loss function $\mathcal{L}$:

$$\mathcal{L}^{*}(\theta;\mathbf{x})\triangleq\mathcal{L}+\lambda_{\text{MMD}}\sum_{1\leq j<i\leq M}\text{MMD}^{2}(i,j|\mathbf{x}) \tag{16}$$
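The computation in Eqs. (13) to (16) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian (RBF) kernel, its bandwidth `sigma`, the weight `lam`, the array layout `(M, C, D, F)` (M augmented views, C channels, D samples per channel, each a vector of length F), and all function names are hypothetical choices made for the example.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a (D, F) and b (D, F). Assumed kernel choice."""
    sq_dist = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2_channel(z_i, z_j, sigma=1.0):
    """Eq. (14) for one channel, evaluated with kernel values only."""
    return (rbf_kernel(z_i, z_i, sigma).mean()
            + rbf_kernel(z_j, z_j, sigma).mean()
            - 2.0 * rbf_kernel(z_i, z_j, sigma).mean())

def mmd2_pair(p_i, p_j, sigma=1.0):
    """Eq. (15): average the per-channel MMD^2 over the C channels. Inputs are (C, D, F)."""
    return float(np.mean([mmd2_channel(p_i[c], p_j[c], sigma)
                          for c in range(p_i.shape[0])]))

def mmd_penalty(p, sigma=1.0):
    """Sum of MMD^2(i, j | x) over all pairs 1 <= j < i <= M. Input is (M, C, D, F)."""
    m = p.shape[0]
    return sum(mmd2_pair(p[i], p[j], sigma) for i in range(m) for j in range(i))

def total_loss(base_loss, p, lam=0.1, sigma=1.0):
    """Eq. (16): original loss plus the weighted MMD penalty."""
    return base_loss + lam * mmd_penalty(p, sigma)

# Toy usage: 3 augmented views, 2 channels, 5 samples per channel, 4 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 2, 5, 4))
print(total_loss(base_loss=1.0, p=features))
```

The penalty is zero when all augmented views produce identical per-channel features and grows as their kernel mean embeddings move apart.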
| Category | Shifted Domain | w.o. TTA | SPACE | SPACE + Kernel Embedding |
|---|---|---|---|---|
| noise | gaussian | 72.38% | 77.98% | 77.85% |
| noise | shot | 74.70% | 79.34% | 79.34% |
| noise | speckle | 71.75% | 77.38% | 77.33% |
| noise | impulse | 58.57% | 69.41% | 69.25% |
| blur | defocus | 63.05% | 71.59% | 71.42% |
| blur | gaussian | 57.03% | 68.31% | 67.98% |
| blur | motion | 64.44% | 72.14% | 72.21% |
| blur | zoom | 71.33% | 74.67% | 74.61% |
| weather | snow | 76.32% | 78.43% | 78.48% |
| weather | fog | 43.57% | 52.80% | 52.46% |
| weather | frost | 75.72% | 79.59% | 79.66% |
| digital | brightness | 82.44% | 83.22% | 83.25% |
| digital | contrast | 22.54% | 23.85% | 23.40% |
| digital | elastic_transform | 75.01% | 75.49% | 75.42% |
| digital | pixelate | 70.25% | 76.24% | 76.13% |
| digital | jpeg_compression | 84.28% | 82.88% | 82.98% |
| digital | spatter | 78.42% | 77.52% | 77.47% |
| digital | saturate | 80.85% | 82.54% | 82.64% |
| | Average | 67.93% | 72.41% | 72.41% |

Table 5: Performance comparison of SPACE without and with kernel embedding on the CIFAR-10-C dataset at level 5 corruption, using the VGG9 architecture.
| Shifted Domain | w.o. TTA | MEMO | SPACE |
|---|---|---|---|
| DropPixel | 44.53% | 43.94% | 46.97% |
| DropEvent | 50.00% | 51.52% | 53.41% |
| RefractoryPeriod | 35.16% | 34.85% | 39.39% |
| TimeJitter | 42.19% | 40.91% | 42.42% |
| SpatialJitter | 83.98% | 82.95% | 85.23% |
| UniformNoise | 75.39% | 74.24% | 75.39% |
| Average | 55.21% | 54.74% | 57.14% |

Table 6: Performance comparison of different methods on the DVS Gesture-C dataset at level 5 corruption.

To evaluate the impact of incorporating kernel embedding into the loss function, we tested it on the CIFAR-10-C dataset, using the BNTT model with the VGG9 architecture, as shown in Tab. 5. The results indicate that adding kernel embedding provides no measurable improvement over the original SPACE method: both variants report the same average accuracy (72.41%), and the per-corruption differences stay below 0.5 percentage points in either direction. This limited effectiveness can be attributed to several factors. First, the distribution differences introduced by augmentations, such as noise, blur, and weather effects, are relatively small, reducing the need for advanced alignment mechanisms like kernel embedding. Second, the existing similarity measures in SPACE already capture the relationships between augmented samples effectively, leaving little room for further optimization. Third, the added computational complexity of kernel embedding, especially the cost of kernel calculations, may even slightly hinder performance in scenarios where the original approach is already sufficient. Taken together, these factors suggest that kernel embedding is not particularly beneficial in this experimental setting.

7 Performance on DVS Gesture-C Dataset

SNNs are inherently suited for processing event-based neuromorphic data due to their temporal dynamics and energy efficiency. Static image datasets like CIFAR-10-C, CIFAR-100-C, and Tiny-ImageNet-C are commonly used to evaluate robustness to corruption and distribution shifts, but they lack the temporal and asynchronous characteristics of real-world event-based data. DVS-Gesture [1] allows us to evaluate the proposed method’s performance on neuromorphic datasets, emphasizing its compatibility with dynamic vision tasks and data with temporal structure.

Following the state-of-the-art methodology [18], we evaluate the robustness improvements achieved by SPACE using DVS-Gesture-C, a corrupted variant of the standard DVS-Gesture dataset. This variant introduces six distinct corruption types: DropPixel, DropEvent, RefractoryPeriod, TimeJitter, SpatialJitter, and UniformNoise. These corruptions, implemented via the Tonic API [28], effectively simulate real-world imperfections in event-based data, including sensor noise and timing inaccuracies. To ensure a comprehensive and stringent assessment, we consistently employ the highest severity level across all experimental evaluations. The pre-trained model used in this study follows a 2CONV-2FC architecture, which was selected for its balance between computational efficiency and representational capacity, ensuring a fair and reliable comparison.
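To make two of these corruption types concrete, the sketch below applies an event-dropping and a timestamp-jitter corruption to a synthetic event stream. It is a plain NumPy illustration under stated assumptions, not the Tonic implementation used in the experiments: the event layout `EVENT_DTYPE`, the function names, and the parameter values `drop_prob` and `std` are hypothetical and do not correspond to the severity levels of DVS-Gesture-C.

```python
import numpy as np

# Hypothetical event layout: pixel coordinates, timestamp (microseconds), polarity.
EVENT_DTYPE = np.dtype([("x", np.int16), ("y", np.int16), ("t", np.int64), ("p", np.int8)])

def drop_event(events, drop_prob, rng):
    """Discard each event independently with probability drop_prob."""
    keep = rng.random(len(events)) >= drop_prob
    return events[keep]

def time_jitter(events, std, rng):
    """Add zero-mean Gaussian noise to timestamps, clip at zero, and re-sort by time."""
    out = events.copy()
    noise = np.round(rng.normal(0.0, std, size=len(out))).astype(np.int64)
    out["t"] = np.maximum(out["t"] + noise, 0)
    return out[np.argsort(out["t"], kind="stable")]

# Toy usage on a synthetic stream of 1000 events from a 128 x 128 sensor.
rng = np.random.default_rng(0)
events = np.zeros(1000, dtype=EVENT_DTYPE)
events["x"] = rng.integers(0, 128, size=1000)
events["y"] = rng.integers(0, 128, size=1000)
events["t"] = np.sort(rng.integers(0, 1_000_000, size=1000))
events["p"] = rng.integers(0, 2, size=1000)

dropped = drop_event(events, drop_prob=0.5, rng=rng)
jittered = time_jitter(events, std=1000.0, rng=rng)
print(len(events), len(dropped), len(jittered))
```

Both functions leave the input array untouched and return a new event array, so several corruptions can be evaluated from the same clean recording.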

The experimental results demonstrate that the proposed SPACE method achieves the best performance on the DVS-Gesture-C dataset compared to both w.o. TTA and the MEMO method. As shown in Tab. 6, SPACE matches or outperforms w.o. TTA on all six corruption types (UniformNoise is a tie at 75.39%), with the largest improvements observed for RefractoryPeriod (+4.23 percentage points) and DropEvent (+3.41). Compared to MEMO, SPACE achieves higher accuracy on all six corruption types, with notable gains for RefractoryPeriod (+4.54) and DropPixel (+3.03). On average, SPACE achieves an accuracy of 57.14%, surpassing w.o. TTA (55.21%) by 1.93 percentage points and MEMO (54.74%) by 2.40 percentage points.
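The per-corruption differences and averages quoted above can be recomputed directly from the values in Tab. 6. The short script below is only an arithmetic check; the dictionary and function names are illustrative.

```python
# Accuracy (%) from Tab. 6, in the order (w.o. TTA, MEMO, SPACE).
TABLE6 = {
    "DropPixel": (44.53, 43.94, 46.97),
    "DropEvent": (50.00, 51.52, 53.41),
    "RefractoryPeriod": (35.16, 34.85, 39.39),
    "TimeJitter": (42.19, 40.91, 42.42),
    "SpatialJitter": (83.98, 82.95, 85.23),
    "UniformNoise": (75.39, 74.24, 75.39),
}

def space_gains(table):
    """Return {corruption: (SPACE - w.o. TTA, SPACE - MEMO)} in percentage points."""
    return {name: (space - base, space - memo)
            for name, (base, memo, space) in table.items()}

def column_averages(table):
    """Return the mean accuracy of each column as (w.o. TTA, MEMO, SPACE)."""
    n = len(table)
    return tuple(sum(row[i] for row in table.values()) / n for i in range(3))

for name, (vs_base, vs_memo) in space_gains(TABLE6).items():
    print(f"{name:>16}: {vs_base:+.2f} vs w.o. TTA, {vs_memo:+.2f} vs MEMO")
print("averages:", ", ".join(f"{a:.2f}" for a in column_averages(TABLE6)))
```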

While MEMO adapts at test time by enforcing consistency among augmented samples, it faces significant challenges in SNNs due to the loss of temporal information when relying on backpropagation. This limitation is evident in the results: MEMO falls below the unadapted model on five of the six corruption types, including RefractoryPeriod (34.85% vs. 35.16%) and TimeJitter (40.91% vs. 42.19%). In contrast, SPACE addresses these challenges by leveraging the temporal dynamics and event-based nature of SNNs, resulting in more robust performance across diverse corruption types. These results highlight the effectiveness of SPACE under challenging conditions and make it well suited to event-based data with various noise and corruption types.