Time Series Continuous Modeling for Imputation and Forecasting with Implicit Neural Representations

Etienne Le Naour* [email protected]
EDF R&D, Palaiseau, France
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Louis Serrano* [email protected]
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Léon Migus* [email protected]
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Sorbonne Université, CNRS, Laboratoire Jacques-Louis Lions, 75005 Paris, France
Yuan Yin [email protected]
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Ghislain Agoua [email protected]
EDF R&D, Palaiseau, France
Nicolas Baskiotis [email protected]
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Patrick Gallinari [email protected]
Sorbonne Université, CNRS, ISIR, 75005 Paris, France
Criteo AI Lab, Paris, France
Vincent Guigue [email protected]
AgroParisTech, Palaiseau, France
Abstract

We introduce a novel modeling approach for time series imputation and forecasting, tailored to address the challenges often encountered in real-world data, such as irregular samples, missing data, or unaligned measurements from multiple sensors. Our method relies on a continuous-time-dependent model of the series’ evolution dynamics. It leverages adaptations of conditional, implicit neural representations for sequential data. A modulation mechanism, driven by a meta-learning algorithm, allows adaptation to unseen samples and extrapolation beyond observed time-windows for long-term predictions. The model provides a highly flexible and unified framework for imputation and forecasting tasks across a wide range of challenging scenarios. It achieves state-of-the-art performance on classical benchmarks and outperforms alternative time-continuous models.

* Equal contribution

1 Introduction

Time series analysis and modeling are ubiquitous in a wide range of fields, including industry, medicine, and climate science. The variety, heterogeneity, and increasing number of deployed sensors raise new challenges when dealing with real-world problems for which current methods often fail. For example, data are frequently irregularly sampled, contain missing values, or are unaligned when collected from distributed sensors (Schulz and Stattegger, 1997; Clark and Bjørnstad, 2004). Recent advances in deep learning have significantly improved state-of-the-art performance in both data imputation (Cao et al., 2018; Du et al., 2023) and forecasting (Zeng et al., 2022; Nie et al., 2022). Many state-of-the-art models, such as transformers, have been primarily designed for dense and regular grids (Wu et al., 2021; Nie et al., 2022; Du et al., 2023). They struggle to handle irregular data and often suffer from significant performance degradation (Chen et al., 2001; Kim et al., 2019).

Our objective is to explore alternatives to state-of-the-art (SOTA) transformers that can handle, in a unified framework, imputation and forecasting tasks for irregularly and arbitrarily sampled, unaligned time series sources. Time-dependent continuous models (Rasmussen and Williams, 2006; Garnelo et al., 2018; Rubanova et al., 2019) offer such an alternative. However, until now, their performance has lagged significantly behind that of models designed for regular discrete grids. A few years ago, implicit neural representations (INRs) emerged as a powerful tool for representing images as continuous functions of spatial coordinates (Sitzmann et al., 2020; Tancik et al., 2020), with recent applications such as image generation (Dupont et al., 2022) or even modeling dynamical systems (Yin et al., 2023).

In this work, we leverage the potential of conditional INR models within a meta-learning approach to introduce TimeFlow: a unified framework designed for modeling continuous time series and addressing imputation and forecasting tasks with irregular and unaligned observations. Our key contributions are:

  • We propose a novel framework that excels at modeling time series as continuous functions of time, accepting arbitrary time step inputs, and thus handles irregular and unaligned time series for both imputation and forecasting tasks. This is one of the first attempts to adapt INRs so that both imputation and forecasting can be handled efficiently within a unified framework. The methodology, which leverages the synergy between the model components as evidenced in this application, is a pioneering contribution to the field.

  • We conducted an extensive comparison with state-of-the-art continuous and discrete models. It demonstrates that our approach outperforms continuous and discrete SOTA deep learning approaches for imputation. For long-term forecasting, it outperforms existing continuous models on both regular and irregular samples. It is on par with SOTA discrete models on regularly sampled time series while offering much greater flexibility for irregular sampling, allowing it to cope with situations where discrete models fail. Furthermore, we show that our method effortlessly handles previously unseen time series and new time windows, making it well-suited for real-world applications.

2 Related work

Discrete methods for time series imputation and forecasting.

Recently, Deep Learning (DL) methods have been widely used for both time series imputation and forecasting. For imputation, BRITS (Cao et al., 2018) uses a bidirectional recurrent neural network (RNN). Alternative frameworks were later explored, e.g., GAN-based (Luo et al., 2018; 2019; Liu et al., 2019), VAE-based (Fortuin et al., 2020), diffusion-based (Tashiro et al., 2021), matrix-factorization-based (TIDER, Liu et al., 2023), and transformer-based (SAITS, Du et al., 2023) approaches. These methods cannot handle irregular time series. In situations involving multiple sensors, such as those placed at different locations, incorporating new sensors requires retraining the entire model, which limits their usability. For forecasting, most recent DL SOTA models are based on transformers. Initial approaches apply plain transformers directly to the series, each token being a series element (Zhou et al., 2021; Liu et al., 2022; Wu et al., 2021; Zhou et al., 2022). These transformers may underperform linear models, as shown in Zeng et al. (2022). PatchTST (Nie et al., 2022) significantly improved the SOTA performance of transformers by treating sub-series as tokens. However, none of these models can properly handle irregularly sampled look-back windows.

Continuous methods for time series.

Gaussian Processes (Rasmussen and Williams, 2006) have been a popular family of methods for modeling time series as continuous functions. They require choosing an appropriate kernel (Corani et al., 2021) and may suffer limitations in high-dimensional settings. Neural Processes (NPs) (Garnelo et al., 2018; Kim et al., 2019) parameterize Gaussian processes through an encoder-decoder architecture, leading to more computationally efficient implementations. NPs have been used to model simple signals for imputation and forecasting tasks, but struggle with more complex signals. Bilos et al. (2023) parameterize a Gaussian process through a diffusion model, but the model has difficulty adapting to a large number of timestamps. Other approaches, such as Brouwer et al. (2019) and Rubanova et al. (2019), model time series continuously with latent ordinary differential equations. mTAN (Shukla and Marlin, 2021), a transformer model, uses an attention mechanism to impute irregular time series. While these approaches have made significant progress in continuous modeling for time series, we observed that their performance on imputation and forecasting tasks is inferior to that of the aforementioned discrete models (Table 1, Table 2).

Implicit neural representations.

The recent development of implicit neural representations (INRs) has led to impressive results in computer vision (Sitzmann et al., 2020; Tancik et al., 2020; Fathony et al., 2021; Mildenhall et al., 2021). INRs represent data as a continuous function that can be queried at any coordinate. While they have been applied in other fields such as physics (Yin et al., 2023) and meteorology (Huang and Hoefler, 2023), there has been limited research on INRs for time series analysis. Prior works (Fons et al., 2022; Jeong and Shin, 2022) focused on time series generation for data augmentation and on time series encoding for reconstruction, but are limited by their fixed-grid input requirement. DeepTime (Woo et al., 2022) is the closest work to our contribution. DeepTime learns a set of basis INR functions from a training set of multiple time series and combines them using a ridge regressor, which allows it to adapt to new time series. It was designed for forecasting only: the original version cannot handle imputation properly and was adapted to do so for our comparisons. In our experiments, we demonstrate that TimeFlow significantly outperforms DeepTime for imputation, and also for forecasting when dealing with missing values in the look-back window. TimeFlow also shows a slight advantage over DeepTime when forecasting regularly sampled series.

3 The TimeFlow framework

3.1 Problem setting

We aim to develop a unified framework for time series imputation and forecasting that reduces dependency on a fixed sampling scheme. We introduce the following notations for both tasks. During training, in the imputation setting, we have access to time series on an observed temporal grid denoted $\mathcal{T}_{in}$, which is a subset of the dense temporal support $\mathcal{T}$. In the forecasting setting, we observe time series within a limited past time grid, referred to as the 'look-back window' and denoted $\mathcal{T}_{in}$ (a subset of $\mathcal{T}$), as well as a future grid, the 'horizon', denoted $\mathcal{T}_{out}$ (also a subset of $\mathcal{T}$). At test time, in both cases, given observed values on a temporal grid $\mathcal{T}_{in}^{*}$ included in a possibly new temporal window $\mathcal{T}^{*}$, our objective is to infer the time series values within $\mathcal{T}^{*}$. We have $\mathcal{T}^{*} = \mathcal{T}$ if we infer values in the training temporal support (e.g., in the classical imputation scenario, see Section 4.1), and $\mathcal{T}^{*} \neq \mathcal{T}$ if we infer values for a new temporal support (e.g., in the forecasting setting, see Section 4.2).

3.2 Key components

Our framework is articulated around three key components:

  1. (i)

    INR-based time-continuous functions: a discrete time series $x = (x_{t_1}, x_{t_2}, \ldots, x_{t_k})$ can be represented by an underlying time-continuous function $\mathbf{x}\colon t \in \mathbb{R}_{+} \to x_t \in \mathbb{R}^{d}$ (in our experiments $d = 1$). We want to approximate the ground-truth $\mathbf{x}$ by employing implicit neural representations (INRs), which are neural networks capable of learning a parameterized continuous function $f_{\theta}$ from discrete data by minimizing the reconstruction loss between the observed data and the network's outputs.

  2. (ii)

    Conditional INRs with modulations: an INR can represent only one function, whether it is an image or a time series. To effectively represent a collection of time series $(x^{(j)})_{j}$ using INRs, we improve their encoding by incorporating per-sample modulations, denoted $\psi^{(j)}$. These modulations condition the parameters $\theta$ of the INR. We use the notation $f_{\theta, \psi^{(j)}}$ to refer to the conditioned INR with modulations $\psi^{(j)}$.

  3. (iii)

    Optimization-based encoding: the conditioning modulation parameters $\psi^{(j)}$ are computed as a function of codes $z^{(j)}$ that represent the individual sample series. We acquire these codes $z^{(j)}$ through a meta-learning optimization process using an auto-decoding strategy. Notably, auto-decoding has been found to be more efficient for this purpose than set encoders (Kim et al., 2019).

In the following sections, we will elaborate on each component of our method. Given that the choices made for each component and the methodology developed to enhance their synergy are essential aspects, we provide a discussion of the various choices involved in Section 3.4.

INR-based time-continuous functions.

We implement our INR with Fourier features and a feed-forward network (FFN) with ReLU activations, i.e., for a time coordinate $t \in \mathcal{T}$, the output of the INR $f_{\theta}$ is given by $f_{\theta}(t) = \text{FFN}(\gamma(t))$. The Fourier features $\gamma(\cdot)$ are a frequency embedding of the time coordinates used to capture high frequencies (Tancik et al., 2020; Mildenhall et al., 2021). In our case, we chose $\gamma(t) := (\sin(\pi t), \cos(\pi t), \cdots, \sin(2^{N-1}\pi t), \cos(2^{N-1}\pi t))$, with $N$ the number of fixed frequencies. For an INR with $L$ layers, the output is computed as follows: (i) we get the frequency embedding $\phi_0 = \gamma(t)$, (ii) we update the hidden states according to $\phi_l = \text{ReLU}(\theta_l \phi_{l-1} + b_l)$ for $l = 1, \ldots, L$, (iii) we project onto the output space $f_{\theta}(t) = \theta_{L+1}\phi_L + b_{L+1}$.
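To make this construction concrete, below is a minimal PyTorch sketch of the Fourier-feature INR described above. The module and variable names are ours, not the authors'; the official implementation may differ. The optional `shifts` argument anticipates the per-layer shift modulations introduced in the next paragraph.

```python
import torch
import torch.nn as nn


class FourierFeatures(nn.Module):
    """gamma(t) = (sin(pi t), cos(pi t), ..., sin(2^{N-1} pi t), cos(2^{N-1} pi t))."""
    def __init__(self, n_frequencies: int = 64):
        super().__init__()
        # Fixed dyadic frequencies pi, 2*pi, ..., 2^{N-1}*pi.
        self.register_buffer("freqs", torch.pi * 2.0 ** torch.arange(n_frequencies))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (..., 1) time coordinates -> (..., 2N) frequency embedding.
        angles = t * self.freqs
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class INR(nn.Module):
    """f_theta(t) = FFN(gamma(t)), an L-layer ReLU MLP over the frequency embedding."""
    def __init__(self, n_frequencies: int = 64, width: int = 256, depth: int = 5, out_dim: int = 1):
        super().__init__()
        self.gamma = FourierFeatures(n_frequencies)
        dims = [2 * n_frequencies] + [width] * depth
        self.hidden = nn.ModuleList([nn.Linear(dims[l], dims[l + 1]) for l in range(depth)])
        self.out = nn.Linear(width, out_dim)

    def forward(self, t: torch.Tensor, shifts=None) -> torch.Tensor:
        phi = self.gamma(t)
        for l, layer in enumerate(self.hidden):
            phi = layer(phi)                  # theta_l * phi_{l-1} + b_l
            if shifts is not None:
                phi = phi + shifts[l]         # + psi_l (shift modulation, see below)
            phi = torch.relu(phi)
        return self.out(phi)                  # theta_{L+1} * phi_L + b_{L+1}
```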

Figure 1: Overview of the TimeFlow architecture. Forward pass to approximate the time series $x^{(j)}$. $\sigma$ stands for the ReLU activation function.
Conditional INRs with modulations.

As indicated, sample conditioning of the INR is performed through modulations of its parameters. In order to adapt the model rapidly to new samples, the conditioning should rely on only a small number of the INR parameters. This is achieved by modifying only the biases of the INR through the introduction of an additional bias term $\psi_l^{(j)}$ for each layer $l$, also known as shift modulation. To further limit the versatility of the conditioning, we generate the instance modulations $\psi^{(j)}$ from compact codes $z^{(j)}$ through a linear hypernetwork $h$ with parameters $w$, i.e., $\psi^{(j)} = h_w(z^{(j)})$. Consequently, the approximation of a time series $x^{(j)}$, denoted globally as $f_{\theta, h_w(z^{(j)})}$, depends on shared parameters $\theta$ and $w$ that are common to all the INRs involved in modeling the series family, and on the code $z^{(j)}$ specific to series $x^{(j)}$. The output of the $l$-th layer of the modulated INR is given by $\phi_l = \text{ReLU}(\theta_l \phi_{l-1} + b_l + \psi_l^{(j)})$, where $\psi_l^{(j)} = W_l z^{(j)}$ and $w := (W_l)_{l=1}^{L}$ are the parameters of the hypernetwork $h_w$. This design gathers information shared across samples into the common parameters of the INR and hypernetwork, while the codes contain only information specific to their respective time series samples. The architecture is illustrated in Figure 1.
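A matching sketch of the linear hypernetwork $h_w$ and the modulated forward pass $f_{\theta, h_w(z^{(j)})}$ is given below, reusing the `INR` class from the previous sketch. This is again an illustrative reimplementation under our own naming, not the authors' code; the layer sizes follow the hyperparameters reported in Section 3.4.

```python
import torch
import torch.nn as nn


class ShiftHypernetwork(nn.Module):
    """Linear hypernetwork h_w: code z -> per-layer shift modulations psi_l = W_l z."""
    def __init__(self, code_dim: int = 128, width: int = 256, depth: int = 5):
        super().__init__()
        # One weight matrix W_l per INR hidden layer, with no bias term.
        self.proj = nn.ModuleList([nn.Linear(code_dim, width, bias=False) for _ in range(depth)])

    def forward(self, z: torch.Tensor):
        return [W(z) for W in self.proj]


def modulated_forward(inr, hyper, z, t):
    """Evaluate f_{theta, h_w(z)}(t): shared theta, w plus a series-specific code z."""
    return inr(t, shifts=hyper(z))
```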

Optimization-based encoding.

We condition the INR on the data from $\mathcal{T}_{in}$, and learn the shared INR and hypernetwork parameters $\theta$ and $w$ using $\mathcal{T}_{in}$ for both imputation and forecasting, and $\mathcal{T}_{out}$ for forecasting only. We achieve the conditioning on $\mathcal{T}_{in}$ by optimizing the codes $z^{(j)}$ through gradient descent. The joint optimization of the codes and common parameters is challenging. In TimeFlow, it is achieved through a meta-learning approach adapted from Dupont et al. (2022) and Zintgraf et al. (2019). The objective is to learn shared parameters such that the code $z^{(j)}$ can be adapted in just a few gradient steps for a new series $x^{(j)}$. For training, we perform parameter optimization at two levels: the inner loop and the outer loop. The inner loop adapts the code $z^{(j)}$ to condition the network on the set $\mathcal{T}_{in}^{(j)}$, while the outer loop updates the common parameters using $\mathcal{T}_{in}^{(j)}$, and also $\mathcal{T}_{out}^{(j)}$ for forecasting (see Appendix H for more detailed intuition). We present our training optimization in Algorithm 1. At each training epoch and for each batch $\mathcal{B}$ of time series $x^{(j)}$ sampled from the training set, we first update the codes $z^{(j)}$ individually in the inner loop, before updating the common parameters in the outer loop using a loss over the whole batch. We introduce a parameter $\lambda$ to weight the importance of the loss over $\mathcal{T}_{out}$ w.r.t. the loss over $\mathcal{T}_{in}$ in the outer loop. In practice, when $\mathcal{T}_{out}$ exists, i.e., for forecasting, we set $\lambda = 1$, and $\lambda = 0$ otherwise.
We use an MSE loss over the observation grid: $\mathcal{L}_{\mathcal{T}}(x, \tilde{x}) := \mathbb{E}_{t \sim \mathcal{T}}[(x_t - \tilde{x}_t)^2]$. We denote by $\alpha$ and $\eta$ the learning rates of the inner loop and the outer loop. Using $K = 3$ steps for training and testing is sufficient in our experiments thanks to the use of second-order meta-learning, as explained in Section 3.4.

while no convergence do
    Sample batch $\mathcal{B}$ of data $(x^{(j)})_{j \in \mathcal{B}}$;
    Set codes to zero: $z^{(j)} \leftarrow 0, \; \forall j \in \mathcal{B}$;
    // inner loop for encoding:
    for $j \in \mathcal{B}$ and step $\in \{1, \ldots, K\}$ do
        $z^{(j)} \leftarrow z^{(j)} - \alpha \nabla_{z^{(j)}} \mathcal{L}_{\mathcal{T}_{in}^{(j)}}(f_{\theta, h_w(z^{(j)})}, x^{(j)})$;
    // outer loop step:
    $[\theta, w] \leftarrow [\theta, w] - \eta \nabla_{[\theta, w]} \frac{1}{|\mathcal{B}|} \sum_{j \in \mathcal{B}} \big[ \mathcal{L}_{\mathcal{T}_{in}^{(j)}}(f_{\theta, h_w(z^{(j)})}, x^{(j)}) + \lambda \mathcal{L}_{\mathcal{T}_{out}^{(j)}}(f_{\theta, h_w(z^{(j)})}, x^{(j)}) \big]$;

Algorithm 1: TimeFlow Training
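A condensed PyTorch sketch of Algorithm 1 is shown below. It assumes the `INR`, `ShiftHypernetwork`, and `modulated_forward` helpers from the previous sketches, and uses `create_graph=True` in the inner loop so that the outer update is second-order, as in the paper. Batching, device handling, and other engineering details are deliberately simplified.

```python
import torch


def inner_adapt(inr, hyper, t_in, x_in, code_dim=128, K=3, alpha=1e-2):
    """Inner loop: adapt the code z of one series, starting from zero, with K gradient steps on T_in."""
    z = torch.zeros(code_dim, requires_grad=True)
    for _ in range(K):
        loss = ((modulated_forward(inr, hyper, z, t_in) - x_in) ** 2).mean()
        # Keep the graph so the outer loop can differentiate through the inner updates (second order).
        (grad_z,) = torch.autograd.grad(loss, z, create_graph=True)
        z = z - alpha * grad_z
    return z


def outer_step(inr, hyper, batch, optimizer, lam=1.0, K=3, alpha=1e-2):
    """Outer loop: update the shared parameters theta, w on a batch of (t_in, x_in, t_out, x_out).
    For imputation, pass t_out = x_out = None and lam = 0."""
    total = 0.0
    for t_in, x_in, t_out, x_out in batch:
        z = inner_adapt(inr, hyper, t_in, x_in, K=K, alpha=alpha)
        loss = ((modulated_forward(inr, hyper, z, t_in) - x_in) ** 2).mean()
        if lam > 0 and t_out is not None:
            loss = loss + lam * ((modulated_forward(inr, hyper, z, t_out) - x_out) ** 2).mean()
        total = total + loss
    total = total / len(batch)
    optimizer.zero_grad()
    total.backward()   # gradients w.r.t. theta and w flow through the inner updates
    optimizer.step()
    return total.item()


# Example setup (hyperparameters from Section 3.4):
# inr, hyper = INR(), ShiftHypernetwork()
# optimizer = torch.optim.Adam(list(inr.parameters()) + list(hyper.parameters()), lr=5e-4)
```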

3.3 TimeFlow inference

During the inference process, we aim to infer the time series value at each timestamp of the dense grid $\mathcal{T}^{*(j)}$ based on the partial observation grid $\mathcal{T}^{*(j)}_{in} \subset \mathcal{T}^{*(j)}$. We can encounter two scenarios: (i) one where we observe the same time window as during training ($\mathcal{T}^{*(j)} = \mathcal{T}^{(j)}$), as in the imputation setting in Section 4.1; (ii) one where we are dealing with a newly observed time window ($\mathcal{T}^{*(j)} \neq \mathcal{T}^{(j)}$), as in the forecasting setting in Section 4.2. At inference, the parameters $\theta$ and $w$ are kept fixed at their final training values. We optimize the individual codes $z^{*(j)}$ on the newly observed grid $\mathcal{T}^{*(j)}_{in}$ using the $K$ inner steps of the meta-learning algorithm, as described in Algorithm 2. We are then in a position to query $f_{\theta, h_w(z^{*(j)})}(t)$ for any given timestamp $t \in \mathcal{T}^{*(j)}$.

For the $j$-th series $x^{(j)}$, set the code to zero: $z^{*(j)} \leftarrow 0$;
for step $\in \{1, \ldots, K\}$ do
    $z^{*(j)} \leftarrow z^{*(j)} - \alpha \nabla_{z^{*(j)}} \mathcal{L}_{\mathcal{T}^{*(j)}_{in}}(f_{\theta, h_w(z^{*(j)})}, x^{(j)})$;
Query $f_{\theta, h_w(z^{*(j)})}(t)$ for any $t \in \mathcal{T}^{*(j)}$

Algorithm 2: TimeFlow Inference with trained $\theta, w$
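For completeness, a possible realization of Algorithm 2 with the same helpers is sketched below: the shared parameters stay untouched and only the code is adapted before querying arbitrary timestamps. This is an illustrative sketch, not the reference code.

```python
def timeflow_inference(inr, hyper, t_in, x_in, t_query, code_dim=128, K=3, alpha=1e-2):
    """Adapt z* on the observed grid T*_in, then query f_{theta, h_w(z*)} anywhere in T*."""
    # theta and w are frozen: no optimizer step is taken on the shared parameters.
    z_star = inner_adapt(inr, hyper, t_in, x_in, code_dim=code_dim, K=K, alpha=alpha)
    with torch.no_grad():
        return modulated_forward(inr, hyper, z_star, t_query)
```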

3.4 Discussion on implementation choices

As indicated before, adapting the components and enhancing their synergy for the tasks of imputation and forecasting is not trivial and requires careful choices. We conducted several ablation studies to provide a comprehensive examination of key implementation choices of our framework.

Our findings can be summarized as follows:

  • Choice of INR: An FFN with Fourier Features outperformed other popular INRs for the tasks considered in this study. Unlike SIREN (Sitzmann et al., 2020), which does not explicitly incorporate frequencies but uses sine activation functions, the Fourier features network can more effectively capture a wider range of frequencies, especially at low sampling rates. This is crucial for accurately capturing high frequencies in sparsely observed time series. Our experiments, detailed in Section B.2.1 and Table 4, demonstrate this superiority across various datasets.

  • Choice of encoding / meta-learning: using a set encoder to learn the compact conditioning codes $z$, in place of the auto-decoding strategy used here, proved much less effective on complex datasets. This is further elaborated in Section B.2.4 and Table 11. Additionally, replacing the second-order meta-learning optimization with a first-order one, such as REPTILE (Nichol et al., 2018), led to unstable training, as shown in Table 10.

  • Choice of modulations: Complexifying the modulation by introducing scaling parameters in addition to shift parameters did not provide performance gains. Our experiments on the Electricity dataset, detailed in Section B.2.5 and Table 12, indicate that shift-only modulation is more efficient.

For TimeFlow, across all experiments, we used a code dimension of 128, an FFN with a depth of 5 and a width of 256, and 64 Fourier features. We used 3 inner steps with a learning rate of 0.01 for the inner loop, and a learning rate of $5 \times 10^{-4}$ for the outer loop. We performed a comprehensive analysis to understand, notably, the influence of the dimension of $z$ (a latent code dimension of 128 was suitable for our tasks, as supported by the results in Section B.2.2 and Table 5) and the influence of the number of inner steps (using 3 inner steps for training and inference struck a favorable balance between reconstruction quality and computational efficiency, as detailed in Section B.2.3).

4 Experiments

We conducted a comprehensive evaluation of our TimeFlow framework on three different tasks, comparing its performance to state-of-the-art continuous and discrete baselines. In Section 4.1, we assess TimeFlow's ability to impute sparsely observed time series under various sampling rates. Section 4.2 focuses on long-term forecasting, where we evaluate TimeFlow over standard long-term forecasting horizons. In Section 4.3, we tackle a challenging task: forecasting with incomplete look-back windows, which combines the challenges of imputation and forecasting. The code for the experiments is available at this link.

Datasets.

We tested our framework on three extensive multivariate datasets where a single phenomenon is measured at multiple locations over time, namely Electricity, Traffic, and Solar. The Electricity dataset comprises hourly electricity load curves of 321 customers in Portugal, spanning the years 2012 to 2014. The Traffic dataset is composed of hourly road occupancy rates from 862 locations in San Francisco during 2015 and 2016. Lastly, the Solar dataset contains measurements of solar power production from 137 photovoltaic plants in Alabama, recorded at 10-minute intervals in 2006. Additionally, we have created an hourly version, SolarH, for the sake of consistency in the forecasting section. These datasets differ in several characteristics: the Traffic and Electricity datasets exhibit both daily and weekly seasonality, while the Solar dataset possesses only a daily frequency; and the Electricity dataset shows more individual variability across samples and more pronounced trends than the Traffic and Solar datasets.

4.1 Imputation

We consider the classical imputation setting where $n$ time series are partially observed over a given time window. Using our approach, we can predict, for each time series, the value at any timestamp $t$ in that time window based on the partial observations.

Setting.

For a time series $x^{(j)}$, we denote the set of observed points as $\mathcal{T}_{in}^{(j)}$ and the ground-truth set of points as $\mathcal{T}^{(j)}$. The observed time grids may be irregularly spaced and may differ across the different time series ($\mathcal{T}_{in}^{(j_1)} \neq \mathcal{T}_{in}^{(j_2)}$ for $j_1 \neq j_2$). The model is trained on each $x^{(j)}$ following Algorithm 1. Then, we aim to infer, for any unobserved $t \in \mathcal{T}^{(j)}$, the missing value $x_t^{(j)}$ conditioned on $\mathcal{T}^{(j)}_{in}$ according to Algorithm 2. For this imputation task, the TimeFlow training and inference procedures are detailed in Section 3 and illustrated in Figure 2. For comparison with the SOTA imputation baselines, we assume that the ground-truth time grid is the same for each sample. The subsampling rate $\tau$ is defined as the proportion of observed values.
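As an illustration of this protocol, the hypothetical helper below shows one way to split a dense grid into an observed grid at rate $\tau$ and the complementary missing grid on which imputation error is evaluated; it is only meant to clarify the setting, not to reproduce the exact experimental pipeline.

```python
import numpy as np


def split_observed_missing(n_timestamps, tau, seed=None):
    """Keep a fraction tau of the dense grid T as the observed grid T_in; the rest is the missing grid."""
    rng = np.random.default_rng(seed)
    dense = np.arange(n_timestamps)
    observed = np.sort(rng.choice(dense, size=int(tau * n_timestamps), replace=False))
    missing = np.setdiff1d(dense, observed)
    return observed, missing


# e.g. a 2000-timestamp window at tau = 0.1 -> 200 observed points, 1800 points to impute
t_in, t_eval = split_observed_missing(2000, tau=0.1, seed=0)
```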

Figure 2: Training and inference procedures of TimeFlow for imputation. (i) During training, for each time series $x^{(j)}$, the observations (red dots) are restricted to the sparsely sampled grid, denoted $\mathcal{T}^{(j)}_{in}$. (ii) During inference, the objective is to infer the values over the dense grid $\mathcal{T}^{(j)}$ at the unobserved data points (blue dots).
Baselines.

We compare TimeFlow with various baselines, including discrete imputation methods such as CSDI (Tashiro et al., 2021), SAITS (Du et al., 2023), BRITS (Cao et al., 2018), and TIDER (Liu et al., 2023), and continuous ones such as Neural Process (NP, Garnelo et al., 2018), mTAN (Shukla and Marlin, 2021), and DeepTime with slight adjustments (Woo et al., 2022) (details in Section D.3). See Section D.1.1 for the baseline training procedure and hyperparameter selection. For each dataset, we divide the series into five independent time windows (consisting of 2000 timestamps for Electricity and Traffic, and 10,000 timestamps for Solar), perform imputation on each time window, and average the performance to obtain robust results. We evaluate the quality of the models for different subsampling rates, from the easiest $\tau = 0.5$ to the most difficult $\tau = 0.05$. All the scores presented in the experiments are reported as Mean Absolute Error (MAE).

Table 1: Mean MAE imputation results on the missing grid only. Each time series is divided into 5 time windows onto which imputation is performed, and the performances are averaged over the 5 windows. In the table, $\tau$ stands for the subsampling rate, i.e., the proportion of observed points considered for each time window. Bold results are best, underlined results are second best. TimeFlow improvement represents the overall percentage improvement achieved by TimeFlow compared to the specific method being considered. TimeFlow, DeepTime, mTAN, and Neural Process are continuous methods; CSDI, SAITS, BRITS, and TIDER are discrete methods.

| Dataset | $\tau$ | TimeFlow | DeepTime | mTAN | Neural Process | CSDI | SAITS | BRITS | TIDER |
|---|---|---|---|---|---|---|---|---|---|
| Electricity | 0.05 | 0.324 ± 0.013 | 0.379 ± 0.037 | 0.575 ± 0.039 | 0.357 ± 0.015 | 0.462 ± 0.021 | 0.384 ± 0.019 | 0.329 ± 0.015 | 0.427 ± 0.010 |
| Electricity | 0.10 | 0.250 ± 0.010 | 0.333 ± 0.034 | 0.412 ± 0.047 | 0.417 ± 0.057 | 0.398 ± 0.072 | 0.308 ± 0.011 | 0.287 ± 0.015 | 0.399 ± 0.009 |
| Electricity | 0.20 | 0.225 ± 0.008 | 0.244 ± 0.013 | 0.342 ± 0.014 | 0.320 ± 0.017 | 0.341 ± 0.068 | 0.261 ± 0.008 | 0.245 ± 0.011 | 0.391 ± 0.010 |
| Electricity | 0.30 | 0.212 ± 0.007 | 0.240 ± 0.014 | 0.335 ± 0.015 | 0.300 ± 0.022 | 0.277 ± 0.059 | 0.236 ± 0.008 | 0.221 ± 0.008 | 0.384 ± 0.009 |
| Electricity | 0.50 | 0.194 ± 0.007 | 0.227 ± 0.012 | 0.340 ± 0.022 | 0.297 ± 0.016 | 0.168 ± 0.003 | 0.209 ± 0.008 | 0.193 ± 0.008 | 0.386 ± 0.009 |
| Solar | 0.05 | 0.095 ± 0.015 | 0.190 ± 0.020 | 0.241 ± 0.102 | 0.115 ± 0.015 | 0.374 ± 0.033 | 0.142 ± 0.016 | 0.165 ± 0.014 | 0.291 ± 0.009 |
| Solar | 0.10 | 0.083 ± 0.015 | 0.159 ± 0.013 | 0.251 ± 0.081 | 0.114 ± 0.014 | 0.375 ± 0.038 | 0.124 ± 0.018 | 0.132 ± 0.015 | 0.276 ± 0.010 |
| Solar | 0.20 | 0.072 ± 0.015 | 0.149 ± 0.020 | 0.314 ± 0.035 | 0.109 ± 0.016 | 0.217 ± 0.023 | 0.108 ± 0.014 | 0.109 ± 0.012 | 0.270 ± 0.010 |
| Solar | 0.30 | 0.061 ± 0.012 | 0.135 ± 0.014 | 0.338 ± 0.05 | 0.108 ± 0.016 | 0.156 ± 0.002 | 0.100 ± 0.015 | 0.098 ± 0.012 | 0.266 ± 0.010 |
| Solar | 0.50 | 0.054 ± 0.013 | 0.098 ± 0.013 | 0.315 ± 0.080 | 0.107 ± 0.015 | 0.079 ± 0.011 | 0.094 ± 0.013 | 0.088 ± 0.013 | 0.262 ± 0.009 |
| Traffic | 0.05 | 0.283 ± 0.016 | 0.246 ± 0.010 | 0.406 ± 0.074 | 0.318 ± 0.014 | 0.337 ± 0.045 | 0.293 ± 0.007 | 0.261 ± 0.010 | 0.363 ± 0.007 |
| Traffic | 0.10 | 0.211 ± 0.012 | 0.214 ± 0.007 | 0.319 ± 0.025 | 0.288 ± 0.018 | 0.288 ± 0.017 | 0.237 ± 0.006 | 0.245 ± 0.009 | 0.362 ± 0.006 |
| Traffic | 0.20 | 0.168 ± 0.006 | 0.216 ± 0.006 | 0.270 ± 0.012 | 0.271 ± 0.011 | 0.269 ± 0.017 | 0.197 ± 0.005 | 0.224 ± 0.008 | 0.361 ± 0.006 |
| Traffic | 0.30 | 0.151 ± 0.007 | 0.172 ± 0.008 | 0.251 ± 0.006 | 0.259 ± 0.012 | 0.240 ± 0.037 | 0.180 ± 0.006 | 0.197 ± 0.007 | 0.355 ± 0.006 |
| Traffic | 0.50 | 0.139 ± 0.007 | 0.171 ± 0.005 | 0.278 ± 0.040 | 0.240 ± 0.021 | 0.144 ± 0.022 | 0.160 ± 0.008 | 0.161 ± 0.060 | 0.354 ± 0.007 |
| TimeFlow improvement | | / | 24.14 % | 50.53 % | 31.61 % | 36.12 % | 20.33 % | 18.90 % | 53.40 % |
Results.

We show in Table 1 that TimeFlow outperforms both discrete and continuous models across almost all $\tau$'s on the given datasets. The relative improvements of TimeFlow over the baselines, as defined in Appendix C, are significant, ranging from 15% to 50%. In particular, at the lowest sampling rate $\tau = 0.05$, TimeFlow outperforms all discrete baselines, demonstrating the advantages of continuous modeling. Additionally, it achieves lower imputation errors than the continuous models in all but one case. Qualitatively, the example series in Figure 3 show that our model has significant imputation capabilities at a subsampling rate of $\tau = 0.1$ on the Electricity dataset. It captures the different frequencies and amplitudes well in a challenging case (sample 35), although it underestimates the amplitude of some peaks. In a more challenging scenario (sample 25), where the series exhibits additional trend changes and frequency variations, TimeFlow correctly imputes most timestamps, outperforming BRITS, the best-performing baseline on the Electricity dataset.

Figure 3: Electricity dataset. TimeFlow imputation (blue line) and BRITS imputation (gray line) with 10% of known points (red points) on the first eight days of samples 35 (top) and 25 (bottom).
Imputation on previously unseen time series.

In more practical scenarios, such as cases involving the installation of new sensors, we often encounter new time series originating from the same underlying phenomenon. In such instances, it becomes crucial to make inferences for these previously unseen time series. Thanks to efficient adaptation in latent space, our model can easily be applied to these new time series (as shown in Section D.2, Table 18), contrasting with SOTA methods like SAITS and BRITS, which require full model retraining on the whole set of time series.

4.2 Forecasting

4.2.1 Forecasting for known time series

In this section, we are interested in the conventional long-term forecasting scenario. It consists of predicting the phenomenon over a specific future period, the horizon, based on the history of a limited past period, the look-back window. The forecaster is trained on a set of $n$ observed time series over a given time window (train period) and tested on new, distinct time windows.

Setting.

For a given time series $x^{(j)}$, $\mathcal{T}^{(j)}_{in}$ denotes the look-back window and $\mathcal{T}^{(j)}_{out}$ the horizon of $H$ points. During training, at each epoch, we train $f_{\theta, h_w(z^{(j)})}$ following Algorithm 1 with randomly drawn look-back window/horizon pairs $(\mathcal{T}^{(j)}_{in} \cup \mathcal{T}^{(j)}_{out})_{j \in \mathcal{B}}$ within the observed train period. Then, for a distinct new time window $\mathcal{T}^{*(j)}$, given a look-back window $\mathcal{T}_{in}^{*(j)}$, we forecast the future values at any $t \in \mathcal{T}^{*(j)}$ in the horizon interval, following Algorithm 2. We illustrate the training and inference of TimeFlow for the forecasting task in Figure 4. For further insight into the training and inference periods, as well as additional experiments conducted under different inference scenarios, see Section E.1.
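To make the construction of training pairs concrete, the hypothetical helper below draws one look-back window/horizon pair of index grids inside the training period; the released code may sample these pairs differently.

```python
import numpy as np


def draw_lookback_horizon(series_length, lookback=512, horizon=96, seed=None):
    """Draw one (T_in, T_out) pair of index grids inside the observed training period."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, series_length - lookback - horizon + 1))
    t_in = np.arange(start, start + lookback)                          # look-back window T_in
    t_out = np.arange(start + lookback, start + lookback + horizon)    # horizon T_out
    return t_in, t_out
```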

Figure 4: Training and inference procedure of TimeFlow for forecasting. (i) During training (top), for each time series $x^{(j)}$, we draw look-back window/horizon pairs within the training period. TimeFlow is trained with Algorithm 1 to predict all observed timestamps in these pairs while being conditioned on the observed look-back window. (ii) Once TimeFlow is optimized, the objective during inference (bottom) is to infer the horizon over new time windows (blue dots) while being conditioned on the newly observed look-back window (red dots).
Baselines.

To evaluate the quality of our model in long-term forecasting, we compare it to the discrete baselines PatchTST (Nie et al., 2022), DLinear (Zeng et al., 2022), AutoFormer (Wu et al., 2021), and Informer (Zhou et al., 2021). We also include the continuous baselines DeepTime and Neural Process (NP). See Section E.3.1 for the baseline training procedure and hyperparameter selection. In Table 2, we present the forecasting results for standard long-term forecasting horizons $H \in \{96, 192, 336, 720\}$. The look-back window length is fixed to 512.

Table 2: Mean MAE forecast results averaged over different time windows. Each time, the model is trained on one time window and tested on the others (there are 2 windows for SolarH and 5 for Electricity and Traffic). $H$ stands for the horizon. Bold results are best, and underlined results are second best. TimeFlow improvement represents the overall percentage improvement achieved by TimeFlow compared to the specific method being considered. TimeFlow, DeepTime, and Neural Process are continuous methods; PatchTST, DLinear, AutoFormer, and Informer are discrete methods.

| Dataset | $H$ | TimeFlow | DeepTime | Neural Process | PatchTST | DLinear | AutoFormer | Informer |
|---|---|---|---|---|---|---|---|---|
| Electricity | 96 | 0.228 ± 0.028 | 0.244 ± 0.026 | 0.392 ± 0.045 | 0.221 ± 0.023 | 0.241 ± 0.030 | 0.546 ± 0.277 | 0.603 ± 0.255 |
| Electricity | 192 | 0.238 ± 0.020 | 0.252 ± 0.019 | 0.401 ± 0.046 | 0.229 ± 0.020 | 0.252 ± 0.025 | 0.500 ± 0.190 | 0.690 ± 0.291 |
| Electricity | 336 | 0.270 ± 0.031 | 0.284 ± 0.034 | 0.434 ± 0.076 | 0.251 ± 0.027 | 0.288 ± 0.038 | 0.523 ± 0.188 | 0.736 ± 0.271 |
| Electricity | 720 | 0.316 ± 0.055 | 0.359 ± 0.051 | 0.607 ± 0.150 | 0.297 ± 0.039 | 0.365 ± 0.059 | 0.631 ± 0.237 | 0.746 ± 0.265 |
| SolarH | 96 | 0.190 ± 0.013 | 0.190 ± 0.020 | 0.221 ± 0.048 | 0.262 ± 0.070 | 0.208 ± 0.014 | 0.245 ± 0.045 | 0.248 ± 0.022 |
| SolarH | 192 | 0.202 ± 0.020 | 0.204 ± 0.028 | 0.244 ± 0.048 | 0.253 ± 0.051 | 0.217 ± 0.022 | 0.333 ± 0.107 | 0.270 ± 0.031 |
| SolarH | 336 | 0.209 ± 0.017 | 0.199 ± 0.026 | 0.240 ± 0.006 | 0.259 ± 0.071 | 0.217 ± 0.026 | 0.334 ± 0.079 | 0.328 ± 0.048 |
| SolarH | 720 | 0.218 ± 0.041 | 0.229 ± 0.024 | 0.403 ± 0.147 | 0.267 ± 0.064 | 0.249 ± 0.034 | 0.351 ± 0.055 | 0.337 ± 0.037 |
| Traffic | 96 | 0.217 ± 0.032 | 0.228 ± 0.032 | 0.283 ± 0.027 | 0.203 ± 0.037 | 0.228 ± 0.033 | 0.319 ± 0.059 | 0.372 ± 0.078 |
| Traffic | 192 | 0.212 ± 0.028 | 0.220 ± 0.022 | 0.292 ± 0.024 | 0.197 ± 0.030 | 0.221 ± 0.023 | 0.368 ± 0.057 | 0.511 ± 0.247 |
| Traffic | 336 | 0.238 ± 0.034 | 0.245 ± 0.038 | 0.305 ± 0.039 | 0.222 ± 0.039 | 0.250 ± 0.040 | 0.434 ± 0.061 | 0.561 ± 0.263 |
| Traffic | 720 | 0.279 ± 0.050 | 0.290 ± 0.052 | 0.339 ± 0.038 | 0.269 ± 0.057 | 0.300 ± 0.057 | 0.462 ± 0.062 | 0.638 ± 0.067 |
| TimeFlow improvement | | / | 3.74 % | 29.06 % | 3.23 % | 6.92 % | 42.09 % | 48.57 % |
Results.

The results in Table 2 show that our approach ranks in the top two across all datasets and horizons and is the best continuous method overall. TimeFlow's performance is comparable to that of the current SOTA model PatchTST, with only a 2% relative difference. Moreover, TimeFlow shows consistent results across the three datasets, whereas the performance of the other best discrete and continuous baselines, i.e., PatchTST and DeepTime, drops on some datasets. We also note that, despite the strong performance of PatchTST, the other transformer-based baselines (discrete methods in Table 2) perform poorly. We provide a detailed insight into these results in Section E.1. Overall, although this evaluation setting favors discrete methods because the time series are observed at evenly distributed time steps, TimeFlow consistently performs as well as PatchTST and outperforms all the other methods, whether discrete or continuous. It is the first time that a continuous model achieves the same level of performance as discrete methods within their specific setting.

4.2.2 Forecasting on previously unseen time series.

This section discusses how TimeFlow adapts to unseen time series, which is critical in forecasting. Indeed, in many real-world applications, forecasters are trained on a limited subset of the available samples and applied to a wider range of samples during inference. The original architectures of Informer, AutoFormer, and DLinear directly model the relationships between time series (channel dependence), limiting their adaptability to new samples. In contrast, TimeFlow takes a different approach by considering the series observed at different locations as distinct samples, similar to PatchTST, Neural Process, and DeepTime. This independence allows TimeFlow to effectively generalize to previously unseen time series of the same phenomenon.

Setting.

In this setting, we evaluate how TimeFlow performs on previously unseen time series. We compare it to the best forecaster, PatchTST. We train TimeFlow and PatchTST on 50% of the samples and consider the remaining 50% as new time series. The training procedure is the same as described in Figure 4. In Figure 5, we present the results of TimeFlow and PatchTST for both known and new samples (for periods outside the training window).

Results.

The results in Figure 5 highlight two key observations. First, both approaches show robust adaptability to new samples, as evidenced by the minimal difference in mean absolute error between known and new samples at inference. Second, TimeFlow and PatchTST exhibit comparable performance in this context, with negligible differences across horizons and datasets.

Figure 5: Mean MAE forecasting task results over different horizons in the context of generalization to new time series. Comparison of TimeFlow and PatchTST performances on the Electricity, Traffic and SolarH datasets.

4.3 Challenging task: Forecast while imputing incomplete look-back windows

In real-world scenarios, it is common to encounter missing or irregularly sampled series when making predictions on new time windows (Cinar et al., 2018; Tang et al., 2020). Continuous methods can handle these cases, as they are designed to accommodate irregular sampling within the look-back window. In this section, we formulate a task that simulates such real-world scenarios. It is worth noting that this task is frequently encountered in practice but rarely considered in the deep learning literature.

Setting and baselines.

This scenario is similar to the forecasting setting of Section 4.2, illustrated in Figure 4. The difference is that, during inference, the look-back window is subsampled at a rate τ smaller than the one used in the training phase, which simulates missing observations in the look-back window. Consequently, two distinct tasks emerge at inference: imputing the missing points within the sparsely observed look-back window, and forecasting over the horizon with this degraded context. In Table 3, we compare TimeFlow to the two other continuous baselines, DeepTime and Neural Process, on Electricity and Traffic for different values of τ and different horizons.
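The sketch below shows how such a degraded look-back window can be simulated at inference; the uniform random subsampling is our assumption, only the observation rate τ comes from the setting above.

```python
import numpy as np

rng = np.random.default_rng(0)
look_back, tau = 512, 0.1                  # window length and observation rate

t = np.arange(look_back)                   # full time grid of the look-back window
x = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(look_back)  # toy series

# Keep a random tau-fraction of the time steps: these are the only observations
# available to condition the model before imputing the window and forecasting
# over the horizon.
observed = np.sort(rng.choice(look_back, size=int(tau * look_back), replace=False))
t_obs, x_obs = t[observed], x[observed]
print(t_obs.shape)                         # (51,) observed points out of 512
```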

Table 3: MAE results for forecasting with missing values in the look-back window. τ stands for the percentage of observed values in the look-back window. Best results are in bold.
TimeFlow DeepTime Neural Process
H τ Imputation error Forecast error Imputation error Forecast error Imputation error Forecast error
Electricity 96 0.5 0.151 ± 0.003 0.239 ± 0.013 0.209 ± 0.004 0.270 ± 0.019 0.460 ± 0.048 0.486 ± 0.078
0.2 0.208 ± 0.006 0.260 ± 0.015 0.249 ± 0.006 0.296 ± 0.023 0.644 ± 0.079 0.650 ± 0.095
0.1 0.272 ± 0.006 0.295 ± 0.016 0.284 ± 0.007 0.324 ± 0.026 0.740 ± 0.083 0.737 ± 0.106
192 0.5 0.149 ± 0.004 0.235 ± 0.011 0.204 ± 0.004 0.265 ± 0.018 0.461 ± 0.045 0.498 ± 0.070
0.2 0.209 ± 0.006 0.257 ± 0.013 0.244 ± 0.007 0.290 ± 0.023 0.601 ± 0.075 0.626 ± 0.101
0.1 0.274 ± 0.010 0.289 ± 0.016 0.282 ± 0.007 0.315 ± 0.025 0.461 ± 0.045 0.724 ± 0.090
Traffic 96 0.5 0.180 ± 0.016 0.219 ± 0.026 0.272 ± 0.028 0.243 ± 0.030 0.436 ± 0.025 0.444 ± 0.047
0.2 0.239 ± 0.019 0.243 ± 0.027 0.335 ± 0.026 0.293 ± 0.027 0.596 ± 0.049 0.597 ± 0.075
0.1 0.312 ± 0.020 0.290 ± 0.027 0.385 ± 0.025 0.344 ± 0.027 0.734 ± 0.102 0.731 ± 0.132
192 0.5 0.176 ± 0.014 0.217 ± 0.017 0.241 ± 0.027 0.234 ± 0.021 0.477 ± 0.042 0.476 ± 0.043
0.2 0.233 ± 0.017 0.236 ± 0.021 0.286 ± 0.027 0.276 ± 0.020 0.685 ± 0.109 0.678 ± 0.108
0.1 0.304 ± 0.019 0.277 ± 0.021 0.331 ± 0.025 0.324 ± 0.021 0.888 ± 0.178 0.877 ± 0.174
TimeFlow improvement / / 18.97% 11.87% 61.88% 58.41%
Results.

In Table 3, TimeFlow consistently outperforms the other methods in both imputation and forecasting, in every scenario. Compared with the fully observed look-back window scenario of Table 2, at a 0.5 sampling rate TimeFlow shows only a slight reduction in performance, whereas the other baselines experience more significant drops. For instance, when comparing forecast results between a complete window and a τ = 0.5 subsampled window on Electricity with horizon H = 96, TimeFlow's error increases by a mere 4.6% (from 0.228 to 0.239), while DeepTime's error grows by over 10% (from 0.244 to 0.270) and Neural Process' rises by around 25% (from 0.392 to 0.486). For lower sampling rates, TimeFlow still delivers reasonable predictions. Qualitatively, the example in Figure 6 shows that, despite observing only 10% of the look-back window, the model correctly infers both the complete look-back window and the horizon. These quantitative and qualitative results demonstrate the robustness and efficiency of TimeFlow in this particularly challenging setting.

Figure 6: Traffic dataset, sample 95. TimeFlow simultaneously imputes and forecasts at horizon 96 from a look-back window of length 512 of which only 10% is observed.

5 Limitations

While TimeFlow shows promising performance across various tasks and settings, it is important to recognize some limitations. First, due to its auto-decoding process, TimeFlow is significantly slower at inference than the other baselines, by one to two orders of magnitude (Table 23). In addition, although TimeFlow effectively handles sets of homogeneous time series, additional mechanisms would be required to handle heterogeneous time series with different frequencies. The per-context shift modulation mechanism does not allow TimeFlow to fit time series with drastically different structures, which also explains why TimeFlow is particularly well suited to regular frequency patterns; in the experimental section, all the considered datasets exhibit pronounced periodicities. Finally, effective training of TimeFlow requires a relatively large number of samples (typically ≥ 100) so that the model can accurately distinguish individual patterns from shared information.

6 Conclusion

We have introduced a unified framework for continuous time series modeling leveraging conditional INR and meta-learning. Our experiments have demonstrated superior performance compared to other continuous methods, and better or comparable results to SOTA discrete methods. One of the standout features of our framework is its inherent continuity and the ability to modulate the INR parameters. This unique flexibility lets TimeFlow effectively tackle a wide array of challenges, including forecasting in the presence of missing values, accommodating irregular time steps, and extending the trained model’s applicability to previously unseen time series and new time windows. Our empirical results have shown TimeFlow’s effectiveness in handling homogeneous multivariate time series. As a logical next step, extending TimeFlow’s capabilities to address heterogeneous multivariate phenomena represents a promising direction for future research.

Acknowledgment

We would like to thank Tahar Nabil for the valuable discussions on this project.

References

  • Bilos et al. [2023] M. Bilos, K. Rasul, A. Schneider, Y. Nevmyvaka, and S. Günnemann. Modeling temporal data as continuous functions with stochastic process diffusion. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML, volume 202 of Proceedings of Machine Learning Research, pages 2452–2470. PMLR, 2023.
  • Brouwer et al. [2019] E. D. Brouwer, J. Simm, A. Arany, and Y. Moreau. Gru-ode-bayes: continuous modeling of sporadically-observed time series. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 7379–7390, 2019.
  • Cao et al. [2018] W. Cao, D. Wang, J. Li, H. Zhou, Y. Li, and L. Li. Brits: bidirectional recurrent imputation for time series. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6776–6786, 2018.
  • Chen et al. [2001] H. Chen, S. Grant-Muller, L. Mussone, and F. Montgomery. A study of hybrid neural network approaches and the effects of missing data on traffic forecasting. Neural Computing & Applications, 10:277–286, 2001.
  • Chen and Guestrin [2016] T. Chen and C. Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
  • Cinar et al. [2018] Y. G. Cinar, H. Mirisaee, P. Goswami, É. Gaussier, and A. Aït-Bachir. Period-aware content attention rnns for time series forecasting with missing values. Neurocomputing, 312:177–186, 2018.
  • Clark and Bjørnstad [2004] J. S. Clark and O. N. Bjørnstad. Population time series: process variability, observation errors, missing values, lags, and hidden states. Ecology, 85(11):3140–3150, 2004.
  • Corani et al. [2021] G. Corani, A. Benavoli, and M. Zaffalon. Time series forecasting with gaussian processes needs priors. In Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track: European Conference, ECML PKDD 2021, Bilbao, Spain, September 13–17, 2021, Proceedings, Part IV 21, pages 103–117. Springer, 2021.
  • Du et al. [2023] W. Du, D. Côté, and Y. Liu. Saits: Self-attention-based imputation for time series. Expert Systems with Applications, 219:119619, 2023.
  • Dupont et al. [2022] E. Dupont, H. Kim, S. M. A. Eslami, D. J. Rezende, and D. Rosenbaum. From data to functa: Your data point is a function and you can treat it like one. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 5694–5725. PMLR, 2022.
  • Fathony et al. [2021] R. Fathony, A. K. Sahu, D. Willmott, and J. Z. Kolter. Multiplicative filter networks. In International Conference on Learning Representations, 2021.
  • Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
  • Fons et al. [2022] E. Fons, A. Sztrajman, Y. El-Laham, A. Iosifidis, and S. Vyetrenko. Hypertime: Implicit neural representation for time series. CoRR, abs/2208.05836, 2022.
  • Fortuin et al. [2020] V. Fortuin, D. Baranchuk, G. Rätsch, and S. Mandt. Gp-vae: Deep probabilistic time series imputation. In International conference on artificial intelligence and statistics, pages 1651–1661. PMLR, 2020.
  • Garnelo et al. [2018] M. Garnelo, D. Rosenbaum, C. Maddison, T. Ramalho, D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and S. M. A. Eslami. Conditional neural processes. In Proceedings of the 35th International Conference on Machine Learning, ICML, volume 80, pages 1690–1699. PMLR, 2018.
  • Hastie [2017] T. J. Hastie. Generalized additive models. In Statistical models in S, pages 249–307. Routledge, 2017.
  • Huang and Hoefler [2023] L. Huang and T. Hoefler. Compressing multidimensional weather and climate data into neural networks. In International Conference on Learning Representations, ICLR, 2023.
  • Hyndman and Athanasopoulos [2018] R. J. Hyndman and G. Athanasopoulos. Forecasting: principles and practice. OTexts, 2018.
  • Jeong and Shin [2022] K. Jeong and Y. Shin. Time-series anomaly detection with implicit neural representation. CoRR, abs/2201.11950, 2022.
  • Kim et al. [2019] T. Kim, W. Ko, and J. Kim. Analysis and impact evaluation of missing data imputation in day-ahead pv generation forecasting. Applied Sciences, 9(1):204, 2019.
  • Liu et al. [2022] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dustdar. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
  • Liu et al. [2023] S. Liu, X. Li, G. Cong, Y. Chen, and Y. Jiang. Multivariate time-series imputation with disentangled temporal representations. In The Eleventh International Conference on Learning Representations, ICLR, 2023.
  • Liu et al. [2019] Y. Liu, R. Yu, S. Zheng, E. Zhan, and Y. Yue. Naomi: Non-autoregressive multiresolution sequence imputation. Advances in neural information processing systems, 32, 2019.
  • Luo et al. [2018] Y. Luo, X. Cai, Y. Zhang, J. Xu, et al. Multivariate time series imputation with generative adversarial networks. Advances in neural information processing systems, 31, 2018.
  • Luo et al. [2019] Y. Luo, Y. Zhang, X. Cai, and X. Yuan. E2gan: End-to-end generative adversarial network for multivariate time series imputation. In Proceedings of the 28th international joint conference on artificial intelligence, pages 3094–3100. AAAI Press, 2019.
  • Mildenhall et al. [2021] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Nichol et al. [2018] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
  • Nie et al. [2022] Y. Nie, N. H. Nguyen, P. Sinthong, and J. Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. CoRR, abs/2211.14730, 2022.
  • Rasmussen and Williams [2006] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
  • Rubanova et al. [2019] Y. Rubanova, R. T. Q. Chen, and D. Duvenaud. Latent odes for irregularly-sampled time series. CoRR, abs/1907.03907, 2019.
  • Schulz and Stattegger [1997] M. Schulz and K. Stattegger. Spectrum: Spectral analysis of unevenly spaced paleoclimatic time series. Computers & Geosciences, 23(9):929–945, 1997.
  • Shukla and Marlin [2021] S. N. Shukla and B. M. Marlin. Multi-time attention networks for irregularly sampled time series. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
  • Sitzmann et al. [2020] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • Tancik et al. [2020] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.
  • Tang et al. [2020] X. Tang, H. Yao, Y. Sun, C. C. Aggarwal, P. Mitra, and S. Wang. Joint modeling of local and global temporal dynamics for multivariate time series forecasting with missing values. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI, pages 5956–5963. AAAI Press, 2020.
  • Tashiro et al. [2021] Y. Tashiro, J. Song, Y. Song, and S. Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.
  • Taylor and Letham [2018] S. J. Taylor and B. Letham. Forecasting at scale. The American Statistician, 72(1):37–45, 2018.
  • Woo et al. [2022] G. Woo, C. Liu, D. Sahoo, A. Kumar, and S. C. H. Hoi. Deeptime: Deep time-index meta-learning for non-stationary time-series forecasting. CoRR, abs/2207.06046, 2022.
  • Wu et al. [2021] H. Wu, J. Xu, J. Wang, and M. Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 22419–22430, 2021.
  • Yin et al. [2023] Y. Yin, M. Kirchmeyer, J.-Y. Franceschi, A. Rakotomamonjy, and P. Gallinari. Continuous pde dynamics forecasting with implicit neural representations. In International Conference on Learning Representations, ICLR, 2023.
  • Zeng et al. [2022] A. Zeng, M. Chen, L. Zhang, and Q. Xu. Are transformers effective for time series forecasting? CoRR, abs/2205.13504, 2022.
  • Zhou et al. [2021] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, pages 11106–11115, 2021.
  • Zhou et al. [2022] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, editors, International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 27268–27286. PMLR, 2022.
  • Zintgraf et al. [2019] L. Zintgraf, K. Shiarli, V. Kurin, K. Hofmann, and S. Whiteson. Fast context adaptation via meta-learning. In International Conference on Machine Learning, pages 7693–7702. PMLR, 2019.

Appendix A Reproducibility statement

Our work is entirely reproducible; this section gathers all the references needed to reproduce it.

Code.

The code for all our experiments is available at this link.

Data.

A subset of the processed data is available with the code at this link. The dataset description, processing and normalization are presented in Appendix C.

Model.

The model and the training details are presented in Section 3 and the hyperparameter selection is available in Section B.1.

GPU.

We used a single NVIDIA TITAN RTX 24GB GPU to conduct all the experiments for our method, which is implemented in PyTorch (Python 3.9.2).

Appendix B Architecture details and ablation studies

B.1 Architecture details

For all imputation and forecasting experiments, we choose the following hyperparameters:

  • z dimension: 128

  • Number of layers: 5

  • Hidden layers dimension: 256

  • γ(t) ∈ ℝ^(2×64)

  • z code learning rate (α in Algorithm 1): 10^-2

  • Hypernetwork and INR learning rate: 5 × 10^-4

  • Number of steps in inner loop: K = 3

  • Number of epochs: 4 × 10^4

  • Batch size: 64

It is worth noting that the hyperparameters mentioned above remain consistent across all experiments conducted in the paper. We chose to maintain a fixed set of hyperparameters for our model, while other imputation and forecasting approaches commonly fine-tune hyperparameters based on a validation dataset. The obtained results exhibit high robustness across various settings, suggesting that the selected hyperparameters are already effective in achieving reliable outcomes.
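For reference, this fixed configuration can be gathered in a single dictionary; the sketch below is purely illustrative and the field names are ours, not those of the released code.

```python
# Shared TimeFlow hyperparameters used for every experiment in the paper.
TIMEFLOW_CONFIG = {
    "latent_dim": 128,       # dimension of the code z
    "num_layers": 5,         # INR depth
    "hidden_dim": 256,       # INR width
    "num_frequencies": 64,   # gamma(t) lives in R^(2 x 64)
    "code_lr": 1e-2,         # inner-loop learning rate alpha for z (Algorithm 1)
    "outer_lr": 5e-4,        # hypernetwork and INR learning rate
    "inner_steps": 3,        # K gradient steps in the inner loop
    "epochs": 40_000,
    "batch_size": 64,
}
```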

B.2 Ablation studies

B.2.1 Fourier features vs SIREN on imputation task

Baseline

The SIREN network differs from the Fourier features network in that it does not explicitly take frequencies as input. Instead, it is a multi-layer perceptron with sine activation functions. An adjustable parameter, denoted ω₀, multiplies the input matrices of the preceding layers to capture a broader range of frequencies. For this comparison, we adopt the same hyperparameters as described in Section B.1, selecting ω₀ = 30 to align with Sitzmann et al. [2020]. Furthermore, we set the learning rate of both the hypernetwork and the INR to 5 × 10^-5 to enhance training stability. In Table 4, we compare the imputation results obtained with the Fourier features network and the SIREN network, focusing on the first time window of the Electricity, Traffic and Solar datasets.

Table 4: MAE imputation errors on the first time window of each dataset. Best results are bold.
τ TimeFlow TimeFlow w SIREN
Electricity 0.05 0.323 0.466
0.10 0.252 0.350
0.20 0.224 0.242
0.30 0.211 0.222
0.50 0.194 0.209
Solar 0.05 0.105 0.114
0.10 0.083 0.094
0.20 0.065 0.079
0.30 0.061 0.072
0.50 0.056 0.066
Traffic 0.05 0.292 0.333
0.10 0.220 0.252
0.20 0.168 0.191
0.30 0.152 0.163
0.50 0.141 0.154
Results

According to the results in Table 4, the Fourier features network outperforms the SIREN network on the imputation task for these datasets. Notably, the performance gap between the two architectures is more pronounced at low sampling rates. This disparity can be attributed to the SIREN network's difficulty in accurately capturing high frequencies when the time series is sparsely observed. We hypothesize that the MLP with ReLU activations correctly learns the different frequencies of time series with multiple temporal patterns by switching the Fourier embedding frequencies on or off.
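To make the comparison concrete, the following PyTorch sketch contrasts a Fourier features embedding followed by a ReLU MLP with a small SIREN stack using the ω₀ scaling. It is a simplified illustration: the frequency schedule and the absence of modulations are our assumptions, not the exact networks used in the experiments.

```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map a time coordinate t to [cos(w_k t), sin(w_k t)] for K frequencies.
    The linear frequency schedule below is an assumption for illustration."""
    def __init__(self, num_frequencies: int = 64):
        super().__init__()
        freqs = torch.arange(1, num_frequencies + 1).float() * math.pi
        self.register_buffer("freqs", freqs)

    def forward(self, t: torch.Tensor) -> torch.Tensor:   # t: (..., 1)
        angles = t * self.freqs                            # (..., K)
        return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)

class SirenLayer(nn.Module):
    """Linear layer followed by sin(omega_0 * x), as in Sitzmann et al. [2020]."""
    def __init__(self, in_dim: int, out_dim: int, omega_0: float = 30.0):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.omega_0 = omega_0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sin(self.omega_0 * self.linear(x))

t = torch.rand(8, 1)   # 8 time coordinates in [0, 1]
fourier_inr = nn.Sequential(FourierFeatures(64), nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
siren_inr = nn.Sequential(SirenLayer(1, 256), SirenLayer(256, 256), nn.Linear(256, 1))
print(fourier_inr(t).shape, siren_inr(t).shape)            # both torch.Size([8, 1])
```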

B.2.2 Influence of the latent code dimension

The dimension of the latent code z is a crucial parameter of our architecture. If it is too small, the model underfits the time series, which adversely affects performance on both the imputation and forecasting tasks. On the other hand, if the dimension of z is too large, it can lead to overfitting and hinder the model's ability to generalize to new data points.

Baselines

To investigate the impact of the dimensionality of z on the performance of TimeFlow, we conducted experiments on the three considered datasets, focusing on the forecasting task. We varied the size of z within {32, 64, 128, 256}. The other hyperparameters are set as presented in Section B.1. The results obtained for each z dimension are summarized in Table 5.

Table 5: MAE error for different z dimensions.
H 32 64 128 256
Electricity 96 0.232 ± 0.016 0.222 ± 0.017 0.222 ± 0.018 0.215 ± 0.019
192 0.245 ± 0.020 0.239 ± 0.018 0.230 ± 0.026 0.233 ± 0.017
336 0.254 ± 0.029 0.244 ± 0.028 0.262 ± 0.031 0.243 ± 0.032
720 0.295 ± 0.027 0.284 ± 0.028 0.303 ± 0.041 0.283 ± 0.029
SolarH 96 0.182 ± 0.009 0.181 ± 0.012 0.179 ± 0.003 0.225 ± 0.047
192 0.195 ± 0.014 0.195 ± 0.016 0.193 ± 0.015 0.197 ± 0.029
336 0.181 ± 0.011 0.182 ± 0.011 0.189 ± 0.013 0.183 ± 0.012
720 0.201 ± 0.027 0.199 ± 0.025 0.209 ± 0.029 0.200 ± 0.030
Traffic 96 0.223 ± 0.024 0.215 ± 0.028 0.215 ± 0.037 0.210 ± 0.033
192 0.214 ± 0.018 0.217 ± 0.025 0.206 ± 0.023 0.203 ± 0.024
336 0.238 ± 0.029 0.231 ± 0.029 0.226 ± 0.030 0.229 ± 0.029
720 0.272 ± 0.040 0.269 ± 0.035 0.259 ± 0.038 0.262 ± 0.040
Results

The results presented in Table 5 suggest that a z dimension of 128 is a reasonable compromise, although it is only optimal in some settings. Moreover, even though the choice of the z dimension matters, it does not critically impact the MAE error on the forecasting task.

B.2.3 Influence of the number of gradient steps

As can be seen in Table 6, using three gradient steps at inference yields an inference time of less than 0.2 seconds. This can be further reduced by performing only one step, at the cost of an increase in the forecasting error. As also observed in Table 6, increasing the number of gradient steps beyond 3 during inference does not improve forecasting performance.

Table 6: Inference time (in seconds) and MAE error on the forecasting task on the Electricity dataset for a horizon of length 720, a look-back window of length 512, and a varying number of adaptation gradient steps. The statistics are computed over 10 runs using an NVIDIA TITAN RTX GPU.
Gradient descent steps 1 3 10 50 500 5000
Inference time (s) 0.109 ± 0.003 0.176 ± 0.009 0.427 ± 0.031 3.547 ± 0.135 17.722 ± 0.536 189.487 ± 8.060
MAE 0.351 ± 0.038 0.303 ± 0.041 0.300 ± 0.040 0.299 ± 0.039 0.302 ± 0.038 0.308 ± 0.037
Table 7: MAE error on the forecasting task using 1 inner-step during training and a varying number of adaptation gradient steps at inference. Best results are in bold and the / symbol means that the MAE score is very high (≥ 1).
H 1 3 10 50
Electricity 96 0.244 ± 0.017 0.246 ± 0.017 0.261 ± 0.016 /
192 0.253 ± 0.024 0.253 ± 0.022 0.261 ± 0.020 0.265 ± 0.019
336 0.267 ± 0.032 0.268 ± 0.030 0.277 ± 0.028 0.281 ± 0.027
720 0.302 ± 0.030 0.306 ± 0.029 0.310 ± 0.028 0.301 ± 0.029
SolarH 96 0.192 ± 0.023 0.623 ± 0.397 / /
192 0.175 ± 0.006 0.252 ± 0.068 / /
336 0.192 ± 0.016 0.471 ± 0.029 / /
720 0.216 ± 0.034 0.465 ± 0.063 / 0.550 ± 0.187
Traffic 96 0.215 ± 0.029 0.329 ± 0.039 / /
192 0.208 ± 0.019 0.310 ± 0.033 0.312 ± 0.032 /
336 0.237 ± 0.028 0.307 ± 0.038 / /
720 0.263 ± 0.038 0.320 ± 0.040 / /
Table 8: MAE error on the forecasting task using 3 inner-steps during training and a varying number of adaptation gradient steps at inference. Best results are in bold.
H 1 3 10 50
Electricity 96 0.259 ± 0.020 0.222 ± 0.018 0.222 ± 0.017 0.228 ± 0.019
192 0.269 ± 0.020 0.230 ± 0.026 0.232 ± 0.020 0.233 ± 0.026
336 0.273 ± 0.033 0.262 ± 0.031 0.264 ± 0.032 0.268 ± 0.032
720 0.351 ± 0.038 0.303 ± 0.041 0.300 ± 0.040 0.299 ± 0.039
SolarH 96 0.487 ± 0.196 0.179 ± 0.003 0.181 ± 0.003 0.186 ± 0.003
192 0.411 ± 0.088 0.193 ± 0.015 0.195 ± 0.014 0.199 ± 0.013
336 0.435 ± 0.153 0.189 ± 0.013 0.203 ± 0.006 0.223 ± 0.012
720 0.394 ± 0.173 0.209 ± 0.029 0.203 ± 0.006 0.209 ± 0.027
Traffic 96 0.320 ± 0.038 0.215 ± 0.037 0.219 ± 0.043 0.226 ± 0.046
192 0.299 ± 0.023 0.206 ± 0.023 0.209 ± 0.026 0.214 ± 0.027
336 0.345 ± 0.038 0.226 ± 0.030 0.228 ± 0.031 0.233 ± 0.032
720 0.321 ± 0.034 0.259 ± 0.038 0.260 ± 0.038 0.266 ± 0.039
Table 9: MAE error on the forecasting task using 10 inner-steps during training and a varying number of adaptation gradient steps at inference. Best results are in bold.
H 1 3 10 50
Electricity 96 0.381 ± 0.030 0.249 ± 0.024 0.236 ± 0.024 0.238 ± 0.024
192 0.448 ± 0.045 0.273 ± 0.019 0.244 ± 0.014 0.244 ± 0.013
336 0.514 ± 0.053 0.283 ± 0.033 0.241 ± 0.025 0.242 ± 0.024
720 0.647 ± 0.068 0.400 ± 0.051 0.286 ± 0.023 0.287 ± 0.021
SolarH 96 0.605 ± 0.029 0.380 ± 0.018 0.188 ± 0.012 0.199 ± 0.015
192 0.382 ± 0.072 0.250 ± 0.012 0.202 ± 0.034 0.204 ± 0.035
336 0.745 ± 0.105 0.431 ± 0.221 0.201 ± 0.033 0.208 ± 0.032
720 0.745 ± 0.082 0.477 ± 0.039 0.205 ± 0.030 0.205 ± 0.029
Traffic 96 0.450 ± 0.023 0.273 ± 0.026 0.225 ± 0.028 0.230 ± 0.034
192 0.506 ± 0.028 0.318 ± 0.021 0.233 ± 0.022 0.236 ± 0.026
336 0.500 ± 0.042 0.320 ± 0.021 0.247 ± 0.028 0.249 ± 0.031
720 0.511 ± 0.035 0.323 ± 0.022 0.266 ± 0.027 0.272 ± 0.024
Results

We conduct more extensive experiments in Table 7, Table 8 and Table 9 to quantify how the MAE varies with the number of gradient steps used during training and inference. The tables show that using the same number of steps at training and inference leads to better results, which is expected since a mismatch makes the inference model slightly different from the training model. In addition, using 3 gradient steps instead of 1 clearly improves performance, but using 10 instead of 3 does not: it usually leads to slightly better results for longer horizons, but the gain is unclear for shorter ones. Hence, 3 gradient steps is a suitable choice.

B.2.4 TimeFlow variants with other meta-learning techniques

Baselines

Before converging to the current architecture and optimization of TimeFlow, we explored different options to condition the INR on the observations. The first one was inspired by the neural process architecture, which uses a set encoder to transform a set of observations (t_i, x_{t_i})_{i∈I} into a latent code z by applying a pooling layer after a feed-forward network. We observed that this encoder, in combination with the modulated Fourier features network, achieved relatively good results on the forecasting task but suffered from underfitting on more complex datasets such as Electricity.

This led us to consider auto-decoding, i.e. encoder-less, architectures for conditioning the weights of the coordinate-based network. We trained TimeFlow with the REPTILE algorithm [Nichol et al., 2018], a first-order meta-learning technique that adapts the code in a few steps of gradient descent. Compared with a second-order method, we observed that REPTILE was less costly to train but struggled to escape suboptimal minima, which led to unstable training and underfitting.

From an implementation point of view, the only difference between the second-order and first-order versions is that in the latter the code is detached from the computation graph before the outer-loop parameter update. When the code is not detached, it remains a function of the common parameters, z = z(θ, w), which means that the computation graph of the outer loop also includes the inner-loop updates of the codes. Therefore, the outer-loop gradient update involves a gradient through a gradient and requires an additional backward pass through the INR to compute the Hessian. Please refer to Finn et al. [2017] for more technical details.
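The toy PyTorch sketch below locates this single difference: with create_graph=True the outer loss is backpropagated through the inner-loop updates of the code (second order), whereas detaching the code gives the first-order variant. The conditioning by concatenation is a toy stand-in for TimeFlow's shift modulations, and all sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim, first_order = 8, False          # True -> REPTILE-style update
inr = nn.Sequential(nn.Linear(1 + latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(inr.parameters(), lr=5e-4)

def decode(t, z):
    """Toy conditioning: concatenate the code z to each time coordinate t."""
    return inr(torch.cat([t, z.expand(t.shape[0], -1)], dim=-1))

t = torch.rand(64, 1)                       # observed coordinates
x = torch.sin(2 * math.pi * t)              # toy observed values

# Inner loop: adapt the per-sample code on the observations.
z = torch.zeros(1, latent_dim, requires_grad=True)
for _ in range(3):
    inner_loss = torch.mean((decode(t, z) - x) ** 2)
    (g,) = torch.autograd.grad(inner_loss, z, create_graph=not first_order)
    z = z - 1e-2 * g                        # z stays a function of inr's parameters

if first_order:
    z = z.detach()                          # cut the graph: no gradient through the inner loop

# Outer loop: with the second-order variant, this backward pass goes through the
# inner-loop updates (a gradient through a gradient); with first_order it does not.
outer_loss = torch.mean((decode(t, z) - x) ** 2)
optimizer.zero_grad()
outer_loss.backward()
optimizer.step()
```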

Table 10: Comparison of second-order and first-order (REPTILE) meta-learning for TimeFlow on the imputation task. Mean MAE results on the missing grid over five different time windows. τ stands for the subsampling rate. Bold results are best.
τ TimeFlow TimeFlow w REPTILE
0.05 0.324 ± 0.013 0.363 ± 0.062
0.10 0.250 ± 0.010 0.343 ± 0.036
Electricity 0.20 0.225 ± 0.008 0.312 ± 0.043
0.30 0.212 ± 0.007 0.308 ± 0.035
0.50 0.194 ± 0.007 0.305 ± 0.046
0.05 0.095 ± 0.015 0.125 ± 0.025
0.10 0.083 ± 0.015 0.123 ± 0.032
Solar 0.20 0.072 ± 0.015 0.108 ± 0.021
0.30 0.061 ± 0.012 0.105 ± 0.027
0.50 0.054 ± 0.013 0.102 ± 0.021
0.05 0.283 ± 0.016 0.304 ± 0.026
0.10 0.211 ± 0.012 0.264 ± 0.009
Traffic 0.20 0.168 ± 0.006 0.242 ± 0.019
0.30 0.151 ± 0.007 0.218 ± 0.020
0.50 0.139 ± 0.007 0.216 ± 0.017
Results

In Table 10, we show the performance of first-order TimeFlow on the imputation task. In low sampling regimes the difference with TimeFlow is less noticeable, but its performance plateaus when the number of observed points increases. This is not surprising: even though the task itself becomes easier as τ increases, the optimization becomes harder with the increased number of observations. We also report the performance of TimeFlow with a set encoder on the forecasting task in Table 11; this version failed to generalize well on complex datasets.

Table 11: Comparison of optimization-based and set-encoder-based meta-learning for TimeFlow on the forecasting task. Mean MAE forecast results over different time windows. H stands for the horizon. Bold results are best.
H TimeFlow TimeFlow w set encoder
96 0.228 ± 0.026 0.362 ± 0.032
192 0.238 ± 0.020 0.360 ± 0.028
Electricity 336 0.270 ± 0.031 0.382 ± 0.038
720 0.316 ± 0.055 0.431 ± 0.059
96 0.190 ± 0.013 0.251 ± 0.071
192 0.202 ± 0.020 0.239 ± 0.058
SolarH 336 0.209 ± 0.017 0.235 ± 0.040
720 0.218 ± 0.048 0.231 ± 0.032
96 0.217 ± 0.036 0.276 ± 0.031
192 0.212 ± 0.028 0.281 ± 0.034
Traffic 336 0.238 ± 0.034 0.297 ± 0.042
720 0.279 ± 0.050 0.333 ± 0.048

B.2.5 Influence of the modulation

In TimeFlow, we apply shift modulations to the parameters of the INR, i.e. for each layer l we only modify the biases of the network with an extra bias term ψ_l^{(j)}. We generate these bias terms with a linear hypernetwork that maps the code z^{(j)} to the modulations. The output of the l-th layer of the modulated INR is thus given by $\phi_{l+1} = \text{ReLU}(\theta_l \phi_{l-1} + b_l + \psi_l^{(j)})$, where $\psi_l^{(j)} = W_l z^{(j)}$ and $(W_l)_{l=1}^{L}$ are parameters of the hypernetwork. Another common choice combines scale and shift modulations, in which case the output of the l-th layer is given by $\phi_{l+1} = \text{ReLU}((S_l z^{(j)}) \circ (\theta_l \phi_{l-1} + b_l) + \psi_l^{(j)})$, where $\psi_l^{(j)} = W_l z^{(j)}$, $(W_l)_{l=1}^{L}$ and $(S_l)_{l=1}^{L}$ are parameters of the hypernetwork, and $\circ$ is the Hadamard product.
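The PyTorch sketch below illustrates this shift-modulated forward pass. It is a simplified stand-in (a plain ReLU MLP on the Fourier embedding, with one linear map per layer playing the role of the hypernetwork), not the exact implementation.

```python
import torch
import torch.nn as nn

class ShiftModulatedMLP(nn.Module):
    """ReLU MLP whose hidden layers receive an extra per-sample bias psi_l = W_l z."""

    def __init__(self, in_dim=128, hidden_dim=256, out_dim=1, num_layers=5, latent_dim=128):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers - 1))
        self.hyper = nn.ModuleList(nn.Linear(latent_dim, dims[i + 1], bias=False)
                                   for i in range(num_layers - 1))
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, gamma_t, z):
        h = gamma_t                               # Fourier embedding of the time coordinates
        for layer, hyper in zip(self.layers, self.hyper):
            # phi_{l+1} = ReLU(theta_l phi_l + b_l + W_l z); the scale-and-shift variant
            # would instead use ReLU((S_l z) * layer(h) + W_l z).
            h = torch.relu(layer(h) + hyper(z))
        return self.out(h)

inr = ShiftModulatedMLP()
gamma_t = torch.randn(16, 128)                    # 16 embedded time coordinates
z = torch.randn(1, 128)                           # one per-sample code, broadcast over coordinates
print(inr(gamma_t, z).shape)                      # torch.Size([16, 1])
```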

In Table 12, we conduct additional experiments on the Electricity dataset in the forecasting setting with different horizons. We compare two scenarios: one where the INR is modulated only by a shift factor, and one where it is modulated by both a shift and a scale factor. We keep the architecture and hyperparameters consistent with those described in Section B.1. The experiments in Table 12 indicate that the INR takes longer to train with shift-and-scale modulations because of the increased number of parameters. Furthermore, the shift-and-scale modulated INR performs similarly to or worse than the INR with shift modulation only. These two drawbacks, increased computational time and similar or worse performance, motivate modulating the INR with a shift factor only.

Table 12: Ablation on modulations for the forecasting task on Electricity dataset for different horizons. Models are trained on a given time window and tested on four new time windows. Models are trained on a single NVIDIA TITAN RTX GPU.
96 192 336 720
MAE Training time MAE Training time MAE Training time MAE Training time
Shift 0.233 ± 0.014 2h30 0.245 ± 0.016 2h31 0.264 ± 0.020 2h33 0.303 ± 0.041 2h46
Shift and scale 0.257 ± 0.019 3h29 0.263 ± 0.014 3h32 0.268 ± 0.025 3h45 0.308 ± 0.037 4h14

B.2.6 Discussion on other hyperparameters

While the dimension of z𝑧zitalic_z is indeed a crucial hyperparameter, it is important to note that other hyperparameters also play a significant role in the performance of the INR. For example, the number of layers in the FFN directly affects the ability of the model to fit the time series. In our experiments, we have observed that using five or more layers yields good performance, and including additional layers can lead to slight improvements in the generalization settings.

Similarly, the number of frequencies used in the frequency embedding is another important hyperparameter. Using too few frequencies can limit the network’s ability to capture patterns, while using too many frequencies can hinder its ability to generalize accurately.

The choice of learning rate is critical for achieving stable convergence during training. Therefore, in practice, we use a low learning rate combined with a cosine annealing scheduler to ensure stable and effective training.

Appendix C Datasets, scores and normalization

The complete Electricity dataset is available here, the Traffic dataset here, and the Solar dataset here. Table 13 provides a concise overview of the main information about the datasets used for the forecasting and imputation tasks.

Table 13: Summary of datasets information
Dataset name Number of samples Number of time steps Sampling frequency Location Years
Electricity 321 26 304 hourly Portugal 2012-2014
Traffic 862 17 544 hourly San Francisco bay 2015-2016
Solar 137 52 560 10 minutes Alabama 2006
SolarH 137 8 760 hourly Alabama 2006
How is the TimeFlow relative improvement score computed?

In many tables of the paper, the TimeFlow improvement score appears in the last row. Its purpose is to quantify the average relative gain of TimeFlow over the considered baseline. It is computed as follows (a small snippet reproducing the computation is given after the definitions below):

$\text{TimeFlow improvement} = \frac{1}{L}\sum_{l=1}^{L}\frac{s_{l}(\text{baseline}) - s_{l}(\text{TimeFlow})}{s_{l}(\text{baseline})}$
  • s_l stands for the Mean Absolute Error score of the considered method against the ground truth at line l (for instance, in Table 1, s_1(TimeFlow) = 0.324, s_2(TimeFlow) = 0.250, etc.).

  • L stands for the number of lines in the table (for instance, 15 in Table 1).
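The snippet below reproduces this computation on toy score values.

```python
def timeflow_improvement(timeflow_scores, baseline_scores):
    """Average relative MAE gain of TimeFlow over a baseline, in percent."""
    gains = [(b - t) / b for t, b in zip(timeflow_scores, baseline_scores)]
    return 100 * sum(gains) / len(gains)

# Toy values (not taken from the paper's tables).
print(round(timeflow_improvement([0.20, 0.30], [0.25, 0.40]), 2))   # 22.5
```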

z-normalization

To preprocess each dataset, we apply the widely used z-normalization technique per sample j on the entire series: $x^{(j)}_{\text{norm}} = \frac{x^{(j)} - \text{mean}(x^{(j)})}{\text{std}(x^{(j)})}$.
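In code, this per-sample normalization is a one-liner; the sketch below operates on an array with one row per sample.

```python
import numpy as np

x = np.random.rand(10, 1000)                # toy array: one row per sample
x_norm = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
print(x_norm.mean(axis=1).round(6), x_norm.std(axis=1).round(6))  # zeros and ones
```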

Appendix D Imputation experiments

D.1 Baselines details

D.1.1 Baselines training and hyperparameters

The baselines underwent meticulous training and extensive testing, involving a thorough exploration of hyperparameters. We used the SAITS repository (code) for BRITS and SAITS. The adopted setting results from a hyperparameter search, which yields marginally superior results for both methods compared to the default settings; the marginal difference in scores underscores the robustness of BRITS and SAITS. In addition, mTAN and TIDER did not perform optimally with the recommended configurations, requiring an extensive search. Details of the parameters explored are provided in Table 14 and Table 15. While the hyperparameter search led to performance improvements, the overall results remained sub-optimal. For CSDI, the recommended settings proved inadequate for the considered datasets, prompting a comprehensive search. Among the various parameters, the number of diffusion steps emerged as crucial: more diffusion steps significantly enhance performance, particularly at higher draw ratios, albeit at increased computational cost. The chosen parameters are detailed in Table 16. In addition, the original DeepTime implementation did not perform well on imputation, so we adapted the DeepTime training procedure (see Section D.3). Lastly, the vanilla Neural Process baseline underperformed, so we customized its architecture to conduct a fair comparison with TimeFlow: we used the INR and hypernetwork from TimeFlow to align the Neural Process with our temporal frequency bias and shift modulation technique.

Table 14: mTAN hyperparameter search.
Dimension size γ linear scheduler lr NumRefPoints k-iwae Target ratio
50 1 1 × 10^-5 32 5 0.2
100 0.95 0.0001 64 10 0.8
- 0.5 0.001 128 - -
- 0.1 0.005 - - -
Table 15: TIDER hyperparameter search.
Dimension size λ_ar λ_trend lr Season number
50 0.1 0.01 0.0001 2
100 0.2 0.1 0.001 10
- - - 0.005 15
- - - - 20
Table 16: CSDI chosen hyperparameters.
Epochs lr Layers Channels Nheads Diffusion embedding dimension NSteps Schedule Time embedding Feature embedding
5000 0.001 4 64 8 128 100 Quad 128 16

D.1.2 Models complexity

As shown in Table 17, our method has 10 times fewer parameters than BRITS and 20 times fewer than SAITS, mainly because these methods model the interactions between samples. SAITS, which is based on transformers, has the highest number of parameters, while mTAN has the lowest.

Table 17: Number of parameters of each DL method for the imputation task on the Electricity dataset.
TimeFlow DeepTime NeuralProcess mTAN SAITS BRITS TIDER
Number of parameters 602k 1315k 248k 113k 11 137k 6 220k 1 034k

D.2 Imputation for previously unseen time series

Setting

In this section, we analyze in detail the imputation results for previously unseen time series described in Section 4.1. Specifically, TimeFlow is trained on a given set of time series within a defined time window and then used for inference on new time series. We train TimeFlow on 50% of the samples and consider the remaining 50% as new time series.

In Table 18, we compare the fit scores on the observed grid and the inference scores on the missing grid, for time series seen during training and for time series unseen during training.

Table 18: TimeFlow MAE imputation errors for previously unseen time series.
Known time series New time series
τ Fit Inference Fit Inference
Electricity 0.05 0.060 ± 0.010 0.402 ± 0.021 0.142 ± 0.083 0.413 ± 0.026
0.10 0.046 ± 0.006 0.302 ± 0.010 0.144 ± 0.098 0.309 ± 0.016
0.20 0.067 ± 0.015 0.285 ± 0.014 0.154 ± 0.089 0.291 ± 0.022
0.30 0.093 ± 0.022 0.266 ± 0.010 0.163 ± 0.073 0.271 ± 0.017
0.50 0.108 ± 0.012 0.236 ± 0.010 0.167 ± 0.061 0.245 ± 0.017
Solar 0.05 0.014 ± 0.002 0.104 ± 0.015 0.050 ± 0.037 0.109 ± 0.016
0.10 0.017 ± 0.002 0.092 ± 0.015 0.052 ± 0.036 0.099 ± 0.017
0.20 0.028 ± 0.008 0.078 ± 0.014 0.058 ± 0.031 0.089 ± 0.017
0.30 0.038 ± 0.009 0.072 ± 0.013 0.063 ± 0.028 0.084 ± 0.018
0.50 0.045 ± 0.011 0.066 ± 0.013 0.067 ± 0.025 0.080 ± 0.019
Traffic 0.05 0.044 ± 0.003 0.291 ± 0.013 0.094 ± 0.051 0.291 ± 0.012
0.10 0.033 ± 0.001 0.209 ± 0.010 0.093 ± 0.060 0.216 ± 0.012
0.20 0.037 ± 0.006 0.175 ± 0.008 0.095 ± 0.058 0.186 ± 0.013
0.30 0.048 ± 0.005 0.164 ± 0.006 0.098 ± 0.051 0.175 ± 0.013
0.50 0.068 ± 0.004 0.159 ± 0.007 0.110 ± 0.042 0.169 ± 0.012
Results

The results in Table 18 indicate that the inference MAE on the missing grid is consistent between known and new samples, regardless of the dataset or sampling rate. However, there is a slight drop in performance compared to the results in Table 1, because in Table 18 the shared architecture is trained on only half of the samples.

D.3 Details on DeepTime adaptation for imputation

As DeepTime was proposed to address the forecasting task with a deep time-index model, its authors did not tackle the imputation task and left it for future work. Given the success of this method and the motivation of our work, we wanted to explore its ability to impute time series at several subsampling rates. Following our framework, we first tried to train the model in a self-supervised way, i.e. reconstructing the observations x^{(j)}_t for t ∈ 𝒯^{(j)} after the INR has been conditioned by the Ridge regressor on the same set of observations, but we discovered failure cases for τ ≤ 0.20. To stay faithful to the original supervised training of DeepTime, we therefore randomly mask out 50% of the observations, use them as context for the Ridge regressor, and infer the other 50% (the targets) to train the INR.

We provide a qualitative comparison of the model's performance under these two training procedures in Figure 7. The model resulting from self-supervised training perfectly fits the observations but completely misses the important patterns of the series. On the other hand, when DeepTime is trained to infer target values from context observations, it captures the general trends. We believe that in the low subsampling regime (τ ≤ 0.20), the Ridge regressor fits all the observations very easily, which hinders the training of the INR basis.
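The sketch below illustrates the context/target split used for this supervised adaptation; the 50/50 random split follows the description above, while the rest of the pipeline (Ridge regressor, INR basis) is only suggested.

```python
import numpy as np

rng = np.random.default_rng(0)
look_back, tau = 512, 0.2
t_obs = np.sort(rng.choice(look_back, size=int(tau * look_back), replace=False))

# Randomly split the observed points: half condition the Ridge regressor (context),
# the other half provides the supervision targets used to train the INR basis.
is_context = rng.random(t_obs.shape[0]) < 0.5
t_context, t_target = t_obs[is_context], t_obs[~is_context]
print(t_context.shape, t_target.shape)      # roughly 51 context and 51 target points
```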

Figure 7: Electricity dataset. Self-supervised DeepTime imputation (blue line) and supervised DeepTime imputation (black line) with 5% of known points (red points) on the first eight days of samples 11 (top) and 29 (bottom).

D.4 Imputation against non deep learning methods

Setting

In addition to the deep learning methods presented in Table 1, we evaluate TimeFlow against two classic machine learning baselines, K-Nearest Neighbours (KNN) and linear interpolation, which give a useful idea of the difficulty of the problem.
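Both baselines can be sketched as follows, assuming a matrix with one row per sample and NaNs at the unobserved time steps; the exact implementation behind Table 19 may differ.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
x = rng.standard_normal((137, 200))                 # toy (samples, time steps) matrix
x_missing = x.copy()
x_missing[rng.random(x.shape) > 0.3] = np.nan       # keep ~30% of the points (tau = 0.3)

# Linear interpolation: purely univariate, along the time axis of each sample.
def linear_interp(row):
    t = np.arange(row.size)
    obs = ~np.isnan(row)
    return np.interp(t, t[obs], row[obs])

x_linear = np.stack([linear_interp(row) for row in x_missing])

# KNN imputation: borrows information across samples (here the 3 nearest samples,
# compared with a NaN-aware Euclidean distance).
x_knn = KNNImputer(n_neighbors=3).fit_transform(x_missing)
print(x_linear.shape, x_knn.shape)                  # (137, 200) (137, 200)
```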

Table 19: Mean MAE imputation results on the missing grid only, over five different time windows. τ stands for the subsampling rate. Bold results are best, underlined results are second best.
τ𝜏\tauitalic_τ TimeFlow Linear interpolation KNN (k=3)
0.05 0.324 ±plus-or-minus\pm± 0.013 0.828 ±plus-or-minus\pm± 0.045 0.531 ±plus-or-minus\pm± 0.033
0.10 0.250 ±plus-or-minus\pm± 0.010 0.716 ±plus-or-minus\pm± 0.039 0.416 ±plus-or-minus\pm± 0.020
Electricity 0.20 0.225 ±plus-or-minus\pm± 0.008 0.518 ±plus-or-minus\pm± 0.029 0.363 ±plus-or-minus\pm± 0.019
0.30 0.212 ±plus-or-minus\pm± 0.007 0.396 ±plus-or-minus\pm± 0.022 0.342 ±plus-or-minus\pm± 0.017
0.50 0.194 ±plus-or-minus\pm± 0.007 0.275 ±plus-or-minus\pm± 0.015 0.323 ±plus-or-minus\pm± 0.016
0.05 0.095 ±plus-or-minus\pm± 0.015 0.339 ±plus-or-minus\pm± 0.031 0.151 ±plus-or-minus\pm± 0.017
0.10 0.083 ±plus-or-minus\pm± 0.015 0.170 ±plus-or-minus\pm± 0.014 0.128 ±plus-or-minus\pm± 0.017
Solar 0.20 0.072 ±plus-or-minus\pm± 0.015 0.088 ±plus-or-minus\pm± 0.010 0.110 ±plus-or-minus\pm± 0.016
0.30 0.061 ±plus-or-minus\pm± 0.012 0.063 ±plus-or-minus\pm± 0.009 0.103 ±plus-or-minus\pm± 0.017
0.50 0.054 ±plus-or-minus\pm± 0.013 0.044 ±plus-or-minus\pm± 0.008 0.096 ±plus-or-minus\pm± 0.016
0.05 0.283 ±plus-or-minus\pm± 0.016 0.813 ±plus-or-minus\pm± 0.027 0.387 ±plus-or-minus\pm± 0.014
0.10 0.211 ±plus-or-minus\pm± 0.012 0.701 ±plus-or-minus\pm± 0.026 0.293 ±plus-or-minus\pm± 0.012
Traffic 0.20 0.168 ±plus-or-minus\pm± 0.006 0.508 ±plus-or-minus\pm± 0.022 0.249 ±plus-or-minus\pm± 0.010
0.30 0.151 ±plus-or-minus\pm± 0.007 0.387 ±plus-or-minus\pm± 0.018 0.228 ±plus-or-minus\pm± 0.009
0.50 0.139 ±plus-or-minus\pm± 0.007 0.263 ±plus-or-minus\pm± 0.013 0.204 ±plus-or-minus\pm± 0.009
TimeFlow improvement / 49.06 %percent\%% 35.95 %percent\%%
Results

KNN imputation uses information from other samples and gives satisfactory results at all sampling rates. In contrast, the purely univariate linear interpolation struggles at low sampling rates but performs well at high ones. TimeFlow outperforms both baselines by a large margin.

Appendix E Forecasting experiments

E.1 Distinction between adjacent time windows and new time windows during inference

In Section 4.2, we presented the forecasting results for periods outside the training period. These periods fall into two types: adjacent to or disjoint from the training period. Figure 8 illustrates these distinct test periods for the Electricity dataset. The same principle applies to the Traffic and SolarH datasets, with one notable difference: they contain fewer time steps, so the number of test periods is smaller than for the Electricity dataset.

In Table 2, we reported the results without distinguishing between the two types of test periods, adjacent to and disjoint from the training window. Here, we differentiate the results for these two types of windows and highlight their significant impact on the Informer and AutoFormer results. Specifically, Table 20 shows the results for the test periods adjacent to the training window, while Table 21 shows the results for the test periods disjoint from the training window.

Figure 8: Distinction between adjacent time windows and new time windows during inference for the Electricity dataset.
Results

TimeFlow, PatchTST, DLinear and DeepTime maintain consistent forecasting results whether tested on the period adjacent to the training period or on a disjoint period. However, AutoFormer and Informer show a significant drop in performance when tested on new disjoint periods.

Table 20: Mean MAE forecast results for adjacent time windows. H stands for the horizon. Bold results are best, underlined results are second best. TimeFlow, DeepTime, and Neural Process are continuous methods; Patch-TST, DLinear, AutoFormer, and Informer are discrete methods.

Dataset       H    TimeFlow        DeepTime        Neural Process   Patch-TST       DLinear         AutoFormer      Informer
Electricity   96   0.218 ± 0.017   0.240 ± 0.027   0.392 ± 0.045    0.214 ± 0.020   0.236 ± 0.035   0.310 ± 0.031   0.293 ± 0.0184
              192  0.238 ± 0.012   0.251 ± 0.023   0.401 ± 0.046    0.225 ± 0.017   0.248 ± 0.032   0.322 ± 0.046   0.336 ± 0.032
              336  0.265 ± 0.036   0.290 ± 0.034   0.434 ± 0.075    0.242 ± 0.024   0.284 ± 0.043   0.330 ± 0.019   0.405 ± 0.044
              720  0.318 ± 0.073   0.356 ± 0.060   0.605 ± 0.149    0.291 ± 0.040   0.370 ± 0.086   0.456 ± 0.052   0.489 ± 0.072
SolarH        96   0.172 ± 0.017   0.197 ± 0.002   0.221 ± 0.048    0.232 ± 0.008   0.204 ± 0.002   0.261 ± 0.053   0.273 ± 0.023
              192  0.198 ± 0.010   0.202 ± 0.014   0.244 ± 0.048    0.231 ± 0.027   0.211 ± 0.012   0.312 ± 0.085   0.256 ± 0.026
              336  0.207 ± 0.019   0.200 ± 0.012   0.241 ± 0.005    0.254 ± 0.048   0.212 ± 0.019   0.341 ± 0.107   0.287 ± 0.006
              720  0.215 ± 0.016   0.240 ± 0.011   0.403 ± 0.147    0.271 ± 0.036   0.246 ± 0.015   0.368 ± 0.006   0.341 ± 0.049
Traffic       96   0.216 ± 0.033   0.229 ± 0.032   0.283 ± 0.028    0.201 ± 0.031   0.225 ± 0.034   0.299 ± 0.080   0.324 ± 0.113
              192  0.208 ± 0.021   0.220 ± 0.020   0.292 ± 0.023    0.195 ± 0.024   0.215 ± 0.022   0.320 ± 0.036   0.321 ± 0.052
              336  0.237 ± 0.040   0.247 ± 0.033   0.305 ± 0.039    0.220 ± 0.036   0.244 ± 0.035   0.450 ± 0.127   0.394 ± 0.066
              720  0.266 ± 0.048   0.290 ± 0.045   0.339 ± 0.037    0.268 ± 0.050   0.290 ± 0.047   0.630 ± 0.043   0.441 ± 0.055
TimeFlow improvement    /          6.56 %          30.79 %          2.64 %          7.30 %          35.43 %         33.07 %
Table 21: Mean MAE forecast results for new time windows. H stands for the horizon. Bold results are best, underlined results are second best. TimeFlow, DeepTime, and Neural Process are continuous methods; Patch-TST, DLinear, AutoFormer, and Informer are discrete methods.

Dataset       H    TimeFlow        DeepTime        Neural Process    Patch-TST       DLinear         AutoFormer      Informer
Electricity   96   0.230 ± 0.012   0.245 ± 0.026   0.392 ± 0.045     0.222 ± 0.023   0.240 ± 0.025   0.606 ± 0.281   0.605 ± 0.227
              192  0.246 ± 0.025   0.252 ± 0.018   0.401 ± 0.046     0.231 ± 0.020   0.257 ± 0.027   0.545 ± 0.186   0.776 ± 0.257
              336  0.271 ± 0.029   0.285 ± 0.034   0.434 ± 0.076     0.253 ± 0.027   0.298 ± 0.051   0.571 ± 0.181   0.823 ± 0.241
              720  0.316 ± 0.051   0.359 ± 0.048   0.607 ± 0.15      0.299 ± 0.038   0.373 ± 0.075   0.674 ± 0.245   0.811 ± 0.257
SolarH        96   0.208 ± 0.005   0.206 ± 0.026   0.221 ± 0.048     0.293 ± 0.089   0.212 ± 0.019   0.228 ± 0.027   0.234 ± 0.011
              192  0.206 ± 0.012   0.207 ± 0.037   0.244 ± 0.048     0.274 ± 0.060   0.223 ± 0.029   0.356 ± 0.122   0.280 ± 0.033
              336  0.211 ± 0.005   0.199 ± 0.035   0.240 ± 0.006     0.264 ± 0.088   0.223 ± 0.032   0.327 ± 0.029   0.366 ± 0.039
              720  0.222 ± 0.020   0.217 ± 0.028   0.403 ± 0.147     0.262 ± 0.083   0.251 ± 0.047   0.335 ± 0.075   0.333 ± 0.012
Traffic       96   0.218 ± 0.042   0.229 ± 0.032   0.283 ± 0.0275    0.204 ± 0.039   0.229 ± 0.032   0.326 ± 0.049   0.388 ± 0.055
              192  0.213 ± 0.028   0.220 ± 0.023   0.292 ± 0.0236    0.198 ± 0.031   0.223 ± 0.023   0.575 ± 0.254   0.381 ± 0.049
              336  0.239 ± 0.035   0.244 ± 0.040   0.305 ± 0.0392    0.223 ± 0.040   0.252 ± 0.042   0.598 ± 0.286   0.448 ± 0.055
              720  0.280 ± 0.047   0.290 ± 0.055   0.339 ± 0.0375    0.270 ± 0.059   0.304 ± 0.061   0.641 ± 0.072   0.468 ± 0.064
TimeFlow improvement    /          2.50 %          27.75 %           3.41 %          6.80 %          46.26 %         45.53 %

E.2 Plots comparison: TimeFlow vs PatchTST

Table 2 demonstrates the similar forecasting performance of TimeFlow and PatchTST across all horizons. To visually represent their predictions, the figures below showcase the forecasted outcomes of these methods for two samples (24 and 38) and two horizons (96 and 192) on the Electricity, SolarH, and Traffic datasets.

Figure 9: Qualitative comparisons of TimeFlow vs PatchTST on the Electricity dataset for new time windows.
Figure 10: Qualitative comparisons of TimeFlow vs PatchTST on the SolarH dataset for new time windows.
Figure 11: Qualitative comparisons of TimeFlow vs PatchTST on the Traffic dataset for new time windows.
Results

The visual analysis of the figures above reveals that the predictions of TimeFlow and PatchTST are remarkably similar. For instance, when examining sample 24 and horizon 192 of the Traffic dataset, both forecasters exhibit similar error patterns. The only noticeable distinction emerges in the SolarH dataset, where PatchTST tends to overestimate certain peaks.

E.3 Baseline details

E.3.1 Baselines training and hyperparameters

We provide a detailed breakdown of the hyperparameters and the training procedure for the forecasting baselines. We tested each method under a range of configurations to ensure it was well suited to the characteristics of the datasets and tasks at hand. For DLinear and the transformer baselines, including PatchTST, AutoFormer, and Informer, we used the implementations provided with the PatchTST baselines (code) and followed the recommended practices for our tasks. Notably, our implementation of PatchTST is combined with RevIN, which enhances the robustness of the results. For DeepTime, we followed the recommended hyperparameters: 5 layers of width 256 and 4096 Fourier features spanning a diverse set of scales. As for the Neural Process, the standard model did not train as expected, so we customized its architecture to allow a fair comparison with TimeFlow: we used the INR and hypernetwork from TimeFlow to align the Neural Process with our temporal frequency bias and shift modulation technique. We also searched carefully for the optimal hyperparameters, such as the Kullback-Leibler (KL) divergence weight and the learning rate, and extended the training duration to ensure thorough convergence.

E.3.2 Models complexity

In this section, we report the parameter counts and inference times of the main forecasting baselines. Except for TimeFlow and DeepTime, the number of parameters varies with the number of samples, the look-back window, and the horizon. We therefore report parameter counts for two specific configurations with a fixed dataset, look-back window, and horizon. Table 22 shows that for PatchTST and DLinear, the number of parameters grows with the horizon. Table 23 shows that the computational time of all methods increases with the horizon, which is expected. Moreover, TimeFlow is slower than the baselines that only require forward computations. Still, on the Electricity dataset, it can infer a horizon of 720 values for 321 samples, with a look-back window of 512 timestamps, in less than 0.2 s, which is not prohibitive for many real-world usages. This efficiency is mainly due to the small number of gradient steps required at inference.

Table 22: Number of parameters of the main baselines on the forecasting task on the Electricity dataset for horizons 96 and 720. The look-back window size is 512.

Horizon   TimeFlow   DeepTime   Neural Process   Patch-TST   DLinear   Informer   Autoformer
96        602k       1 315k     480k             1 194k      98k       984k       1 005k
720       602k       1 315k     480k             6 306k      739k      984k       1 005k
Table 23: Inference time (in seconds) for the forecasting task on the Electricity dataset with horizons 96 and 720 and a look-back window of length 512. The statistics are computed over 10 runs using an NVIDIA TITAN RTX GPU.

Horizon   TimeFlow        Patch-TST       DLinear         DeepTime        AutoFormer      Informer
96        0.147 ± 0.007   0.016 ± 0.002   0.007 ± 0.003   0.006 ± 0.002   0.027 ± 0.001   0.0191 ± 0.002
720       0.176 ± 0.009   0.020 ± 0.001   0.009 ± 0.001   0.010 ± 0.002   0.034 ± 0.001   0.0251 ± 0.002

E.4 Sparsely observed look-back window: comparison with Patch-TST

Setting and baseline.

We consider a setting where, at inference time, the look-back window is sparsely observed. Models such as PatchTST must then proceed in two steps: (i) complete the look-back window on a dense regular grid using imputation; (ii) apply the model to the completed window to predict the future. We compare TimeFlow with the following two-step baseline: linear interpolation to handle the missing values within the partially observed look-back window, followed by PatchTST for the forecasting task. We conducted experiments on the Traffic and Electricity datasets, focusing on the 96 and 192 horizons. Table 24 presents the results for different sampling rates τ ∈ {0.5, 0.2, 0.1} within the look-back window.

Table 24: MAE results for forecasting on new samples and a new period with missing values in the look-back window. Best results are in bold.

                          TimeFlow                             Linear interpolation + PatchTST
Dataset       H    τ      Imputation err.    Forecast err.     Imputation err.    Forecast err.
Electricity   96   1.0    0.000 ± 0.000      0.228 ± 0.028     0.000 ± 0.000      0.221 ± 0.023
                   0.5    0.151 ± 0.003      0.239 ± 0.013     0.257 ± 0.008      0.279 ± 0.026
                   0.2    0.208 ± 0.006      0.260 ± 0.015     0.482 ± 0.019      0.451 ± 0.042
                   0.1    0.272 ± 0.006      0.295 ± 0.016     0.663 ± 0.029      0.634 ± 0.053
              192  1.0    0.000 ± 0.000      0.238 ± 0.020     0.000 ± 0.000      0.229 ± 0.020
                   0.5    0.149 ± 0.004      0.235 ± 0.011     0.258 ± 0.006      0.280 ± 0.032
                   0.2    0.209 ± 0.006      0.257 ± 0.013     0.481 ± 0.021      0.450 ± 0.054
                   0.1    0.274 ± 0.010      0.289 ± 0.016     0.669 ± 0.030      0.650 ± 0.060
Traffic       96   1.0    0.000 ± 0.000      0.217 ± 0.032     0.000 ± 0.000      0.203 ± 0.037
                   0.5    0.219 ± 0.017      0.224 ± 0.033     0.276 ± 0.012      0.255 ± 0.041
                   0.2    0.278 ± 0.017      0.252 ± 0.029     0.532 ± 0.017      0.483 ± 0.040
                   0.1    0.418 ± 0.019      0.382 ± 0.014     0.738 ± 0.023      0.721 ± 0.073
              192  1.0    0.000 ± 0.000      0.212 ± 0.028     0.000 ± 0.000      0.197 ± 0.030
                   0.5    0.176 ± 0.014      0.217 ± 0.017     0.276 ± 0.011      0.245 ± 0.029
                   0.2    0.233 ± 0.017      0.236 ± 0.021     0.532 ± 0.020      0.480 ± 0.050
                   0.1    0.304 ± 0.019      0.277 ± 0.021     0.734 ± 0.022      0.787 ± 0.172
Results.

Although PatchTST performs slightly better with a dense look-back window, its performance deteriorates significantly as τ decreases. In contrast, TimeFlow's performance is only marginally affected by the reduction in the sampling rate.
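As an illustration, here is a minimal sketch of the two-step baseline described above, assuming a pretrained forecaster is available behind a generic callable forecast_model(dense_lookback, horizon_len); this wrapper and all variable names are illustrative and do not correspond to the actual PatchTST interface.

```python
import numpy as np

def two_step_forecast(t_lookback, y_lookback, obs_mask, horizon_len, forecast_model):
    """Two-step baseline for a sparsely observed look-back window:
    (i) fill the missing look-back values by linear interpolation,
    (ii) feed the densified window to a discrete-grid forecaster (e.g. PatchTST)."""
    # (i) impute the look-back window on its dense regular grid
    dense_lookback = np.interp(t_lookback, t_lookback[obs_mask], y_lookback[obs_mask])
    # (ii) forecast the horizon from the completed window
    return dense_lookback, forecast_model(dense_lookback, horizon_len)
```

TimeFlow, in contrast, conditions its code directly on the sparse observations and then queries the INR at any timestamp of the horizon, so no intermediate imputation step is needed, which is why its forecast error degrades much more gracefully as τ decreases.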

E.5 Influence of the look-back window for forecasting

Figure 12 shows that both excessively short and overly long look-back windows can harm TimeFlow's forecasting performance. More precisely, performance improves with the look-back window size up to a certain point, after which it slowly degrades.

Figure 12: MAE forecast error per look-back window length for the Electricity dataset (the horizon length is 336). The model is trained on a given time window and tested on four new time windows.

E.6 Influence of the horizon length for forecasting

Figure 13 shows that performance degrades as the horizon length increases. This is expected, since longer horizons make the task harder.

Figure 13: MAE forecast error per horizon length for the Electricity dataset (the look-back window length is 512). The model is trained on a given time window and tested on four new time windows.

Appendix F Discussion on using frequency embedding as input to regression models

F.1 Related work

Several models for univariate time series forecasting have explored frequency embeddings of timestamps as regression inputs. Taylor and Letham [2018] modeled the time series as a continuous function of time using a generalized additive model [Hastie, 2017]; they represented seasonality components as learnable Fourier series while explicitly specifying the ground-truth seasonalities (e.g., weekly, monthly). Similarly, Hyndman and Athanasopoulos [2018] proposed embedding the timestamp t with the ground-truth frequencies and applying a regression model to predict the series value at time t. Both methods rely on the explicit specification of seasonalities and are tailored to purely univariate time series, where information is not shared between samples.

In contrast, other models, such as TimeFlow or DeepTime [Woo et al., 2022], based on deep learning techniques, offer more flexibility. These approaches can autonomously learn relevant frequencies and effectively share information between samples through backpropagation. This enables a more dynamic and adaptable approach to time series forecasting, particularly in scenarios with complex temporal patterns and inter-sample dependencies.

F.2 Experiments

Given the seasonal patterns observed in the Solar, Electricity, and Traffic datasets, an alternative to deep learning forecasting methods is to individually regress a timestamp embedding on the corresponding value using a robust regressor. This approach exploits the inherent periodicity in the data, using timestamp embeddings to capture temporal dependencies and accurately predict the target values.

Baselines.

We compare TimeFlow against two regression baselines:

  • TimeFlow frequency embedding + XGBoost: this baseline uses the same frequency embedding as TimeFlow and applies an XGBoost regressor [Chen and Guestrin, 2016] on top. The aim is to assess whether the XGBoost regressor can identify the correct frequencies, filter out irrelevant ones, and establish the appropriate mapping between timestamps and values.

  • Explicit seasonal encoding + XGBoost: in this baseline, we explicitly give the model the ground-truth frequencies of each dataset. It is important to highlight that this method uses information that the other baselines do not have. For instance, for the Traffic dataset, the explicit seasonal encoding is γ(t) = (t, cos(2πt/24), sin(2πt/24), cos(2πt/(24×7)), sin(2πt/(24×7))), which explicitly integrates the trend, the daily frequency, and the weekly frequency. We apply the same type of frequency embedding to Solar and Electricity with the appropriate seasonalities (a minimal sketch of this encoding is given below).
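The sketch below shows one way to build this explicit seasonal encoding for hourly data such as Traffic; the function name and the way the periods are passed as arguments are our own illustrative choices.

```python
import numpy as np

def explicit_seasonal_encoding(t, periods=(24, 24 * 7)):
    """gamma(t) = (t, cos(2*pi*t/P), sin(2*pi*t/P)) for each period P.
    For hourly data, periods (24, 168) encode the daily and weekly seasonalities."""
    t = np.asarray(t, dtype=float)
    features = [t]
    for p in periods:
        features.append(np.cos(2 * np.pi * t / p))
        features.append(np.sin(2 * np.pi * t / p))
    return np.stack(features, axis=-1)   # shape (len(t), 1 + 2 * len(periods))

# Example: embed one week of hourly timestamps.
gamma = explicit_seasonal_encoding(np.arange(24 * 7))
```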

Experimental Setup.

For each dataset and each sample, we apply the frequency embedding to the observed timestamps (the look-back window in forecasting and the observed grid in imputation) and train an XGBoost regressor on them. This approach results in one model per sample. The XGBoost regressor is configured with the following hyperparameters:

  • n estimators: 500

  • max depth: 4

  • learning rate: 0.1

  • lambda: [0.1, 1, 10]

The optimal regularization parameter lambda is determined through cross-validation. The imputation and forecasting results are presented in detail in Table 25 and Table 26, respectively.
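Below is a minimal sketch of how one such per-sample regressor can be fitted with the hyperparameters above, cross-validating the regularization strength; the helper names and the use of scikit-learn's GridSearchCV are our own illustrative choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

def fit_per_sample_regressor(t_obs, y_obs, encode):
    """Fit one XGBoost regressor for a single sample on its observed timestamps,
    selecting the regularization parameter lambda by cross-validation."""
    X = encode(np.asarray(t_obs))          # e.g. explicit_seasonal_encoding above
    base = XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.1)
    search = GridSearchCV(base, {"reg_lambda": [0.1, 1, 10]},
                          cv=3, scoring="neg_mean_absolute_error")
    search.fit(X, np.asarray(y_obs))
    return search.best_estimator_

# Imputation or forecasting then reduces to predicting at the missing or future timestamps:
# model = fit_per_sample_regressor(t_obs, y_obs, explicit_seasonal_encoding)
# y_hat = model.predict(explicit_seasonal_encoding(t_query))
```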

Table 25: Mean MAE imputation results on the missing grid only. τ stands for the subsampling rate, i.e. the proportion of observed points for each sample. Bold results are best, underlined results are second best.

Dataset       τ     TimeFlow        TimeFlow frequency embedding + XGBoost   Explicit seasonal encoding + XGBoost
Electricity   0.05  0.324 ± 0.013   0.834 ± 0.092                            0.365 ± 0.051
              0.10  0.250 ± 0.010   0.761 ± 0.074                            0.318 ± 0.049
              0.20  0.225 ± 0.008   0.632 ± 0.066                            0.278 ± 0.044
              0.30  0.212 ± 0.007   0.536 ± 0.041                            0.259 ± 0.048
              0.50  0.194 ± 0.007   0.418 ± 0.042                            0.238 ± 0.022
Solar         0.05  0.095 ± 0.015   0.603 ± 0.035                            0.234 ± 0.021
              0.10  0.083 ± 0.015   0.478 ± 0.024                            0.190 ± 0.022
              0.20  0.072 ± 0.015   0.350 ± 0.022                            0.150 ± 0.019
              0.30  0.061 ± 0.012   0.286 ± 0.018                            0.134 ± 0.011
              0.50  0.054 ± 0.013   0.227 ± 0.015                            0.123 ± 0.015
Traffic       0.05  0.283 ± 0.016   0.739 ± 0.140                            0.344 ± 0.036
              0.10  0.211 ± 0.012   0.676 ± 0.129                            0.290 ± 0.029
              0.20  0.168 ± 0.006   0.562 ± 0.108                            0.245 ± 0.027
              0.30  0.151 ± 0.007   0.487 ± 0.095                            0.223 ± 0.015
              0.50  0.139 ± 0.007   0.393 ± 0.083                            0.198 ± 0.021
TimeFlow improvement      /         69.5 %                                   33.7 %
Imputation results.

TimeFlow performs better than the other two methods. Although the second baseline explicitly incorporates ground truth frequencies and provides decent results, its inability to share information between samples leads to the loss of valuable insights that TimeFlow effectively exploits. In addition, the first baseline struggles to learn the correct frequencies and overfits observed data points, resulting in excessive high-frequency noise. As a result, its performance degrades significantly compared to the second baseline, where frequencies are explicitly provided. These findings underscore TimeFlow’s ability to identify the underlying frequencies, and leverage shared information across samples to improve accuracy during imputation.

Table 26: Mean MAE forecast results for adjacent time windows, averaged over different time windows. Each time, the model is trained on one time window and tested on the others (there are 2 windows for SolarH and 5 for Electricity and Traffic). H stands for the horizon. Bold results are best, underlined results are second best.

Dataset       H    TimeFlow        TimeFlow frequency embedding + XGBoost   Explicit seasonal encoding + XGBoost
Electricity   96   0.218 ± 0.017   0.662 ± 0.102                            0.282 ± 0.020
              192  0.238 ± 0.012   0.750 ± 0.128                            0.279 ± 0.021
              336  0.265 ± 0.036   0.809 ± 0.136                            0.294 ± 0.041
              720  0.318 ± 0.073   0.852 ± 0.144                            0.357 ± 0.092
SolarH        96   0.172 ± 0.017   0.792 ± 0.062                            0.244 ± 0.023
              192  0.198 ± 0.010   0.933 ± 0.055                            0.236 ± 0.018
              336  0.207 ± 0.019   1.033 ± 0.052                            0.229 ± 0.022
              720  0.215 ± 0.016   1.116 ± 0.057                            0.262 ± 0.021
Traffic       96   0.216 ± 0.033   0.655 ± 0.156                            0.288 ± 0.052
              192  0.208 ± 0.021   0.678 ± 0.139                            0.246 ± 0.033
              336  0.237 ± 0.040   0.719 ± 0.143                            0.262 ± 0.044
              720  0.266 ± 0.048   0.741 ± 0.140                            0.288 ± 0.063
TimeFlow improvement     /         70.8 %                                   15.7 %
Forecasting results.

TimeFlow also outperforms the two baselines in forecasting. The improvement over the second baseline, where the correct frequencies are explicitly provided, is more modest than the gains observed in imputation, but it remains significant (exceeding 15%). As in the imputation scenario, the XGBoost baseline built on the TimeFlow timestamp encoding fails to discern the correct frequencies and introduces excessive high-frequency components.

Appendix G Latent space exploration

G.1 Latent space interpolation between two learned codes

It is interesting to understand how the latent space behaves between two learned codes z_1 and z_2, which are the representations of x_1 and x_2. In Figure 14, we visualize how new codes z_λ = λ z_1 + (1 − λ) z_2 are decoded in the time-series domain, i.e., the values f_{θ, h_w(z_λ)}(t).

Setting.

We choose two codes z_1 and z_2 learned from the Electricity dataset and interpolate between them for λ ∈ {0.0, 0.1, 0.25, 0.50, 0.75, 0.9, 1.0}.
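A minimal sketch of this experiment, assuming the two fitted codes are available as tensors z1 and z2 and that decode(z, t) wraps the modulated INR f_{θ, h_w(z)}(t); both names are illustrative.

```python
import torch

def interpolate_codes(z1, z2, decode, t,
                      lambdas=(0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0)):
    """Decode convex combinations z_lambda = lambda * z1 + (1 - lambda) * z2
    back into the time-series domain."""
    curves = []
    with torch.no_grad():
        for lam in lambdas:
            z_lam = lam * z1 + (1.0 - lam) * z2
            curves.append(decode(z_lam, t))
    return torch.stack(curves)   # shape (len(lambdas), len(t))
```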

Figure 14: Visualization of the reconstructed time series for different linear interpolations of the two codes z_1 and z_2 learned from the Electricity dataset.
Results.

In Figure 14, we observe that the interpolation path between the two codes yields a smooth transition in the time-series domain. This suggests that the latent space is smooth and well structured, and that the learned representations capture meaningful features of the time series, which could explain TimeFlow's generalization ability.

G.2 TimeFlow sensitivity to modulations perturbation

In the preceding section, we observed that the latent space is smooth. A crucial question arises: can we interpret each dimension of the latent space independently?

Setting

In this setting, we perturb a specific dimension of the modulation, by adding Gaussian noise, for a single layer of the INR on the Electricity dataset. We then observe the difference in the time domain between the unperturbed and the perturbed TimeFlow. For instance, in Figure 15, we add noise only to the 50th channel of the third layer of the INR.
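A sketch of this perturbation probe, assuming the shift modulations produced by the hypernetwork for one sample are available as a list of per-layer tensors shifts; the function and variable names are illustrative.

```python
import torch

def perturb_shift(shifts, layer, channel, sigma=0.1):
    """Return a copy of the per-layer shift modulations in which Gaussian noise
    is added to a single channel of a single layer, all else unchanged."""
    perturbed = [s.detach().clone() for s in shifts]
    perturbed[layer][channel] += sigma * torch.randn(())
    return perturbed

# Example (cf. Figure 15): perturb the 50th channel of the third layer, then decode
# the series with the original and the perturbed modulations and compare them.
# shifts_pert = perturb_shift(shifts, layer=2, channel=49)
```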

Figure 15: Effect of adding a small perturbation to the modulation shift of the third layer and 50th channel.
Figure 16: Effect of adding a small perturbation to the modulation shift of the third layer and 51st channel.
Figure 17: Effect of adding a small perturbation to the modulation shift of the fourth layer and 50th channel.
Results.

In Figure 15, we observe that the perturbation adds a smooth daily-frequency pattern. In Figure 16, the perturbation induces a bias that affects the high frequencies but not the low frequencies. Finally, in Figure 17, the perturbation induces only a very local and slight bias (the effect is almost null). In conclusion, the impact of a small perturbation depends on the channel and the layer, but it is hard to interpret each dimension independently.

G.3 Visualization of two code distributions in the latent space

Examining the behavior of the latent space at the instance level is of particular interest. It provides insights into how individual time series evolve within the latent space. However, exploring the latent space between two time series distributions is also crucial.

Setting.

We encode all 321 samples of the Electricity dataset over two distinct time periods (each period spans about 25 days, i.e., ≈ 600 timestamps). This yields two distributions of latent codes, each representing a different temporal support. We then apply Principal Component Analysis (PCA) to visualize these two code distributions in a two-dimensional space, as illustrated in Figure 18. This visualization lets us explore the structural differences, similarities, and temporal variations of the latent representations across the specified time intervals. In Figure 18(a), the two compared time periods are separated by approximately 3 months (≈ 2000 timestamps); in Figure 18(b), they are separated by approximately 6 months (≈ 4000 timestamps).
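A minimal sketch of this visualization, assuming the codes of the two periods have already been collected into arrays codes_p1 and codes_p2 of shape (321, latent_dim); the variable names are ours.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_code_distributions(codes_p1, codes_p2):
    """Project two sets of latent codes onto the first two PCA axes
    (PCA fitted on the union of both periods) and plot them."""
    pca = PCA(n_components=2).fit(np.concatenate([codes_p1, codes_p2], axis=0))
    p1, p2 = pca.transform(codes_p1), pca.transform(codes_p2)
    plt.scatter(p1[:, 0], p1[:, 1], alpha=0.6, label="period 1")
    plt.scatter(p2[:, 0], p2[:, 1], alpha=0.6, label="period 2")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```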

(a) The temporal shift between the two code distributions is 3 months.
(b) The temporal shift between the two code distributions is 6 months.
Figure 18: Visualization of the first two PCA axes for two distributions of latent codes (temporal distribution shift).
Results.

Figure 18(a) shows that when the temporal periods are not too far from each other, the distributions of codes can largely overlap in the 2D visualization from the PCA. Conversely, as illustrated in Figure 18(b), when the temporal periods are far from each other, the distributions of codes between the two periods become more distinct in the 2D visualization. This observation suggests that the proximity or disparity in temporal distribution shift influences the separability of latent representations in the 2D PCA space. However, this presumed separability in the latent space does not seem to significantly impact the generalization performance of TimeFlow across time, as evidenced by the results presented in Table 2 and Table 21. This suggests that our INR can handle relatively diverse code distributions.

Appendix H The intuition behind the meta-learning optimization in time series forecasting

The concept of inner and outer loops should be understood within the broader framework of model-agnostic meta-learning [Finn et al., 2017], which aims to enable rapid adaptation of a model to unseen tasks. In TimeFlow, we adapt this general idea to our tasks. We propose an efficient way to achieve this by splitting the parameters into two parts: context parameters, learned in the inner loop and responsible for the adaptive part of the model, and meta-parameters (parameters shared across tasks), learned in the outer loop and responsible for the generic part of the model.

In the context of time series forecasting.

TimeFlow aims to have one subset of parameters that adapts to specific contextual factors (e.g., the look-back window of a particular sample) and another subset that performs the forecasting task according to this learned context (e.g., predicting any point within the forecast horizon as well as within the look-back window). To achieve this, we adjust the codes z^(j) exclusively on the contexts (e.g., the look-back window) for each sample j, while the parameters θ and w, shared between samples, characterize a common function capable of forecasting from a given z^(j). This function can be written f_{θ,w}(t, z(x^(j))). In short, z^(j) is adapted per sample j, while θ and w are trained over batches.
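The sketch below illustrates this bi-level optimization for forecasting, assuming the conditioned INR and its hypernetwork are wrapped in a model callable as model(t, z) with a latent_dim attribute and that outer_opt optimizes the shared parameters; the wrapper, the loss bookkeeping, and the hyperparameter values are illustrative rather than the exact TimeFlow implementation.

```python
import torch

def meta_train_step(model, t_look, y_look, t_hor, y_hor,
                    outer_opt, inner_steps=3, inner_lr=1e-2):
    """One meta-training step on a batch of series.
    Inner loop: adapt the per-sample codes z on the look-back window only.
    Outer loop: update the shared parameters (theta, w) so that the INR,
    conditioned on the adapted codes, fits both the look-back window and the horizon."""
    z = torch.zeros(y_look.shape[0], model.latent_dim,
                    device=y_look.device, requires_grad=True)

    # Inner loop: a few gradient steps on the codes, using the context (look-back) only.
    for _ in range(inner_steps):
        ctx_loss = torch.mean((model(t_look, z) - y_look) ** 2)
        (grad_z,) = torch.autograd.grad(ctx_loss, z, create_graph=True)
        z = z - inner_lr * grad_z

    # Outer loop: the shared parameters are trained to reconstruct look-back and horizon
    # given the adapted codes (gradients flow back through the inner updates).
    t_all = torch.cat([t_look, t_hor], dim=-1)
    y_all = torch.cat([y_look, y_hor], dim=-1)
    outer_loss = torch.mean((model(t_all, z) - y_all) ** 2)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
    return outer_loss.item()
```

At inference, only the inner loop is run: the shared parameters stay frozen and a few gradient steps adapt the codes on the observed look-back window before querying the INR over the horizon, which is why inference remains fast despite the optimization-based conditioning.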