On shallow feedforward neural networks with inputs from a topological space

Vugar E. Ismailov


Institute of Mathematics and Mechanics, Baku, Azerbaijan

Center for Mathematics and its Applications, Khazar University, Baku, Azerbaijan

e-mail: [email protected]

Abstract. We study feedforward neural networks with inputs from a topological space (TFNNs). We prove a universal approximation theorem for shallow TFNNs, which demonstrates their capacity to approximate any continuous function defined on this topological space. As an application, we obtain an approximative version of Kolmogorov’s superposition theorem for compact metric spaces.


Mathematics Subject Classifications: 41A30, 41A65, 68T05

Keywords: feedforward neural network, universal approximation theorem, density, topological vector space, Tauber-Wiener function, Kolmogorov’s superposition theorem

1. Introduction

Neural networks are fundamental to contemporary machine learning and artificial intelligence, providing robust methods for tackling intricate challenges. Among the different neural network designs, the multilayer feedforward perceptron (MLP) is particularly prominent and essential. The MLP is valued for its capability to model complex, nonlinear functions and execute various tasks, including classification, regression, and pattern recognition.

This architecture consists of a limited number of sequential layers: an input layer at the beginning, an output layer at the end, and several hidden layers in between. Information progresses from the input layer through the hidden layers to the output layer. In this framework, each neuron in a layer receives inputs from the previous layer, applies specific weights, adds a bias, and then processes the result through an activation function. This activation function introduces non-linearity, allowing the model to learn and capture intricate patterns. The output from one layer’s neurons serves as the input for the neurons in the next layer, continuing this sequence until the final output is generated by the output layer.

The most basic form of an MLP features just one hidden layer. In this setup, each output neuron calculates a function expressed as

\sum_{i=1}^{r} c_i \sigma(\mathbf{w}^i \cdot \mathbf{x} - \theta_i), \qquad (1.1)

where $\mathbf{x}=(x_1,\dots,x_d)$ is the input vector, $r$ is the number of neurons in the hidden layer, $\mathbf{w}^i$ are weight vectors in $\mathbb{R}^d$, $\theta_i$ are thresholds, $c_i$ are coefficients, and $\sigma$ is the activation function, a real univariate function.
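For concreteness, the computation in (1.1) can be sketched in a few lines of NumPy. The logistic sigmoid and the random parameter values below are arbitrary illustrative choices, not part of the paper:

```python
import numpy as np

def shallow_mlp(x, W, theta, c, sigma=lambda t: 1.0 / (1.0 + np.exp(-t))):
    """Evaluate the single-hidden-layer network (1.1):
    sum_{i=1}^{r} c_i * sigma(w^i . x - theta_i).
    W has shape (r, d); theta and c have shape (r,)."""
    return float(c @ sigma(W @ x - theta))

# Toy example: d = 3 inputs, r = 4 hidden neurons, random parameters.
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
W = rng.standard_normal((4, 3))
theta = rng.standard_normal(4)
c = rng.standard_normal(4)
y = shallow_mlp(x, W, theta, c)
```

Each row of `W` plays the role of one weight vector $\mathbf{w}^i$, so the matrix-vector product evaluates all $r$ inner products at once.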

The theoretical underpinning of neural networks is rooted in the universal approximation property (UAP), sometimes referred to as the density property. This principle states that a neural network with a single hidden layer can approximate any continuous function over a compact domain to any desired level of precision. Specifically, the set $\mathrm{span}\{\sigma(\mathbf{w}\cdot\mathbf{x}-\theta):\theta\in\mathbb{R},\ \mathbf{w}\in\mathbb{R}^d\}$, which comprises functions of the form (1.1), is dense in $C(K)$ for every compact set $K\subset\mathbb{R}^d$. Here $C(K)$ denotes the space of real-valued continuous functions on $K$. This important result in neural network theory is known as the universal approximation theorem (UAT).

Extensive research has investigated the UAT across various activation functions $\sigma$, examining how different choices influence the approximation capabilities of neural networks. The most general result in this area was obtained by Leshno, Lin, Pinkus and Schocken [14]. They proved that a continuous activation function $\sigma$ possesses the UAP if and only if it is not a polynomial. This result demonstrates the effectiveness of the single hidden layer perceptron model across a wide range of activation functions. It should be noted that the universal approximation theorem in [14] applies to a broader class of activation functions beyond continuous ones, including activation functions that may have discontinuities on sets of Lebesgue measure zero. However, this paper concentrates on continuous activation functions. For a thorough, step-by-step proof of this theorem, see [19, 20].

In the past, it was commonly accepted and highlighted in numerous studies that attaining the universal approximation property necessitates large networks with a substantial number of hidden neurons (see, e.g., [4, Chapter 6.4.1]); in these earlier works, the number of hidden neurons was regarded as unbounded. However, more recent research [5, 6, 7] has demonstrated that neural networks using certain non-explicit but practically computable activation functions can approximate any continuous function over any compact set to any desired accuracy, even with a minimal and fixed number of hidden neurons.

Note that the inner product $\mathbf{w}^i\cdot\mathbf{x}$ in (1.1) is a continuous linear functional on $\mathbb{R}^d$. Conversely, by the Riesz representation theorem, every linear functional on $\mathbb{R}^d$ is of the form $\mathbf{w}\cdot\mathbf{x}$, where $\mathbf{w}\in\mathbb{R}^d$ and $\mathbf{x}=(x_1,\dots,x_d)$ is the variable (see [21, Theorem 13.32]). Continuous linear functionals constitute a significant subclass of $C(\mathbb{R}^d)$, denoted here by $\mathcal{L}(\mathbb{R}^d)$. Thus, the UAT asserts that for certain activation functions $\sigma$ and any compact set $K\subset\mathbb{R}^d$, the set

\mathcal{M}(\sigma) = \mathrm{span}\{\sigma(f(x)-\theta) : f\in\mathcal{L}(\mathbb{R}^d),\ \theta\in\mathbb{R}\}

is dense in $C(K)$. This observation suggests the following generalization of single hidden layer networks from $\mathbb{R}^d$ to an arbitrary topological space $X$, in which $\mathcal{L}(\mathbb{R}^d)$ is replaced with a fixed family of functions (which need not be linear) from $C(X)$. We refer to such a family as a basic family for feedforward neural networks with inputs from a topological space (TFNNs). If $\mathcal{A}(X)\subset C(X)$ is a basic family, then the architecture of a single hidden layer TFNN can be described as follows:

  • Input layer: This layer consists of an element $x\in X$, where $X$ is an arbitrary topological space.

  • Hidden layer: Each neuron in the hidden layer takes the input $x$ from the input layer, applies a function $f\in\mathcal{A}(X)$ to it, multiplies the value $f(x)$ by a weight $w$, subtracts a threshold $\theta$, and applies a fixed activation function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. The resulting value $\sigma(wf(x)-\theta)$ is the output signal of the neuron.

  • Output layer: Each neuron in this layer receives weighted signals from each neuron in the hidden layer, sums them up, and produces the final output value.

This architecture significantly extends traditional feedforward neural networks. When $X=\mathbb{R}^d$ and $\mathcal{A}(X)=\mathcal{L}(\mathbb{R}^d)$, the input $x$ is a $d$-dimensional vector. In this very special case, the input layer contains $d$ traditional neurons, receiving the input signals $x_1,x_2,\dots,x_d\in\mathbb{R}$, respectively. Note that in the architecture above, the element $x\in X$ carries all the information of the input layer. This structure enables the network to accommodate a wide diversity of input types. In general, a single hidden layer TFNN computes a function of the form

\sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i), \qquad (1.2)

where $x\in X$ is the input, $f_i\in\mathcal{A}(X)$ and $c_i,w_i,\theta_i\in\mathbb{R}$ are the parameters of the network, and $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ is a fixed activation function.
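The point of (1.2) is that the input need not be a vector. As a hedged illustration, take the input space to be $X=C[0,1]$ (so an input is itself a function) and let the basic family consist of point evaluations $f_t(x)=x(t)$, which are continuous on $C[0,1]$; all concrete choices below are assumptions made for the sketch:

```python
import numpy as np

def tfnn(x, basic_fns, w, theta, c, sigma=np.tanh):
    """Evaluate (1.2): sum_i c_i * sigma(w_i * f_i(x) - theta_i),
    where each f_i is a continuous real-valued function on X."""
    return sum(ci * sigma(wi * f(x) - ti)
               for f, wi, ti, ci in zip(basic_fns, w, theta, c))

# Hypothetical input space X = C[0,1]: the input is itself a function.
# Basic family: point evaluations f_t(x) = x(t), continuous on C[0,1].
basic_fns = [lambda x, t=t: x(t) for t in (0.0, 0.5, 1.0)]

# Feed the function cos into the network as a single input element x.
y = tfnn(np.cos, basic_fns,
         w=[1.0, -2.0, 0.5], theta=[0.0, 0.1, -0.3], c=[1.0, 1.0, 1.0])
```

Replacing `basic_fns` with inner products against fixed weight vectors recovers the classical case $X=\mathbb{R}^d$.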

The aim of this paper is to show that for a broad class of activation functions $\sigma$, neural networks of the form (1.2) can approximate any continuous function on a compact subset $K\subset X$ with arbitrary precision. In other words, the set

\mathcal{N}(\sigma) = \mathrm{span}\{\sigma(wf(x)-\theta) : f\in\mathcal{A}(X);\ w,\theta\in\mathbb{R}\}

is dense in $C(K)$ for every compact set $K\subset X$. As an application of this result, we derive an approximative version of the Kolmogorov superposition theorem (KST) for compact metric spaces, in which the outer (non-fixed and generally nonsmooth) functions are replaced by a fixed, ultimately smooth function.

It should be noted that the UAP of neural networks operating between Banach spaces has been explored in various studies. For example, in [24], the fundamentality of ridge functions was established in a Banach space and subsequently applied to shallow networks with a sigmoidal activation function (see also [15]). In [2], the authors showed that any continuous nonlinear function mapping a compact set $V$ in a Banach space of continuous functions $C(K_1)$ into $C(K_2)$ can be approximated arbitrarily well by shallow feedforward neural networks. Here $K_1$ and $K_2$ are compact sets in an abstract Banach space $X$ and the Euclidean space $\mathbb{R}^d$, respectively. In [16], this approach was extended to deep neural networks and referred to as DeepONet. In [13], the authors examined DeepONet within an encoder-decoder network framework, investigating its approximation properties when the input space is a Hilbert space. In [11], quantitative estimates (i.e., convergence rates) were provided for the approximation of nonlinear operators by single hidden layer networks in infinite-dimensional Banach spaces, extending some previous results from the finite-dimensional case.

The UAP of infinite-dimensional neural networks, with inputs from Fréchet spaces and outputs from Banach spaces, was established in [1]. In [3], the scope of this architecture was extended by proving several universal approximation theorems for quasi-Polish input and output spaces.

In [12], universal approximation theorems were obtained for neural operators (NOs) and mixtures of neural operators (MoNOs) acting between Sobolev spaces. More precisely, it was shown that any nonlinear continuous operator acting between the Sobolev spaces $H^{s_1}$ and $H^{s_2}$ can be uniformly approximated over any compact set $K\subset H^{s_1}$ with arbitrary accuracy $\varepsilon$ using NOs and MoNOs $H^{s_1}\rightarrow H^{s_2}$. Moreover, the quantitative results of [12] estimate the depth, width, and rank of the neural operators in terms of the radius of $K$ and $\varepsilon$.

Recent research has demonstrated the universal approximation theorem (UAT) for various hypercomplex-valued neural networks, including complex-, quaternion-, tessarine-, and Clifford-valued networks, as well as more general vector-valued neural networks (V-nets) defined over a finite-dimensional algebra (see [25] and references therein). We hope that the results of this paper will stimulate further exploration of these neural networks, particularly with outputs from these and other general spaces.



2. Main results

In this section, we analyze the conditions under which shallow networks with inputs from a topological space possess the universal approximation property.

Assume $X$ is an arbitrary topological space. In the sequel, we equip $C(X)$ with the topology of uniform convergence on compact sets. This topology is induced by the seminorms

\|g\|_K = \max_{x\in K} |g(x)|,

where $K$ runs over the compact sets in $X$. A subbasis at the origin for this topology is given by the sets

U(K,r) = \{g\in C(X) : \|g\|_K < r\},

where $K\subset X$ is compact and $r>0$. A sequence (or net) $\{g_n\}$ converges to $g$ in this topology iff $\|g_n-g\|_K\rightarrow 0$ for every compact set $K\subset X$. Thus, in what follows, when we say that $B$ is dense in $C(X)$, we mean that $B$ is dense with respect to this topology of uniform convergence on compact sets.

We say that a subclass $\mathcal{A}(X)\subset C(X)$ has the $D$-property if the set

S = \mathrm{span}\{u\circ v : u\in C(\mathbb{R}),\ v\in\mathcal{A}(X)\} \qquad (2.1)

is dense in $C(X)$.

In what follows, we use activation functions $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ (whether continuous or discontinuous) with the property that $\mathrm{span}\{\sigma(wx-\theta):w,\theta\in\mathbb{R}\}$ is dense in every $C[a,b]$. Such functions are called Tauber-Wiener (TW) functions (see [2]).
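The TW property can be observed numerically for a continuous nonpolynomial activation such as ReLU: a least-squares fit over a finite dictionary drawn from $\mathrm{span}\{\sigma(wx-\theta)\}$ drives the uniform error on an interval down. The target function, threshold grid, and tolerance below are arbitrary illustrative choices:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# Target: a continuous function on [0, 1], sampled on a dense grid.
x = np.linspace(0.0, 1.0, 400)
g = np.cos(3.0 * x)

# Dictionary from span{sigma(w*x - theta)}: taking w = 0, theta = -1
# gives the constant relu(1) = 1; w = 1 with thresholds theta_j gives
# ReLU "hinges" relu(x - theta_j).
thetas = np.linspace(0.0, 1.0, 50, endpoint=False)
A = np.column_stack([np.ones_like(x)] +
                    [relu(x - t) for t in thetas])

coef, *_ = np.linalg.lstsq(A, g, rcond=None)
err = np.max(np.abs(A @ coef - g))   # uniform error on the sample grid
```

With 50 hinges the fit is piecewise linear with knot spacing 0.02, so the uniform error is far below 0.01; refining the threshold grid shrinks it further, as the density statement predicts.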


Theorem 2.1. Assume $X$ is a topological space, $\mathcal{A}(X)$ is a subclass of $C(X)$ with the $D$-property and $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ is a TW function. Then for any $\varepsilon>0$, any compact set $K\subset X$ and any function $g\in C(K)$ there exist $r\in\mathbb{N}$, $f_i\in\mathcal{A}(X)$, $c_i,w_i,\theta_i\in\mathbb{R}$, $i=1,\dots,r$, such that

\max_{x\in K}\left|g(x) - \sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i)\right| < \varepsilon.

That is, the family of shallow TFNNs with inputs from $X$ is dense in $C(X)$.


Proof. Take any $\varepsilon>0$, any compact set $K\subset X$ and any function $g\in C(K)$. Since $\mathcal{A}(X)$ has the $D$-property, there exist finitely many functions $u_i\in C(\mathbb{R})$ and $v_i\in\mathcal{A}(X)$ such that

\left|g(x) - \sum_{i=1}^{n} u_i(v_i(x))\right| < \varepsilon/2, \qquad (2.2)

for all $x\in K$.

Since the $v_i$ are continuous, the images $v_i(K)$ are compact sets in $\mathbb{R}$. Set $V=\bigcup_{i=1}^{n} v_i(K)$. Note that $V$ is also compact.

Since $\sigma$ is a TW function, each continuous univariate function $u_i(t)$, $t\in V$, can be approximated by single hidden layer networks with the activation function $\sigma$. Thus, there exist coefficients $c_{ij},w_{ij},\theta_{ij}\in\mathbb{R}$, $1\leq i\leq n$, $1\leq j\leq k_i$, such that

\left|u_i(t) - \sum_{j=1}^{k_i} c_{ij}\sigma(w_{ij}t - \theta_{ij})\right| < \varepsilon/(2n)

for all $t\in V$. Therefore,

\left|u_i(v_i(x)) - \sum_{j=1}^{k_i} c_{ij}\sigma(w_{ij}v_i(x) - \theta_{ij})\right| < \varepsilon/(2n) \qquad (2.3)

for each $i=1,\dots,n$ and all $x\in K$. It follows from (2.2) and (2.3) that

\left|g(x) - \sum_{i=1}^{n}\sum_{j=1}^{k_i} c_{ij}\sigma(w_{ij}v_i(x) - \theta_{ij})\right| < \varepsilon

for any $x\in K$. This completes the proof of Theorem 2.1.


Remark. Theorem 2.1 generalizes existing universal approximation theorems for traditional feedforward neural networks, since in traditional networks the space of continuous linear functionals on $\mathbb{R}^d$ serves as the basic family $\mathcal{A}(X)$, which clearly satisfies the $D$-property.


Note that, in particular, $X$ may be a topological vector space. For such a space, $X^{\ast}$ denotes the continuous dual of $X$, i.e., the space of continuous linear functionals defined on $X$. The following theorem is based on Theorem 2.1.


Theorem 2.2. Assume $X$ is a locally convex topological vector space (in particular, a normed space) and $\sigma$ is a continuous univariate function that is not a polynomial. Then for any $\varepsilon>0$, any compact set $K\subset X$ and any function $g\in C(K)$ there exist $r\in\mathbb{N}$, $f_i\in X^{\ast}$, $c_i,\theta_i\in\mathbb{R}$, $i=1,\dots,r$, such that

\max_{x\in K}\left|g(x) - \sum_{i=1}^{r} c_i \sigma(f_i(x) - \theta_i)\right| < \varepsilon.

The proof of this theorem relies on Theorem 2.1 and the following two facts.

Fact 1. The space $X^{\ast}$ possesses the $D$-property.

Let us prove this fact. Specifically, the property holds even if, in (2.1), instead of all $u\in C(\mathbb{R})$ we take the single function $u(t)=e^{t}$. That is, we claim that the set

S = \mathrm{span}\{e^{r(x)} : r\in X^{\ast}\}

is dense in $C(K)$ for every compact set $K\subset X$.

Indeed, it is not difficult to see that $S$ is a subalgebra of $C(X)$. To see this, note that for any $r_1,r_2\in X^{\ast}$,

e^{r_1(x)} e^{r_2(x)} = e^{r_1(x)+r_2(x)} \in S,

since $r_1+r_2\in X^{\ast}$. Therefore, the linear space $S$ is closed under multiplication, i.e., $S$ is an algebra.

Second, if $r$ is the zero functional, then $e^{r(x)}=1$, showing that $S$ contains the constant functions.

Now, since $X$ is locally convex, the Hahn-Banach extension theorem holds. It is a consequence of this theorem that for any distinct points $x_1,x_2\in X$ there exists a functional $r\in X^{\ast}$ such that $r(x_1)\neq r(x_2)$ (see [22, Theorem 3.6]). Hence, the algebra $S$ separates the points of $X$. By the Stone-Weierstrass theorem [23], for any compact $K\subset X$, the restriction of $S$ to $K$ is dense in $C(K)$. In other words, the space $X^{\ast}$ has the $D$-property.
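In the finite-dimensional case $X=\mathbb{R}^2$, the density of $\mathrm{span}\{e^{r(x)}\}$ can be observed numerically: a least-squares combination of a handful of exponentials $e^{ax+by}$ already approximates a continuous target well on a compact set. The exponent grid, target function, and tolerance below are illustrative assumptions:

```python
import numpy as np

# Compact set K = [0,1]^2, sampled on a grid.
u = np.linspace(0.0, 1.0, 30)
X, Y = np.meshgrid(u, u)
target = (X * Y).ravel()          # continuous target g(x, y) = x*y

# Basis e^{r(x,y)} for linear functionals r(x,y) = a*x + b*y;
# a = b = 0 contributes the constant function 1.
exps = [(a, b) for a in (0.0, 0.5, 1.0, 1.5) for b in (0.0, 0.5, 1.0, 1.5)]
A = np.column_stack([np.exp(a * X + b * Y).ravel() for a, b in exps])

coef, *_ = np.linalg.lstsq(A, target, rcond=None)
err = np.max(np.abs(A @ coef - target))   # uniform error on the grid
```

The 16 exponentials form a (highly correlated) dictionary, but the SVD-based `lstsq` still finds a combination whose uniform error on the grid is small, consistent with the Stone-Weierstrass argument above.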

Fact 2. A continuous nonpolynomial activation function $\sigma$ is a TW function.

This fact follows from the main result of [14], which states that a continuous nonpolynomial activation function provides the universal approximation property for traditional single hidden layer networks.


Let us now apply Theorem 2.1 to derive an approximative version of the renowned Kolmogorov superposition theorem (KST) for compact metric spaces. KST [10] states that for the unit cube $\mathbb{I}^d$, $\mathbb{I}=[0,1]$, $d\geq 2$, there exist $2d+1$ functions $\{s_q\}_{q=1}^{2d+1}\subset C(\mathbb{I}^d)$ of the form

$$s_{q}(x_{1},\ldots,x_{d})=\sum_{p=1}^{d}\varphi_{pq}(x_{p}),\quad \varphi_{pq}\in C(\mathbb{I}),\quad p=1,\ldots,d,\ q=1,\ldots,2d+1,$$

such that each function $f\in C(\mathbb{I}^{d})$ admits the representation

$$f(\mathbf{x})=\sum_{q=1}^{2d+1}g_{q}(s_{q}(\mathbf{x})),\quad \mathbf{x}=(x_{1},\ldots,x_{d})\in\mathbb{I}^{d},\quad g_{q}\in C(\mathbb{R}).$$
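Since the representation above is purely compositional, its data flow can be sketched in a few lines. The inner functions $\varphi_{pq}$ and outer functions $g_{q}$ below are hypothetical placeholders chosen only to make the sketch runnable; KST asserts their existence, not a formula for them.

```python
import math

# Shape-only sketch of the KST representation for d = 2: five (= 2d+1) outer
# terms, each applied to a sum of univariate inner functions.
d = 2
Q = 2 * d + 1  # number of outer terms

def phi(p, q, x):
    # hypothetical continuous inner function phi_pq on [0, 1]
    return math.sin((p + 1) * x + q)

def g(q, y):
    # hypothetical continuous outer function g_q on R
    return (q + 1) * y

def f(x):
    # f(x) = sum_{q} g_q(s_q(x)),  where  s_q(x) = sum_p phi_pq(x_p)
    return sum(g(q, sum(phi(p, q, x[p]) for p in range(d))) for q in range(Q))

print(f((0.3, 0.7)))
```

The point of the sketch is the architecture: every multivariate dependence passes through the fixed inner sums $s_{q}$, and only the univariate outer functions depend on $f$.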

This surprising and deep result, which solved (negatively) Hilbert's 13th problem, has been improved and generalized in several directions. For detailed information about KST, including its refinements, variants, and generalizations, see the monographs [9, Chapter 1] and [7, Chapter 4]. The relevance of KST to neural networks, along with its theoretical and computational aspects, has been extensively discussed in the neural network literature (see, e.g., [8] and the references therein).

Ostrand [18] extended KST to general compact metric spaces as follows.


Theorem 2.3 (Ostrand [18]). For $p=1,2,\ldots,n$, let $X_{p}$ be a compact metric space of finite dimension $d_{p}$ and let $m=\sum_{p=1}^{n}d_{p}$. There exist universal continuous functions $\psi_{pq}:X_{p}\rightarrow[0,1]$, $p=1,\ldots,n$, $q=1,\ldots,2m+1$, such that every continuous function $g$ defined on $\prod_{p=1}^{n}X_{p}$ is representable in the form

$$g(x_{1},\ldots,x_{n})=\sum_{q=1}^{2m+1}h_{q}\left(\sum_{p=1}^{n}\psi_{pq}(x_{p})\right),$$

where the $h_{q}$ are continuous functions depending on $g$.


It follows from this theorem that for the metric space $X=\prod_{p=1}^{n}X_{p}$ the family of $2m+1$ functions

$$\mathcal{K}(X)=\left\{\sum_{p=1}^{n}\psi_{pq}(x_{p}):q=1,\ldots,2m+1\right\}$$

satisfies the $D$-property in $C(X)$.

Let now $\sigma$ be a specific infinitely differentiable TW function with the property that, for any interval $[a,b]$, the set $\Lambda=\{\sigma(wx-\theta):w,\theta\in\mathbb{R}\}$ is dense in $C[a,b]$. Note that $\Lambda$ is not the linear span of the functions $\sigma(wx-\theta)$, but rather a very narrow subclass of it. Such functions $\sigma$ do indeed exist.

To show this, let $\alpha$ be any positive real number. Divide the interval $[\alpha,+\infty)$ into the segments $[\alpha,2\alpha],[2\alpha,3\alpha],\ldots$. Let $\{p_{n}(t)\}_{n=1}^{\infty}$ be an enumeration of the (countably many) polynomials with rational coefficients, considered on $[0,1]$. We construct $\sigma$ in two stages. In the first stage, we define $\sigma$ on the closed intervals $[(2m-1)\alpha,2m\alpha]$, $m=1,2,\ldots$, as the function

$$\sigma(t)=p_{m}\left(\frac{t}{\alpha}-2m+1\right),\quad t\in[(2m-1)\alpha,2m\alpha],$$

or equivalently,

$$\sigma(\alpha t+(2m-1)\alpha)=p_{m}(t),\quad t\in[0,1].\qquad(2.4)$$

In the second stage, we extend $\sigma$ to the intervals $(2m\alpha,(2m+1)\alpha)$, $m=1,2,\ldots$, and $(-\infty,\alpha)$, maintaining the $C^{\infty}$ property.
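The first stage of the construction can be mimicked numerically. In the sketch below the infinite enumeration $\{p_{n}\}$ is replaced by three illustrative rational-coefficient polynomials and $\alpha=1$ is a hypothetical choice; the code only checks the defining identity (2.4) on the segments where this partial $\sigma$ is defined.

```python
import math
from fractions import Fraction

# Illustrative stand-in for the enumeration {p_n} of rational-coefficient
# polynomials on [0,1] (the genuine enumeration is infinite; these three
# are hypothetical examples, stored as coefficient lists).
POLYS = [
    [Fraction(1, 2)],                                 # p_1(t) = 1/2
    [Fraction(0), Fraction(1)],                       # p_2(t) = t
    [Fraction(1), Fraction(-3, 2), Fraction(1, 2)],   # p_3(t) = 1 - 3t/2 + t^2/2
]

ALPHA = 1.0  # any positive real number

def eval_poly(coeffs, t):
    """Evaluate a polynomial given by its coefficient list at t."""
    return sum(float(c) * t**k for k, c in enumerate(coeffs))

def sigma_first_stage(t):
    """sigma on the segments [(2m-1)a, 2ma], per (2.4):
    sigma(t) = p_m(t/alpha - 2m + 1)."""
    m = math.ceil(t / (2 * ALPHA))  # segment index with t in [(2m-1)a, 2ma]
    if m < 1 or m > len(POLYS) or not ((2*m - 1) * ALPHA <= t <= 2*m * ALPHA):
        raise ValueError("t lies outside the segments defined in this sketch")
    return eval_poly(POLYS[m - 1], t / ALPHA - 2*m + 1)

# Check identity (2.4): sigma(alpha*t + (2m-1)*alpha) = p_m(t) on [0,1].
for m, p in enumerate(POLYS, start=1):
    for t in [0.0, 0.25, 0.5, 1.0]:
        assert abs(sigma_first_stage(ALPHA*t + (2*m - 1)*ALPHA) - eval_poly(p, t)) < 1e-12
```

The second stage (a $C^{\infty}$ extension across the gaps) is a standard smoothing argument and is not reproduced here.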

For any univariate function $h\in C[0,1]$ and any $\varepsilon>0$ there exists a polynomial $p(t)$ with rational coefficients such that

$$\left|h(t)-p(t)\right|<\varepsilon,$$

for all $t\in[0,1]$. This together with (2.4) means that

$$\left|h(t)-\sigma(\alpha t-s)\right|<\varepsilon,\qquad(2.5)$$

for some $s\in\mathbb{R}$ and all $t\in[0,1]$; indeed, if $p=p_{m}$ in the enumeration, then (2.4) gives (2.5) with $s=-(2m-1)\alpha$.

Using a linear transformation, it is not difficult to pass from $[0,1]$ to any finite closed interval $[a,b]$. Indeed, let $u\in C[a,b]$, let $\sigma$ be constructed as above, and let $\varepsilon$ be an arbitrarily small positive number. The transformed function $h(t)=u(a+(b-a)t)$ is well defined on $[0,1]$, and we can apply inequality (2.5) to it. Using the inverse transformation $t=\frac{x-a}{b-a}$, we can write

$$\left|u(x)-\sigma(wx-\theta)\right|<\varepsilon,\qquad(2.6)$$

for all $x\in[a,b]$, where $w=\frac{\alpha}{b-a}$ and $\theta=\frac{\alpha a}{b-a}+s$.

We call an activation function $\sigma$ a superactivation function if it satisfies (2.6) for any $u\in C[a,b]$ and $\varepsilon>0$ with some $w,\theta\in\mathbb{R}$. Such functions demonstrate that shallow networks can approximate univariate continuous functions with the minimal number of hidden neurons; in fact, a single hidden neuron is sufficient. Similar activation functions $\sigma$, with the additional properties of monotonicity and sigmoidality, were algorithmically constructed in [5] and utilized in practical examples. It should be remarked that the existence of activation functions that ensure universal approximation for single and two hidden layer neural networks with a fixed number of hidden units was first established in [17].

If in Theorem 2.1 we take $\mathcal{A}(X)=\mathcal{K}(X)$ and any superactivation function $\sigma$, then the number of terms $r$ will be $2m+1$. To see this, it suffices to repeat the proof, noting that $n=2m+1$ and $k=1$. This observation leads to the following theorem.


Theorem 2.4. For $p=1,2,\ldots,n$, let $X_{p}$ be a compact metric space of finite dimension $d_{p}$ and let $m=\sum_{p=1}^{n}d_{p}$. There exist universal continuous functions $\psi_{pq}:X_{p}\rightarrow[0,1]$, $p=1,\ldots,n$, $q=1,\ldots,2m+1$, and an infinitely differentiable function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$ such that for every continuous function $g$ defined on $X=\prod_{p=1}^{n}X_{p}$ and any $\varepsilon>0$ there exist $w_{q},\theta_{q}\in\mathbb{R}$, $q=1,\ldots,2m+1$, such that

$$\left|g(x_{1},\ldots,x_{n})-\sum_{q=1}^{2m+1}\sigma\left(w_{q}\sum_{p=1}^{n}\psi_{pq}(x_{p})-\theta_{q}\right)\right|<\varepsilon,$$

for all $(x_{1},\ldots,x_{n})\in X$.


Note that in Theorem 2.4 the outer function $\sigma$ does not depend on $g$; the only parameters that depend on $g$ are the numbers $\theta_{q}$. The numbers $w_{q}$ can be taken equal and fixed once and for all. This is evident from the construction of $\sigma$ above (see (2.6), where $w$ is fixed for all $u$). For example, if we set $\alpha=b-a$, where $[a,b]$ is a closed interval containing all the sets $\Psi_{q}(X)$, with $\Psi_{q}(x_{1},\ldots,x_{n})=\sum_{p=1}^{n}\psi_{pq}(x_{p})$, $q=1,\ldots,2m+1$, then the $w_{q}$ can all be taken equal to $1$.


References

  • [1] F. E. Benth, N. Detering, and L. Galimberti, Neural networks in Fréchet spaces, Ann. Math. Artif. Intell. 91 (2023), 75-103.
  • [2] T. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Trans. Neural Netw. 6 (1995), no. 4, 911-917.
  • [3] L. Galimberti, Neural networks in non-metric spaces, arXiv preprint, arXiv:2406.09310 [math.FA], 2024.
  • [4] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016.
  • [5] N. J. Guliyev and V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Netw. 98 (2018), 296-304.
  • [6] N. J. Guliyev and V. E. Ismailov, Approximation capability of two hidden layer feedforward neural networks with fixed weights, Neurocomputing 316 (2018), 262-269.
  • [7] V. E. Ismailov, Ridge Functions and Applications in Neural Networks, Mathematical Surveys and Monographs, 263. American Mathematical Society, 2021.
  • [8] A. Ismayilova, V. E. Ismailov, On the Kolmogorov neural networks, Neural Netw. 176 (2024), Paper No. 106333.
  • [9] S. Ya. Khavinson, Best approximation by linear superpositions (approximate nomography), Translated from the Russian manuscript by D. Khavinson. Translations of Mathematical Monographs, 159. American Mathematical Society, Providence, RI, 1997, 175 pp.
  • [10] A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. (Russian), Dokl. Akad. Nauk SSSR 114 (1957), 953-956.
  • [11] Y. Korolev, Two-layer neural networks with values in a Banach space, SIAM J. Math. Anal. 54 (2022), no. 6, 6358-6389.
  • [12] A. Kratsios, T. Furuya, A. Lara, M. Lassas, M. de Hoop, Mixture of experts soften the curse of dimensionality in operator learning, arXiv preprint, arXiv:2404.09101 [cs.LG], 2024.
  • [13] S. Lanthaler, S. Mishra and G. E. Karniadakis, Error estimates for DeepONets: a deep learning framework in infinite dimensions, Trans. Math. Appl. 6 (2022), no. 1, tnac001, 141 pp.
  • [14] M. Leshno, V. Ya. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993), 861-867.
  • [15] W. Light, Ridge functions, sigmoidal functions and neural networks, Approximation theory VII (Austin, TX, 1992), 163-206, Academic Press, Boston, MA, 1993.
  • [16] L. Lu, P. Jin, G. Pang, Z. Zhang and G. E. Karniadakis, Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nat. Mach. Intell. 3 (2021), no. 3, 218-229.
  • [17] V. Maiorov, A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing 25 (1999), 81-91.
  • [18] P. A. Ostrand, Dimension of metric spaces and Hilbert's problem 13, Bull. Amer. Math. Soc. 71 (1965), 619-622.
  • [19] P. Petersen and J. Zech, Mathematical theory of deep learning, arXiv preprint, arXiv:2407.18384 [cs.LG], 2024.
  • [20] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica 8 (1999), 143-195.
  • [21] S. Roman, Advanced linear algebra, Third edition. Graduate Texts in Mathematics, 135. Springer, New York, 2008.
  • [22] W. Rudin, Functional analysis, Second edition. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, 1991, 424 pp.
  • [23] M. H. Stone, The generalized Weierstrass approximation theorem, Math. Mag. 21 (1948), 167-184, 237-254.
  • [24] X. Sun, E. W. Cheney, The fundamentality of sets of ridge functions, Aequationes Math. 44 (1992), no. 2-3, 226-235.
  • [25] M. E. Valle, W. L. Vital and G. Vieira, Universal approximation theorem for vector- and hypercomplex-valued neural networks, Neural Netw. 180 (2024), 106632.