On shallow feedforward neural networks with inputs from a
topological space
Vugar E. Ismailov
Institute of Mathematics and Mechanics, Baku, Azerbaijan
Center for Mathematics and its Applications, Khazar University, Baku,
Azerbaijan
e-mail: [email protected]
Abstract. We study feedforward neural networks with inputs from a
topological space (TFNNs). We prove a universal approximation theorem for
shallow TFNNs, which demonstrates their capacity to approximate any
continuous function defined on this topological space. As an application, we
obtain an approximative version of Kolmogorov’s superposition theorem for
compact metric spaces.
Mathematics Subject Classifications: 41A30, 41A65, 68T05
Keywords: feedforward neural network, universal approximation
theorem, density, topological vector space, Tauber-Wiener function,
Kolmogorov’s superposition theorem
Neural networks are fundamental to contemporary machine learning and
artificial intelligence, providing robust methods for tackling intricate
challenges. Among the different neural network designs, the multilayer feedforward perceptron (MLP) is particularly prominent and
essential. The MLP is valued for its capability to model complex, nonlinear
functions and execute various tasks, including classification, regression,
and pattern recognition.
This architecture consists of a limited number of sequential layers: an
input layer at the beginning, an output layer at the end, and several hidden
layers in between. Information progresses from the input layer through the
hidden layers to the output layer. In this framework, each neuron in a layer
receives inputs from the previous layer, applies specific weights, adds a
bias, and then processes the result through an activation function. This
activation function introduces non-linearity, allowing the model to learn
and capture intricate patterns. The output from one layer’s neurons serves
as the input for the neurons in the next layer, continuing this sequence
until the final output is generated by the output layer.
The most basic form of an MLP features just one hidden layer. In this setup, each output neuron calculates a function expressed as

$$\sum_{i=1}^{r} c_i \sigma(\mathbf{w}^i \cdot \mathbf{x} - \theta_i), \qquad (1.1)$$

where $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$ represents the input vector, $r$ is the number of neurons in the hidden layer, $\mathbf{w}^i$ are weight vectors in $\mathbb{R}^d$, $\theta_i$ are thresholds, $c_i$ are coefficients, and $\sigma$ is the activation function, a real univariate function.
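As a concrete illustration, formula (1.1) can be sketched in a few lines of NumPy; the sizes, the tanh activation and all parameter values below are arbitrary choices for the example, not part of the text.

```python
import numpy as np

def shallow_mlp(x, W, theta, c, sigma=np.tanh):
    """Evaluate sum_i c_i * sigma(w^i . x - theta_i) at the input x.

    x     : (d,) input vector
    W     : (r, d) matrix whose rows are the weight vectors w^i
    theta : (r,) thresholds
    c     : (r,) outer coefficients
    """
    return c @ sigma(W @ x - theta)

rng = np.random.default_rng(0)
d, r = 3, 5
x = rng.standard_normal(d)
W = rng.standard_normal((r, d))
theta = rng.standard_normal(r)
c = rng.standard_normal(r)
y = shallow_mlp(x, W, theta, c)  # a single real output
```

Note that the output is linear in the coefficients $c_i$, which is what makes the least-squares fits used later in such illustrations possible.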
The theoretical underpinning of neural networks is rooted in the universal approximation property (UAP), sometimes referred to as the density property. This principle states that a neural network with a single hidden layer can approximate any continuous function over a compact domain to any desired level of precision. Specifically, the set $\mathcal{M}(\sigma) = \operatorname{span}\{\sigma(\mathbf{w} \cdot \mathbf{x} - \theta) : \mathbf{w} \in \mathbb{R}^d, \ \theta \in \mathbb{R}\}$, which comprises functions defined in the format of equation (1.1), is dense in $C(Q)$ for every compact set $Q \subset \mathbb{R}^d$. Here $C(Q)$ represents the space of real-valued continuous functions on $Q$. This important result in neural network theory is known as the universal approximation theorem (UAT).
Extensive research has investigated the UAT across various activation functions $\sigma$, examining how different choices influence the approximation capabilities of neural networks. The most general result in this area was obtained by Leshno, Lin, Pinkus and Schocken [14]. They proved that a continuous activation function $\sigma$ possesses the UAP if and only if it is not a polynomial. This result demonstrates the effectiveness of the single hidden layer perceptron model across a wide range of activation functions $\sigma$. It should be noted that the universal approximation theorem in [14] was shown to apply to a broader class of activation functions beyond just continuous ones, including activation functions that may have discontinuities on sets of Lebesgue measure zero. However, this paper will specifically concentrate on continuous activation functions. For a thorough, step-by-step proof of this theorem, refer to [19, 20].
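The dichotomy in [14] can be observed numerically: with a polynomial activation such as $t^2$, the span of $\sigma(wx - \theta)$ collapses to polynomials of the same bounded degree, while a non-polynomial activation (here ReLU) suffers no such obstruction. The grids and the target below are arbitrary choices for the example.

```python
import numpy as np

x = np.linspace(-1, 1, 400)
target = np.sin(4 * x)

def fit_max_err(sigma):
    """Best least-squares fit of `target` from span{sigma(w*x - theta)}
    over a fixed grid of weights w and thresholds theta; returns the
    maximum absolute error of the fit on the grid."""
    w = np.repeat(np.linspace(0.5, 8.0, 12), 11)
    th = np.tile(np.linspace(-8.0, 8.0, 11), 12)
    A = sigma(np.outer(x, w) - th)
    c, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.max(np.abs(A @ c - target)))

relu_err = fit_max_err(lambda t: np.maximum(t, 0.0))  # non-polynomial
quad_err = fit_max_err(lambda t: t ** 2)  # span is {1, x, x^2} only
```

The quadratic activation cannot track the oscillating target, no matter how many hidden units are added, while the ReLU fit succeeds on the same grid of parameters.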
In the past, it was commonly accepted and highlighted in numerous studies that attaining the universal approximation property necessitates large networks with a substantial number of hidden neurons (see, e.g., [4, Chapter 6.4.1]). In the above-mentioned earlier works, the number of hidden neurons was regarded as unbounded. However, more recent research [5, 6, 7] has demonstrated that neural networks using certain non-explicit but practically computable activation functions can approximate any continuous function over any compact set to any desired level of accuracy, even with a minimal and fixed number of hidden neurons.
Note that the inner product $\mathbf{w} \cdot \mathbf{x}$ in (1.1) represents a linear continuous functional on $\mathbb{R}^d$. Conversely, by the Riesz representation theorem, every linear functional on $\mathbb{R}^d$ is of the form $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, where $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{x}$ is the variable (see [21, Theorem 13.32]). Linear continuous functionals constitute a significant subclass in $C(\mathbb{R}^d)$, denoted here by $\mathcal{L}$. Thus, the UAT asserts that for certain activation functions $\sigma$ and any compact set $Q \subset \mathbb{R}^d$, the set

$$\operatorname{span}\{\sigma(f(\mathbf{x}) - \theta) : f \in \mathcal{L}, \ \theta \in \mathbb{R}\}$$

is dense in $C(Q)$. This observation suggests the following generalization of single hidden layer networks from $\mathbb{R}^d$ to any topological space $X$, where $\mathcal{L}$ is replaced with a fixed family $\mathcal{F}$ of functions (which need not be linear) from $C(X)$. We refer to such a family as a basic family for the feedforward neural networks with inputs from a topological space (TFNNs). If $\mathcal{F} \subseteq C(X)$ is a basic family, then the architecture of a single hidden layer TFNN can be described as follows:
• Input layer: This layer consists of an element $x \in X$, where $X$ is an arbitrary topological space.

• Hidden layer: Each neuron in the hidden layer takes the input $x$ from the input layer and applies a function $f \in \mathcal{F}$ to $x$. This value is then multiplied by a weight $w$. A shift $-\theta$ and then a fixed activation function $\sigma : \mathbb{R} \to \mathbb{R}$ are applied to $w f(x)$. The resulting value $\sigma(w f(x) - \theta)$ represents the output signal of the neuron.

• Output layer: Each neuron in this layer receives weighted signals from each neuron in the hidden layer, sums them up, and produces the final output value.
This architecture significantly extends the traditional feedforward neural networks. When $X = \mathbb{R}^d$ and $\mathcal{F}$ is the family of linear functionals $\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x}$, the input $x$ represents a $d$-dimensional vector $(x_1, \ldots, x_d)$. In this very special case, the input layer contains $d$ traditional neurons, each receiving an input signal $x_1, \ldots, x_d$, respectively. Note that in the aforementioned architecture, the element $x$ carries all the information of the input layer. This structure enables the network to accommodate a wide diversity of input types. In general, a single hidden layer TFNN computes a function of the form

$$\sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i), \qquad (1.2)$$

where $x \in X$ is the input, $c_i, w_i, \theta_i \in \mathbb{R}$ and $f_i \in \mathcal{F}$ are the parameters of the network, and $\sigma$ is a fixed activation function.
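The forward pass just described can be sketched as follows; the input space (finite planar point sets, which form no vector space) and the three summary functions playing the role of the basic family are hypothetical choices made only for this example.

```python
import numpy as np

def tfnn(x, basic_family, w, theta, c, sigma=np.tanh):
    """Single-hidden-layer TFNN: sum_i c_i * sigma(w_i * f_i(x) - theta_i),
    where each f_i is drawn from a fixed basic family of real-valued
    functions on the input space X."""
    feats = np.array([f(x) for f in basic_family])
    return c @ sigma(w * feats - theta)

# Hypothetical basic family on X = finite point sets in the plane:
basic_family = [
    lambda pts: float(len(pts)),                                # cardinality
    lambda pts: float(np.mean([p[0] for p in pts])),            # mean x-coord
    lambda pts: float(max(np.hypot(p[0], p[1]) for p in pts)),  # max radius
]

pts = [(0.0, 1.0), (2.0, 2.0), (-1.0, 0.5)]
w = np.ones(3)
theta = np.zeros(3)
c = np.array([1.0, -1.0, 0.5])
out = tfnn(pts, basic_family, w, theta, c)
```

Here one hidden neuron is attached to each member of the family; in general several neurons may reuse the same $f_i$ with different weights and thresholds, exactly as in (1.2).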
The aim of this paper is to show that for a broad class of activation functions $\sigma$, neural networks of the form presented in (1.2) can approximate any continuous function on a compact subset $Q \subset X$ with arbitrary precision. In other words, the set

$$\operatorname{span}\{\sigma(w f(x) - \theta) : f \in \mathcal{F}, \ w, \theta \in \mathbb{R}\}$$

is dense in $C(Q)$ for every compact set $Q \subset X$. As an application of this result, we will derive an approximative version of the Kolmogorov superposition theorem (KST) for compact metric spaces, where outer (non-fixed and generally nonsmooth) functions are substituted with a fixed, infinitely differentiable function.
It should be noted that the UAP of neural networks operating between Banach spaces has been explored in various studies. For example, in [24], the fundamentality of ridge functions was established in a Banach space and subsequently applied to shallow networks with a sigmoidal activation function (see also [15]). In [2], the authors showed that any continuous nonlinear function mapping a compact set $V$ in a Banach space $C(K)$ of continuous functions into $\mathbb{R}$ can be approximated arbitrarily well by shallow feedforward neural networks. Here $V$ and $K$ represent two compact sets in an abstract Banach space and the Euclidean space $\mathbb{R}^d$, respectively. In [16], this approach was extended to deep neural networks and referred to as DeepONet. In [13], the authors examined DeepONet within the context of an encoder-decoder network framework, investigating its approximation properties when the input space is a Hilbert space. In [11], quantitative estimates (i.e., convergence rates) for the approximation of nonlinear operators using single hidden layer networks in infinite-dimensional Banach spaces were provided, extending some previous results from the finite-dimensional case.
The UAP of infinite-dimensional neural networks, with inputs from Fréchet spaces and outputs from Banach spaces, was established in [1].
In [3], the scope of this architecture was extended by proving
several universal approximation theorems for quasi-Polish input and output
spaces.
In [12], universal approximation theorems were obtained for neural operators (NOs) and mixtures of neural operators (MoNOs) acting between Sobolev spaces. More precisely, it was shown that any non-linear continuous operator acting between Sobolev spaces $H^{s_1}$ and $H^{s_2}$ can be uniformly approximated over any compact set $K \subset H^{s_1}$ with arbitrary accuracy $\varepsilon > 0$ using NOs and MoNOs. Moreover, the quantitative results of [12] estimate the depth, width, and rank of the neural operators in terms of the radius of $K$ and the accuracy $\varepsilon$.
Recent research has demonstrated the universal approximation theorem (UAT)
for various hypercomplex-valued neural networks, including complex-,
quaternion-, tessarine-, and Clifford-valued networks, as well as more
general vector-valued neural networks (V-nets) defined over a
finite-dimensional algebra (see [25] and references therein). We
hope that the results of this paper will stimulate further exploration of
these neural networks, particularly with outputs from these and other
general spaces.
In this section, we analyze the conditions under which shallow networks with
inputs from a topological space possess the universal approximation property.
Assume $X$ is an arbitrary topological space. In the sequel, in $C(X)$, we will use the topology of uniform convergence on compact sets. This topology is induced by the seminorms

$$p_Q(g) = \max_{x \in Q} |g(x)|,$$

where $Q$ are compact sets in $X$. A subbasis at the origin for this topology is given by the sets

$$V(Q, \varepsilon) = \{g \in C(X) : p_Q(g) < \varepsilon\},$$

where $Q \subset X$ is compact and $\varepsilon > 0$. A sequence (or net) $\{g_n\}$ in this topology converges to $g$ iff $p_Q(g_n - g) \to 0$ for every compact set $Q \subset X$. Thus, in what follows, when we say that a set $\mathcal{D} \subseteq C(X)$ is dense in $C(X)$, we will mean that $\mathcal{D}$ is dense with respect to the aforementioned topology of uniform convergence on compact sets.
We say that a subclass $\mathcal{F} \subseteq C(X)$ has the $\mathcal{D}$-property if the set

$$\mathcal{D}(\mathcal{F}) = \operatorname{span}\{g \circ f : f \in \mathcal{F}, \ g \in C(\mathbb{R})\} \qquad (2.1)$$

is dense in $C(X)$.
In what follows, we will use activation functions $\sigma$ (whether continuous or discontinuous) with the property that the set $\operatorname{span}\{\sigma(wt - \theta) : w, \theta \in \mathbb{R}\}$ is dense in every $C[a, b]$. Such functions are called Tauber-Wiener (TW) functions (see [2]).
Theorem 2.1. Assume $X$ is a topological space, $\mathcal{F}$ is a subclass of $C(X)$ with the $\mathcal{D}$-property and $\sigma$ is a TW function. Then for any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$ there exist $c_i, w_i, \theta_i \in \mathbb{R}$, $f_i \in \mathcal{F}$, $i = 1, \ldots, r$, such that

$$\left| g(x) - \sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i) \right| < \varepsilon \quad \text{for all } x \in Q.$$

That is, the set of TFNNs with inputs from $X$ is dense in $C(Q)$.
Proof. Take any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$. Since $\mathcal{F}$ has the $\mathcal{D}$-property, there exist finitely many functions $f_1, \ldots, f_k \in \mathcal{F}$ and $g_1, \ldots, g_k \in C(\mathbb{R})$ such that

$$\left| g(x) - \sum_{i=1}^{k} g_i(f_i(x)) \right| < \frac{\varepsilon}{2} \qquad (2.2)$$

for all $x \in Q$.

Since $f_1, \ldots, f_k$ are continuous, the images $f_i(Q)$ are compact sets in $\mathbb{R}$. Set $Y = \bigcup_{i=1}^{k} f_i(Q)$. Note that $Y$ is also compact. Since $\sigma$ is a TW function, each continuous univariate function $g_i$, $i = 1, \ldots, k$, can be approximated on an interval $[a, b] \supseteq Y$ by single hidden layer networks with the activation function $\sigma$. Thus, there exist coefficients $c_{ij}, w_{ij}, \theta_{ij} \in \mathbb{R}$, $i = 1, \ldots, k$, $j = 1, \ldots, m_i$, such that

$$\left| g_i(t) - \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} t - \theta_{ij}) \right| < \frac{\varepsilon}{2k} \qquad (2.3)$$

for all $t \in Y$. Therefore,

$$\left| g_i(f_i(x)) - \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} f_i(x) - \theta_{ij}) \right| < \frac{\varepsilon}{2k}$$

for each $i = 1, \ldots, k$, and all $x \in Q$. It follows from (2.2) and (2.3) that

$$\left| g(x) - \sum_{i=1}^{k} \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} f_i(x) - \theta_{ij}) \right| < \varepsilon$$

for any $x \in Q$. This completes the proof of Theorem 2.1.
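The two-stage argument of the proof can be illustrated numerically under assumed choices: $X$ is the unit circle, $Q$ its upper half, and the basic family is the single coordinate function $f(x_1, x_2) = x_1$, which is injective on $Q$; stage two fits the univariate outer function by a one-dimensional tanh network via least squares.

```python
import numpy as np

# Stage 1: Q = upper half of the unit circle, basic family F = {f},
# f(x1, x2) = x1.  Since f is injective on Q, every continuous G on Q
# equals g(f(x)) for a continuous univariate g (D-property realized
# exactly, so the first-stage error is zero in this toy case).
angles = np.linspace(0.0, np.pi, 400)
Q = np.stack([np.cos(angles), np.sin(angles)], axis=1)
G = np.exp(Q[:, 0])          # target G(x) = exp(x1); here g(t) = exp(t)
t = Q[:, 0]                  # the values f(x) for x in Q; f(Q) = [-1, 1]

# Stage 2: approximate the outer function g by a univariate tanh network
# sum_j c_j tanh(w_j t - th_j), fitted by least squares on a (w, th) grid.
w = np.repeat(np.linspace(0.5, 6.0, 12), 9)
th = np.tile(np.linspace(-6.0, 6.0, 9), 12)
A = np.tanh(np.outer(t, w) - th)
c, *_ = np.linalg.lstsq(A, G, rcond=None)

# Composed TFNN sum_j c_j tanh(w_j f(x) - th_j), evaluated on Q.
approx = A @ c
err = np.max(np.abs(G - approx))
```

The composed network approximates $G$ uniformly on $Q$ even though $Q$ is a curve, not a Euclidean domain, mirroring how (2.2) and (2.3) combine in the proof.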
Remark. Theorem 2.1 generalizes existing universal approximation theorems for traditional feedforward neural networks. This is because in traditional networks, the space of linear continuous functionals on $\mathbb{R}^d$ serves as the basic family $\mathcal{F}$, which clearly satisfies the $\mathcal{D}$-property.

Note that, in particular, $X$ may be a topological vector space. For such a space, $X^*$ denotes the continuous dual of $X$, which is the space of linear continuous functionals defined on $X$. The following theorem is based on Theorem 2.1.
Theorem 2.2. Assume $X$ is a locally convex topological vector space (in particular, a normed space) and $\sigma$ is a continuous univariate function that is not a polynomial. Then for any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$ there exist $c_i, \theta_i \in \mathbb{R}$, $f_i \in X^*$, $i = 1, \ldots, r$, such that

$$\left| g(x) - \sum_{i=1}^{r} c_i \sigma(f_i(x) - \theta_i) \right| < \varepsilon \quad \text{for all } x \in Q.$$
The proof of this theorem relies on Theorem 2.1 and the following two facts.
Fact 1. The space $X^*$ possesses the $\mathcal{D}$-property.

Let us prove this fact. Specifically, this property holds if in (2.1), instead of all $g \in C(\mathbb{R})$, we take the single function $g(t) = e^t$. That is, we claim that the set

$$\mathcal{D}_0 = \operatorname{span}\{e^{f(x)} : f \in X^*\}$$

is dense in $C(Q)$ for every compact set $Q \subset X$.

Indeed, first it is not difficult to see that $\mathcal{D}_0$ is a subalgebra of $C(X)$. To see this, note that for any $f_1, f_2 \in X^*$

$$e^{f_1(x)} \cdot e^{f_2(x)} = e^{(f_1 + f_2)(x)},$$

since $f_1 + f_2 \in X^*$. Therefore, the linear space $\mathcal{D}_0$ is closed under multiplication, indicating that $\mathcal{D}_0$ is an algebra. Second, if $f$ is the zero functional, then $e^{f(x)} = 1$, showing that $\mathcal{D}_0$ contains all constant functions.

Now since $X$ is locally convex, the Hahn-Banach extension theorem holds. It is a consequence of this theorem that for any distinct points $x_1, x_2 \in X$ there exists a functional $f \in X^*$ such that $f(x_1) \neq f(x_2)$ (see [22, Theorem 3.6]). Hence, the algebra $\mathcal{D}_0$ separates points in $X$. By the Stone-Weierstrass theorem [23], for any compact $Q \subset X$, the algebra $\mathcal{D}_0$ restricted to $Q$ is dense in $C(Q)$. In other words, the space $X^*$ has the $\mathcal{D}$-property.
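The Stone-Weierstrass device behind Fact 1 can be checked numerically in the special case $X = \mathbb{R}^2$, where the continuous linear functionals are $x \mapsto \mathbf{w} \cdot x$; the target function and the random draw of functionals below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Compact set Q: a grid in [0, 1]^2.  Continuous linear functionals on
# R^2 are x -> w . x, so the algebra of Fact 1 is spanned by exp(w . x).
g1, g2 = np.meshgrid(np.linspace(0, 1, 25), np.linspace(0, 1, 25))
Q = np.stack([g1.ravel(), g2.ravel()], axis=1)

# Closure under products: exp(w1 . x) * exp(w2 . x) = exp((w1 + w2) . x),
# which is again a generator, so the span is an algebra.
w1, w2 = np.array([0.3, -1.0]), np.array([0.7, 0.4])
closure_gap = np.max(np.abs(np.exp(Q @ w1) * np.exp(Q @ w2)
                            - np.exp(Q @ (w1 + w2))))

# Density in practice: a least-squares fit of an arbitrary target from
# span{exp(w . x)} with 60 randomly drawn functionals w.
target = np.sin(3 * Q[:, 0]) * Q[:, 1]
W = rng.uniform(-3, 3, size=(60, 2))
A = np.exp(Q @ W.T)
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
fit_err = np.max(np.abs(A @ coef - target))
```

The product identity holds up to round-off, confirming the algebra structure, and the fit error is far below that of any constant approximant, as Stone-Weierstrass predicts.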
Fact 2. A continuous nonpolynomial activation function $\sigma$ is a TW function.

This fact follows from the main result of [14] that a continuous nonpolynomial activation function provides the universal approximation property for traditional single hidden layer networks.
Let us now apply Theorem 2.1 to derive an approximative version of the renowned Kolmogorov superposition theorem (KST) for compact metric spaces. KST [10] states that for the unit cube $[0, 1]^d$, $d \geq 2$, there exist functions of the form

$$s_q(x_1, \ldots, x_d) = \sum_{p=1}^{d} \phi_{pq}(x_p), \quad q = 1, \ldots, 2d + 1,$$

with continuous univariate functions $\phi_{pq}$, such that each function $f \in C([0, 1]^d)$ admits the representation

$$f(x_1, \ldots, x_d) = \sum_{q=1}^{2d+1} g_q(s_q(x_1, \ldots, x_d)),$$

where $g_q$ are continuous univariate functions depending on $f$. This surprising and deep result, which solved (negatively) Hilbert's 13th problem, has been improved and generalized in several directions. For detailed information about KST, including its refinements, various variants, and generalizations, see the monographs [9, Chapter 1] and [7, Chapter 4]. The relevance of KST to neural networks, along with its theoretical and computational aspects, has been extensively discussed in the neural network literature (see, e.g., [8] and references therein).
Ostrand [18] extended KST to general compact metric spaces as follows.

Theorem 2.3 (Ostrand [18]). For $p = 1, \ldots, m$, let $X_p$ be a compact metric space of finite dimension $d_p$ and let $n = 2(d_1 + \cdots + d_m) + 1$. There exist universal continuous functions $\phi_{pq} : X_p \to [0, 1]$, $p = 1, \ldots, m$, $q = 1, \ldots, n$, such that every continuous function $f$ defined on $X_1 \times \cdots \times X_m$ is representable in the form

$$f(x_1, \ldots, x_m) = \sum_{q=1}^{n} g_q\left( \sum_{p=1}^{m} \phi_{pq}(x_p) \right),$$

where $g_q$ are continuous functions depending on $f$.
It follows from this theorem that for the metric space $X = X_1 \times \cdots \times X_m$ the family of functions

$$\mathcal{F} = \left\{ \sum_{p=1}^{m} \phi_{pq}(x_p) : q = 1, \ldots, n \right\}$$

satisfies the $\mathcal{D}$-property in $C(X)$.
Let now $\sigma$ be a specific infinitely differentiable TW function with the property that, for any interval $[a, b]$, the set $\{\sigma(wt - \theta) : w, \theta \in \mathbb{R}\}$ is dense in $C[a, b]$. Note that this set is not the linear span of the functions $\sigma(wt - \theta)$, but rather a very narrow subclass of it. Such functions $\sigma$ do indeed exist.

To show this, let $\lambda$ be any positive real number. Divide the interval $[0, +\infty)$ into the segments $[0, \lambda], [\lambda, 2\lambda], [2\lambda, 3\lambda], \ldots$. Let $\{u_k\}_{k=0}^{\infty}$ be the sequence of polynomials with rational coefficients defined on $[0, 1]$. We construct $\sigma$ in two stages. In the first stage, we define $\sigma$ on the closed intervals $[2k\lambda, (2k+1)\lambda]$, $k = 0, 1, 2, \ldots$, as the function

$$\sigma(t) = u_k\left( \frac{t}{\lambda} - 2k \right), \quad t \in [2k\lambda, (2k+1)\lambda],$$

or equivalently,

$$\sigma(\lambda t + 2k\lambda) = u_k(t), \quad t \in [0, 1]. \qquad (2.4)$$

In the second stage, we extend $\sigma$ to the intervals $((2k+1)\lambda, (2k+2)\lambda)$, $k = 0, 1, 2, \ldots$, and $(-\infty, 0)$, maintaining the $C^{\infty}$-smoothness property.
For any univariate function $g \in C[0, 1]$ and any $\varepsilon > 0$ there exists a polynomial $u_k$ with rational coefficients such that

$$|g(t) - u_k(t)| < \varepsilon$$

for all $t \in [0, 1]$. This together with (2.4) means that

$$|g(t) - \sigma(\lambda t + s_k)| < \varepsilon \qquad (2.5)$$

for some $s_k = 2k\lambda$ and all $t \in [0, 1]$.
Using a linear transformation, it is not difficult to go from $[0, 1]$ to any finite closed interval $[a, b]$. Indeed, let $\sigma$, $\lambda$ be constructed as above and $\varepsilon$ be an arbitrarily small positive number. The transformed function $\tilde{g}(t) = g(a + (b - a)t)$ is well defined on $[0, 1]$ and we can apply the inequality (2.5) to it. Now using the inverse transformation $t \mapsto (t - a)/(b - a)$, we can write that

$$|g(t) - \sigma(wt - \theta)| < \varepsilon \qquad (2.6)$$

for all $t \in [a, b]$, where $w = \lambda/(b - a)$ and $\theta = \lambda a/(b - a) - s_k$.
We define activation functions $\sigma$ as superactivation functions if they satisfy (2.6) for any $g \in C[a, b]$, any $\varepsilon > 0$, and some $w, \theta \in \mathbb{R}$. These functions demonstrate that shallow networks can approximate univariate continuous functions with the minimal number of hidden neurons; in fact, a single hidden neuron is sufficient. Similar activation functions $\sigma$, with the additional properties of monotonicity and sigmoidality, were algorithmically constructed in [5] and utilized in practical examples. It should be remarked that the existence of activation functions that ensure universal approximation for single and two hidden layer neural networks with a fixed number of hidden units was first established in [17].
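A simplified sketch of the construction above, under two explicit simplifications: only the first few polynomials of the (hypothetical initial segment of the) enumeration are stored, and the gaps between the plateaus are bridged by linear interpolation rather than the $C^{\infty}$ extension the text requires.

```python
import numpy as np
from fractions import Fraction

lam = 1.0
# Hypothetical initial segment of the enumeration of rational-coefficient
# polynomials on [0, 1]; coefficients are listed from low to high degree.
u = [
    [Fraction(0)],                                # u_0(t) = 0
    [Fraction(1)],                                # u_1(t) = 1
    [Fraction(0), Fraction(1)],                   # u_2(t) = t
    [Fraction(-1, 2), Fraction(0), Fraction(2)],  # u_3(t) = 2t^2 - 1/2
]

def poly(j, x):
    """Evaluate u_j at x (0.0 outside the stored initial segment)."""
    if not 0 <= j < len(u):
        return 0.0
    return float(sum(c * x ** p for p, c in enumerate(u[j])))

def sigma(t):
    """sigma(lam*(2k + s)) = u_k(s) for s in [0, 1]; the gaps use linear
    interpolation here, where the real construction is C-infinity."""
    k, s = divmod(t / lam, 2.0)
    k = int(k)
    if s <= 1.0:
        return poly(k, s)
    a, b = poly(k, 1.0), poly(k + 1, 0.0)
    return a + (s - 1.0) * (b - a)

# Single-neuron reproduction of u_3 on [0, 1]: sigma(lam*t + 6*lam) = u_3(t),
# i.e. the shift s_k = 2k*lam with k = 3 selects the stored polynomial.
ts = np.linspace(0.0, 1.0, 101)
err = max(abs(sigma(lam * t + 6 * lam) - (2 * t ** 2 - 0.5)) for t in ts)
```

The shift $s_k = 2k\lambda$ acts as an index into the enumeration, which is exactly the mechanism behind (2.4) and (2.5): choosing the threshold selects which polynomial, and hence which approximant, the single neuron reproduces.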
If in Theorem 2.1 we take $X = X_1 \times \cdots \times X_m$, $\mathcal{F} = \{\sum_{p=1}^{m} \phi_{pq}(x_p) : q = 1, \ldots, n\}$ and any superactivation function $\sigma$, then the number of terms $r$ will be $n = 2(d_1 + \cdots + d_m) + 1$. To see this, it is sufficient to repeat the proof, noting that the representation in Theorem 2.3 is exact and that each outer function $g_q$ is approximated by a single neuron $\sigma(w_q t - \theta_q)$. This observation leads to the following theorem.
Theorem 2.4. For $p = 1, \ldots, m$, let $X_p$ be a compact metric space of finite dimension $d_p$ and let $n = 2(d_1 + \cdots + d_m) + 1$. There exist universal continuous functions $\phi_{pq} : X_p \to [0, 1]$, $p = 1, \ldots, m$, $q = 1, \ldots, n$, and an infinitely differentiable function $\sigma : \mathbb{R} \to \mathbb{R}$ such that for every continuous function $f$ defined on $X_1 \times \cdots \times X_m$ and any $\varepsilon > 0$ there exist $w_q, \theta_q \in \mathbb{R}$, $q = 1, \ldots, n$, such that

$$\left| f(x_1, \ldots, x_m) - \sum_{q=1}^{n} \sigma\left( w_q \sum_{p=1}^{m} \phi_{pq}(x_p) - \theta_q \right) \right| < \varepsilon$$

for all $(x_1, \ldots, x_m) \in X_1 \times \cdots \times X_m$.
Note that in Theorem 2.4 the outer function $\sigma$ does not depend on $f$. The only parameters that depend on $f$ are the numbers $w_q, \theta_q$. The numbers $w_q$ can be taken to be equal and fixed once and for all. This is evident from the construction of $\sigma$ above (see (2.6), where $w$ is fixed for all $g$). For example, if we set $[a, b] = I$, where $I$ is a closed interval containing all the sets $\{\sum_{p=1}^{m} \phi_{pq}(x_p) : x_p \in X_p\}$, $q = 1, \ldots, n$, then $w_q$ can all be taken to be equal to $\lambda/(b - a)$.
References
[1] F. E. Benth, N. Detering and L. Galimberti, Neural networks in Fréchet spaces, Ann. Math. Artif. Intell. 91 (2023), 75-103.
[2] T. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Trans. Neural Netw. 6 (1995), no. 4, 911-917.
[3] L. Galimberti, Neural networks in non-metric spaces, arXiv preprint, arXiv:2406.09310 [math.FA], 2024.
[4] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016.
[5] N. J. Guliyev and V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Netw. 98 (2018), 296-304.
[6] N. J. Guliyev and V. E. Ismailov, Approximation capability of two hidden layer feedforward neural networks with fixed weights, Neurocomputing 316 (2018), 262-269.
[7] V. E. Ismailov, Ridge Functions and Applications in Neural Networks, Mathematical Surveys and Monographs, 263. American Mathematical Society, 2021.
[8] A. Ismayilova and V. E. Ismailov, On the Kolmogorov neural networks, Neural Netw. 176 (2024), Paper No. 106333.
[9] S. Ya. Khavinson, Best Approximation by Linear Superpositions (Approximate Nomography), Translated from the Russian manuscript by D. Khavinson. Translations of Mathematical Monographs, 159. American Mathematical Society, Providence, RI, 1997, 175 pp.
[10] A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition (Russian), Dokl. Akad. Nauk SSSR 114 (1957), 953-956.
[11] Y. Korolev, Two-layer neural networks with values in a Banach space, SIAM J. Math. Anal. 54 (2022), no. 6, 6358-6389.
[12] A. Kratsios, T. Furuya, A. Lara, M. Lassas and M. de Hoop, Mixture of experts soften the curse of dimensionality in operator learning, arXiv preprint, arXiv:2404.09101 [cs.LG], 2024.
[13] S. Lanthaler, S. Mishra and G. E. Karniadakis, Error estimates for DeepONets: a deep learning framework in infinite dimensions, Trans. Math. Appl. 6 (2022), no. 1, tnac001, 141 pp.
[14] M. Leshno, V. Ya. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993), 861-867.
[15] W. Light, Ridge functions, sigmoidal functions and neural networks, Approximation Theory VII (Austin, TX, 1992), 163-206, Academic Press, Boston, MA, 1993.
[16] L. Lu, P. Jin, G. Pang, Z. Zhang and G. E. Karniadakis, Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nat. Mach. Intell. 3 (2021), no. 3, 218-229.
[17] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing 25 (1999), 81-91.
[18] P. A. Ostrand, Dimension of metric spaces and Hilbert's problem 13, Bull. Amer. Math. Soc. 71 (1965), 619-622.
[19] P. Petersen and J. Zech, Mathematical theory of deep learning, arXiv preprint, arXiv:2407.18384 [cs.LG], 2024.
[20] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica 8 (1999), 143-195.
[21] S. Roman, Advanced Linear Algebra, Third edition. Graduate Texts in Mathematics, 135. Springer, New York, 2008.
[22] W. Rudin, Functional Analysis, Second edition. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, 1991, 424 pp.
[23] M. H. Stone, The generalized Weierstrass approximation theorem, Math. Mag. 21 (1948), 167-184, 237-254.
[24] X. Sun and E. W. Cheney, The fundamentality of sets of ridge functions, Aequationes Math. 44 (1992), no. 2-3, 226-235.
[25] M. E. Valle, W. L. Vital and G. Vieira, Universal approximation theorem for vector- and hypercomplex-valued neural networks, Neural Netw. 180 (2024), 106632.