On shallow feedforward neural networks with inputs from a
topological space
Vugar E. Ismailov
Institute of Mathematics and Mechanics, Baku, Azerbaijan
Center for Mathematics and its Applications, Khazar University, Baku,
Azerbaijan
e-mail: [email protected]
Abstract. We study feedforward neural networks with inputs from a
topological space (TFNNs). We prove a universal approximation theorem for
shallow TFNNs, which demonstrates their capacity to approximate any
continuous function defined on this topological space. As an application, we
obtain an approximative version of Kolmogorov’s superposition theorem for
compact metric spaces.
Mathematics Subject Classifications: 41A30, 41A65, 68T05
Keywords: feedforward neural network, universal approximation
theorem, density, topological vector space, Tauber-Wiener function,
Kolmogorov’s superposition theorem
Neural networks are fundamental to contemporary machine learning and
artificial intelligence, providing robust methods for tackling intricate
challenges. Among the different neural network designs, the multilayer feedforward perceptron (MLP) is particularly prominent and
essential. The MLP is valued for its capability to model complex, nonlinear
functions and execute various tasks, including classification, regression,
and pattern recognition.
This architecture consists of a limited number of sequential layers: an
input layer at the beginning, an output layer at the end, and several hidden
layers in between. Information progresses from the input layer through the
hidden layers to the output layer. In this framework, each neuron in a layer
receives inputs from the previous layer, applies specific weights, adds a
bias, and then processes the result through an activation function. This
activation function introduces non-linearity, allowing the model to learn
and capture intricate patterns. The output from one layer’s neurons serves
as the input for the neurons in the next layer, continuing this sequence
until the final output is generated by the output layer.
The most basic form of an MLP features just one hidden layer. In this setup, each output neuron calculates a function expressed as

$$\sum_{i=1}^{r} c_i \sigma(\mathbf{w}^i \cdot \mathbf{x} - \theta_i), \qquad (1.1)$$

where $\mathbf{x} = (x_1, \ldots, x_d) \in \mathbb{R}^d$ represents the input vector, $r$ is the number of neurons in the hidden layer, $\mathbf{w}^i$ are weight vectors in $\mathbb{R}^d$, $\theta_i$ are thresholds, $c_i$ are coefficients, and $\sigma$ is the activation function, a real univariate function.
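As a concrete illustration, formula (1.1) can be sketched in a few lines of NumPy; the sizes, the tanh activation and all parameter values below are arbitrary choices for the example, not part of the text.

```python
import numpy as np

def shallow_mlp(x, W, theta, c, sigma=np.tanh):
    """Evaluate sum_i c_i * sigma(w^i . x - theta_i) at the input x.

    x     : (d,) input vector
    W     : (r, d) matrix whose rows are the weight vectors w^i
    theta : (r,) thresholds
    c     : (r,) outer coefficients
    """
    return c @ sigma(W @ x - theta)

rng = np.random.default_rng(0)
d, r = 3, 5
x = rng.standard_normal(d)
W = rng.standard_normal((r, d))
theta = rng.standard_normal(r)
c = rng.standard_normal(r)
y = shallow_mlp(x, W, theta, c)  # a single real output
```

Note that the output is linear in the coefficients $c_i$, which is what makes the least-squares fits used later in such illustrations possible.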
The theoretical underpinning of neural networks is rooted in the universal approximation property (UAP), sometimes referred to as the density property. This principle states that a neural network with a single hidden layer can approximate any continuous function over a compact domain to any desired level of precision. Specifically, the set $\mathcal{M}(\sigma) = \operatorname{span}\{\sigma(\mathbf{w} \cdot \mathbf{x} - \theta) : \mathbf{w} \in \mathbb{R}^d, \ \theta \in \mathbb{R}\}$, which comprises functions defined in the format of equation (1.1), is dense in $C(Q)$ for every compact set $Q \subset \mathbb{R}^d$. Here $C(Q)$ represents the space of real-valued continuous functions on $Q$. This important result in neural network theory is known as the universal approximation theorem (UAT).
Extensive research has investigated the UAT across various activation functions $\sigma$, examining how different choices influence the approximation capabilities of neural networks. The most general result in this area was obtained by Leshno, Lin, Pinkus and Schocken [14]. They proved that a continuous activation function $\sigma$ possesses the UAP if and only if it is not a polynomial. This result demonstrates the effectiveness of the single hidden layer perceptron model across a wide range of activation functions $\sigma$. It should be noted that the universal approximation theorem in [14] was shown to apply to a broader class of activation functions beyond just continuous ones, including activation functions that may have discontinuities on sets of Lebesgue measure zero. However, this paper will specifically concentrate on continuous activation functions. For a thorough, step-by-step proof of this theorem, refer to [19, 20].
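The dichotomy in [14] can be observed numerically: with a polynomial activation such as $t^2$, the span of $\sigma(wx - \theta)$ collapses to polynomials of the same bounded degree, while a non-polynomial activation (here ReLU) suffers no such obstruction. The grids and the target below are arbitrary choices for the example.

```python
import numpy as np

x = np.linspace(-1, 1, 400)
target = np.sin(4 * x)

def fit_max_err(sigma):
    """Best least-squares fit of `target` from span{sigma(w*x - theta)}
    over a fixed grid of weights w and thresholds theta; returns the
    maximum absolute error of the fit on the grid."""
    w = np.repeat(np.linspace(0.5, 8.0, 12), 11)
    th = np.tile(np.linspace(-8.0, 8.0, 11), 12)
    A = sigma(np.outer(x, w) - th)
    c, *_ = np.linalg.lstsq(A, target, rcond=None)
    return float(np.max(np.abs(A @ c - target)))

relu_err = fit_max_err(lambda t: np.maximum(t, 0.0))  # non-polynomial
quad_err = fit_max_err(lambda t: t ** 2)  # span is {1, x, x^2} only
```

The quadratic activation cannot track the oscillating target, no matter how many hidden units are added, while the ReLU fit succeeds on the same grid of parameters.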
In the past, it was commonly accepted and highlighted in numerous studies that attaining the universal approximation property necessitates large networks with a substantial number of hidden neurons (see, e.g., [4, Chapter 6.4.1]). In the above-mentioned earlier works, the number of hidden neurons was regarded as unbounded. However, more recent research [5, 6, 7] has demonstrated that neural networks using certain non-explicit but practically computable activation functions can approximate any continuous function over any compact set to any desired level of accuracy, even with a minimal and fixed number of hidden neurons.
Note that the inner product $\mathbf{w} \cdot \mathbf{x}$ in (1.1) represents a linear continuous functional on $\mathbb{R}^d$. Conversely, by the Riesz representation theorem, every linear functional on $\mathbb{R}^d$ is of the form $f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}$, where $\mathbf{w} \in \mathbb{R}^d$ and $\mathbf{x}$ is the variable (see [21, Theorem 13.32]). Linear continuous functionals constitute a significant subclass in $C(\mathbb{R}^d)$, denoted here by $\mathcal{L}$. Thus, the UAT asserts that for certain activation functions $\sigma$ and any compact set $Q \subset \mathbb{R}^d$, the set

$$\operatorname{span}\{\sigma(f(\mathbf{x}) - \theta) : f \in \mathcal{L}, \ \theta \in \mathbb{R}\}$$

is dense in $C(Q)$. This observation suggests the following generalization of single hidden layer networks from $\mathbb{R}^d$ to any topological space $X$, where $\mathcal{L}$ is replaced with a fixed family $\mathcal{F}$ of functions (which need not be linear) from $C(X)$. We refer to such a family as a basic family for the feedforward neural networks with inputs from a topological space (TFNNs). If $\mathcal{F} \subseteq C(X)$ is a basic family, then the architecture of a single hidden layer TFNN can be described as follows:
• Input layer: This layer consists of an element $x \in X$, where $X$ is an arbitrary topological space.

• Hidden layer: Each neuron in the hidden layer takes the input $x$ from the input layer and applies a function $f \in \mathcal{F}$ to $x$. This value is then multiplied by a weight $w$. A shift $-\theta$ and then a fixed activation function $\sigma : \mathbb{R} \to \mathbb{R}$ are applied to $w f(x)$. The resulting value $\sigma(w f(x) - \theta)$ represents the output signal of the neuron.

• Output layer: Each neuron in this layer receives weighted signals from each neuron in the hidden layer, sums them up, and produces the final output value.
This architecture significantly extends the traditional feedforward neural networks. When $X = \mathbb{R}^d$ and $\mathcal{F}$ is the family of linear functionals $\mathbf{x} \mapsto \mathbf{w} \cdot \mathbf{x}$, the input $x$ represents a $d$-dimensional vector $(x_1, \ldots, x_d)$. In this very special case, the input layer contains $d$ traditional neurons, each receiving an input signal $x_1, \ldots, x_d$, respectively. Note that in the aforementioned architecture, the element $x$ carries all the information of the input layer. This structure enables the network to accommodate a wide diversity of input types. In general, a single hidden layer TFNN computes a function of the form

$$\sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i), \qquad (1.2)$$

where $x \in X$ is the input, $c_i, w_i, \theta_i \in \mathbb{R}$ and $f_i \in \mathcal{F}$ are the parameters of the network, and $\sigma$ is a fixed activation function.
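The forward pass just described can be sketched as follows; the input space (finite planar point sets, which form no vector space) and the three summary functions playing the role of the basic family are hypothetical choices made only for this example.

```python
import numpy as np

def tfnn(x, basic_family, w, theta, c, sigma=np.tanh):
    """Single-hidden-layer TFNN: sum_i c_i * sigma(w_i * f_i(x) - theta_i),
    where each f_i is drawn from a fixed basic family of real-valued
    functions on the input space X."""
    feats = np.array([f(x) for f in basic_family])
    return c @ sigma(w * feats - theta)

# Hypothetical basic family on X = finite point sets in the plane:
basic_family = [
    lambda pts: float(len(pts)),                                # cardinality
    lambda pts: float(np.mean([p[0] for p in pts])),            # mean x-coord
    lambda pts: float(max(np.hypot(p[0], p[1]) for p in pts)),  # max radius
]

pts = [(0.0, 1.0), (2.0, 2.0), (-1.0, 0.5)]
w = np.ones(3)
theta = np.zeros(3)
c = np.array([1.0, -1.0, 0.5])
out = tfnn(pts, basic_family, w, theta, c)
```

Here one hidden neuron is attached to each member of the family; in general several neurons may reuse the same $f_i$ with different weights and thresholds, exactly as in (1.2).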
The aim of this paper is to show that for a broad class of activation functions $\sigma$, neural networks of the form presented in (1.2) can approximate any continuous function on a compact subset $Q \subset X$ with arbitrary precision. In other words, the set

$$\operatorname{span}\{\sigma(w f(x) - \theta) : f \in \mathcal{F}, \ w, \theta \in \mathbb{R}\}$$

is dense in $C(Q)$ for every compact set $Q \subset X$. As an application of this result, we will derive an approximative version of the Kolmogorov superposition theorem (KST) for compact metric spaces, where outer (non-fixed and generally nonsmooth) functions are substituted with a fixed, infinitely differentiable function.
It should be noted that the UAP of neural networks operating between Banach spaces has been explored in various studies. For example, in [24], the fundamentality of ridge functions was established in a Banach space and subsequently applied to shallow networks with a sigmoidal activation function (see also [15]). In [2], the authors showed that any continuous nonlinear function mapping a compact set $V$ in a Banach space $C(K)$ of continuous functions into $\mathbb{R}$ can be approximated arbitrarily well by shallow feedforward neural networks. Here $V$ and $K$ represent two compact sets in an abstract Banach space and the Euclidean space $\mathbb{R}^d$, respectively. In [16], this approach was extended to deep neural networks and referred to as DeepONet. In [13], the authors examined DeepONet within the context of an encoder-decoder network framework, investigating its approximation properties when the input space is a Hilbert space. In [11], quantitative estimates (i.e., convergence rates) for the approximation of nonlinear operators using single hidden layer networks in infinite-dimensional Banach spaces were provided, extending some previous results from the finite-dimensional case.
The UAP of infinite-dimensional neural networks, with inputs from Fréchet spaces and outputs from Banach spaces, was established in [1].
In [3], the scope of this architecture was extended by proving
several universal approximation theorems for quasi-Polish input and output
spaces.
In [12], universal approximation theorems were obtained for neural operators (NOs) and mixtures of neural operators (MoNOs) acting between Sobolev spaces. More precisely, it was shown that any non-linear continuous operator acting between Sobolev spaces $H^{s_1}$ and $H^{s_2}$ can be uniformly approximated over any compact set $K \subset H^{s_1}$ with arbitrary accuracy $\varepsilon > 0$ using NOs and MoNOs. Moreover, the quantitative results of [12] estimate the depth, width, and rank of the neural operators in terms of the radius of $K$ and the accuracy $\varepsilon$.
Recent research has demonstrated the universal approximation theorem (UAT)
for various hypercomplex-valued neural networks, including complex-,
quaternion-, tessarine-, and Clifford-valued networks, as well as more
general vector-valued neural networks (V-nets) defined over a
finite-dimensional algebra (see [25] and references therein). We
hope that the results of this paper will stimulate further exploration of
these neural networks, particularly with outputs from these and other
general spaces.
In this section, we analyze the conditions under which shallow networks with
inputs from a topological space possess the universal approximation property.
Assume $X$ is an arbitrary topological space. In the sequel, in $C(X)$, we will use the topology of uniform convergence on compact sets. This topology is induced by the seminorms

$$p_Q(g) = \max_{x \in Q} |g(x)|,$$

where $Q$ are compact sets in $X$. A subbasis at the origin for this topology is given by the sets

$$V(Q, \varepsilon) = \{g \in C(X) : p_Q(g) < \varepsilon\},$$

where $Q \subset X$ is compact and $\varepsilon > 0$. A sequence (or net) $\{g_n\}$ in this topology converges to $g$ iff $p_Q(g_n - g) \to 0$ for every compact set $Q \subset X$. Thus, in what follows, when we say that a set $\mathcal{D} \subseteq C(X)$ is dense in $C(X)$, we will mean that $\mathcal{D}$ is dense with respect to the aforementioned topology of uniform convergence on compact sets.
We say that a subclass $\mathcal{F} \subseteq C(X)$ has the $\mathcal{D}$-property if the set

$$\mathcal{D}(\mathcal{F}) = \operatorname{span}\{g \circ f : f \in \mathcal{F}, \ g \in C(\mathbb{R})\} \qquad (2.1)$$

is dense in $C(X)$.
In what follows, we will use activation functions $\sigma$ (whether continuous or discontinuous) with the property that the set $\operatorname{span}\{\sigma(wt - \theta) : w, \theta \in \mathbb{R}\}$ is dense in every $C[a, b]$. Such functions are called Tauber-Wiener (TW) functions (see [2]).
Theorem 2.1. Assume $X$ is a topological space, $\mathcal{F}$ is a subclass of $C(X)$ with the $\mathcal{D}$-property and $\sigma$ is a TW function. Then for any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$ there exist $c_i, w_i, \theta_i \in \mathbb{R}$, $f_i \in \mathcal{F}$, $i = 1, \ldots, r$, such that

$$\left| g(x) - \sum_{i=1}^{r} c_i \sigma(w_i f_i(x) - \theta_i) \right| < \varepsilon \quad \text{for all } x \in Q.$$

That is, the set of TFNNs with inputs from $X$ is dense in $C(Q)$.
Proof. Take any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$. Since $\mathcal{F}$ has the $\mathcal{D}$-property, there exist finitely many functions $f_1, \ldots, f_k \in \mathcal{F}$ and $g_1, \ldots, g_k \in C(\mathbb{R})$ such that

$$\left| g(x) - \sum_{i=1}^{k} g_i(f_i(x)) \right| < \frac{\varepsilon}{2} \qquad (2.2)$$

for all $x \in Q$.

Since $f_1, \ldots, f_k$ are continuous, the images $f_i(Q)$ are compact sets in $\mathbb{R}$. Set $Y = \bigcup_{i=1}^{k} f_i(Q)$. Note that $Y$ is also compact. Since $\sigma$ is a TW function, each continuous univariate function $g_i$, $i = 1, \ldots, k$, can be approximated on an interval $[a, b] \supseteq Y$ by single hidden layer networks with the activation function $\sigma$. Thus, there exist coefficients $c_{ij}, w_{ij}, \theta_{ij} \in \mathbb{R}$, $i = 1, \ldots, k$, $j = 1, \ldots, m_i$, such that

$$\left| g_i(t) - \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} t - \theta_{ij}) \right| < \frac{\varepsilon}{2k} \qquad (2.3)$$

for all $t \in Y$. Therefore,

$$\left| g_i(f_i(x)) - \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} f_i(x) - \theta_{ij}) \right| < \frac{\varepsilon}{2k}$$

for each $i = 1, \ldots, k$, and all $x \in Q$. It follows from (2.2) and (2.3) that

$$\left| g(x) - \sum_{i=1}^{k} \sum_{j=1}^{m_i} c_{ij} \sigma(w_{ij} f_i(x) - \theta_{ij}) \right| < \varepsilon$$

for any $x \in Q$. This completes the proof of Theorem 2.1.
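The two-stage argument of the proof can be illustrated numerically under assumed choices: $X$ is the unit circle, $Q$ its upper half, and the basic family is the single coordinate function $f(x_1, x_2) = x_1$, which is injective on $Q$; stage two fits the univariate outer function by a one-dimensional tanh network via least squares.

```python
import numpy as np

# Stage 1: Q = upper half of the unit circle, basic family F = {f},
# f(x1, x2) = x1.  Since f is injective on Q, every continuous G on Q
# equals g(f(x)) for a continuous univariate g (D-property realized
# exactly, so the first-stage error is zero in this toy case).
angles = np.linspace(0.0, np.pi, 400)
Q = np.stack([np.cos(angles), np.sin(angles)], axis=1)
G = np.exp(Q[:, 0])          # target G(x) = exp(x1); here g(t) = exp(t)
t = Q[:, 0]                  # the values f(x) for x in Q; f(Q) = [-1, 1]

# Stage 2: approximate the outer function g by a univariate tanh network
# sum_j c_j tanh(w_j t - th_j), fitted by least squares on a (w, th) grid.
w = np.repeat(np.linspace(0.5, 6.0, 12), 9)
th = np.tile(np.linspace(-6.0, 6.0, 9), 12)
A = np.tanh(np.outer(t, w) - th)
c, *_ = np.linalg.lstsq(A, G, rcond=None)

# Composed TFNN sum_j c_j tanh(w_j f(x) - th_j), evaluated on Q.
approx = A @ c
err = np.max(np.abs(G - approx))
```

The composed network approximates $G$ uniformly on $Q$ even though $Q$ is a curve, not a Euclidean domain, mirroring how (2.2) and (2.3) combine in the proof.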
Remark. Theorem 2.1 generalizes existing universal approximation theorems for traditional feedforward neural networks. This is because in traditional networks, the space of linear continuous functionals on $\mathbb{R}^d$ serves as the basic family $\mathcal{F}$, which clearly satisfies the $\mathcal{D}$-property.

Note that, in particular, $X$ may be a topological vector space. For such a space, $X^*$ denotes the continuous dual of $X$, which is the space of linear continuous functionals defined on $X$. The following theorem is based on Theorem 2.1.
Theorem 2.2. Assume $X$ is a locally convex topological vector space (in particular, a normed space) and $\sigma$ is a continuous univariate function that is not a polynomial. Then for any $\varepsilon > 0$, any compact set $Q \subset X$ and any function $g \in C(Q)$ there exist $c_i, \theta_i \in \mathbb{R}$, $f_i \in X^*$, $i = 1, \ldots, r$, such that

$$\left| g(x) - \sum_{i=1}^{r} c_i \sigma(f_i(x) - \theta_i) \right| < \varepsilon \quad \text{for all } x \in Q.$$
The proof of this theorem relies on Theorem 2.1 and the following two facts.
Fact 1. The space $X^*$ possesses the $\mathcal{D}$-property.

Let us prove this fact. Specifically, this property holds if in (2.1), instead of all $g \in C(\mathbb{R})$, we take the single function $g(t) = e^t$. That is, we claim that the set

$$\mathcal{D}_0 = \operatorname{span}\{e^{f(x)} : f \in X^*\}$$

is dense in $C(Q)$ for every compact set $Q \subset X$.

Indeed, first it is not difficult to see that $\mathcal{D}_0$ is a subalgebra of $C(X)$. To see this, note that for any $f_1, f_2 \in X^*$

$$e^{f_1(x)} \cdot e^{f_2(x)} = e^{(f_1 + f_2)(x)},$$

since $f_1 + f_2 \in X^*$. Therefore, the linear space $\mathcal{D}_0$ is closed under multiplication, indicating that $\mathcal{D}_0$ is an algebra. Second, if $f$ is the zero functional, then $e^{f(x)} = 1$, showing that $\mathcal{D}_0$ contains all constant functions.

Now since $X$ is locally convex, the Hahn-Banach extension theorem holds. It is a consequence of this theorem that for any distinct points $x_1, x_2 \in X$ there exists a functional $f \in X^*$ such that $f(x_1) \neq f(x_2)$ (see [22, Theorem 3.6]). Hence, the algebra $\mathcal{D}_0$ separates points in $X$. By the Stone-Weierstrass theorem [23], for any compact $Q \subset X$, the algebra $\mathcal{D}_0$ restricted to $Q$ is dense in $C(Q)$. In other words, the space $X^*$ has the $\mathcal{D}$-property.
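The Stone-Weierstrass device behind Fact 1 can be checked numerically in the special case $X = \mathbb{R}^2$, where the continuous linear functionals are $x \mapsto \mathbf{w} \cdot x$; the target function and the random draw of functionals below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Compact set Q: a grid in [0, 1]^2.  Continuous linear functionals on
# R^2 are x -> w . x, so the algebra of Fact 1 is spanned by exp(w . x).
g1, g2 = np.meshgrid(np.linspace(0, 1, 25), np.linspace(0, 1, 25))
Q = np.stack([g1.ravel(), g2.ravel()], axis=1)

# Closure under products: exp(w1 . x) * exp(w2 . x) = exp((w1 + w2) . x),
# which is again a generator, so the span is an algebra.
w1, w2 = np.array([0.3, -1.0]), np.array([0.7, 0.4])
closure_gap = np.max(np.abs(np.exp(Q @ w1) * np.exp(Q @ w2)
                            - np.exp(Q @ (w1 + w2))))

# Density in practice: a least-squares fit of an arbitrary target from
# span{exp(w . x)} with 60 randomly drawn functionals w.
target = np.sin(3 * Q[:, 0]) * Q[:, 1]
W = rng.uniform(-3, 3, size=(60, 2))
A = np.exp(Q @ W.T)
coef, *_ = np.linalg.lstsq(A, target, rcond=None)
fit_err = np.max(np.abs(A @ coef - target))
```

The product identity holds up to round-off, confirming the algebra structure, and the fit error is far below that of any constant approximant, as Stone-Weierstrass predicts.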
Fact 2. A continuous nonpolynomial activation function $\sigma$ is a TW function.

This fact follows from the main result of [14] that a continuous nonpolynomial activation function provides the universal approximation property for traditional single hidden layer networks.
Let us now apply Theorem 2.1 to derive an approximative version of the renowned Kolmogorov superposition theorem (KST) for compact metric spaces. KST [10] states that for the unit cube $[0, 1]^d$, $d \geq 2$, there exist functions of the form

$$s_q(x_1, \ldots, x_d) = \sum_{p=1}^{d} \phi_{pq}(x_p), \quad q = 1, \ldots, 2d + 1,$$

with continuous univariate functions $\phi_{pq}$, such that each function $f \in C([0, 1]^d)$ admits the representation

$$f(x_1, \ldots, x_d) = \sum_{q=1}^{2d+1} g_q(s_q(x_1, \ldots, x_d)),$$

where $g_q$ are continuous univariate functions depending on $f$. This surprising and deep result, which solved (negatively) Hilbert's 13th problem, has been improved and generalized in several directions. For detailed information about KST, including its refinements, various variants, and generalizations, see the monographs [9, Chapter 1] and [7, Chapter 4]. The relevance of KST to neural networks, along with its theoretical and computational aspects, has been extensively discussed in the neural network literature (see, e.g., [8] and references therein).
Ostrand [18] extended KST to general compact metric spaces as follows.

Theorem 2.3 (Ostrand [18]). For $p = 1, \ldots, m$, let $X_p$ be a compact metric space of finite dimension $d_p$ and let $n = 2(d_1 + \cdots + d_m) + 1$. There exist universal continuous functions $\phi_{pq} : X_p \to [0, 1]$, $p = 1, \ldots, m$, $q = 1, \ldots, n$, such that every continuous function $f$ defined on $X_1 \times \cdots \times X_m$ is representable in the form

$$f(x_1, \ldots, x_m) = \sum_{q=1}^{n} g_q\left( \sum_{p=1}^{m} \phi_{pq}(x_p) \right),$$

where $g_q$ are continuous functions depending on $f$.
It follows from this theorem that for the metric space $X = X_1 \times \cdots \times X_m$ the family of functions

$$\mathcal{F} = \left\{ \sum_{p=1}^{m} \phi_{pq}(x_p) : q = 1, \ldots, n \right\}$$

satisfies the $\mathcal{D}$-property in $C(X)$.
Let now $\sigma$ be a specific infinitely differentiable TW function with the property that, for any interval $[a, b]$, the set $\{\sigma(wt - \theta) : w, \theta \in \mathbb{R}\}$ is dense in $C[a, b]$. Note that this set is not the linear span of the functions $\sigma(wt - \theta)$, but rather a very narrow subclass of it. Such functions $\sigma$ do indeed exist.

To show this, let $\lambda$ be any positive real number. Divide the interval $[0, +\infty)$ into the segments $[0, \lambda], [\lambda, 2\lambda], [2\lambda, 3\lambda], \ldots$. Let $\{u_k\}_{k=0}^{\infty}$ be the sequence of polynomials with rational coefficients defined on $[0, 1]$. We construct $\sigma$ in two stages. In the first stage, we define $\sigma$ on the closed intervals $[2k\lambda, (2k+1)\lambda]$, $k = 0, 1, 2, \ldots$, as the function

$$\sigma(t) = u_k\left( \frac{t}{\lambda} - 2k \right), \quad t \in [2k\lambda, (2k+1)\lambda],$$

or equivalently,

$$\sigma(\lambda t + 2k\lambda) = u_k(t), \quad t \in [0, 1]. \qquad (2.4)$$

In the second stage, we extend $\sigma$ to the intervals $((2k+1)\lambda, (2k+2)\lambda)$, $k = 0, 1, 2, \ldots$, and $(-\infty, 0)$, maintaining the $C^{\infty}$-smoothness property.
For any univariate function $g \in C[0, 1]$ and any $\varepsilon > 0$ there exists a polynomial $u_k$ with rational coefficients such that

$$|g(t) - u_k(t)| < \varepsilon$$

for all $t \in [0, 1]$. This together with (2.4) means that

$$|g(t) - \sigma(\lambda t + s_k)| < \varepsilon \qquad (2.5)$$

for some $s_k = 2k\lambda$ and all $t \in [0, 1]$.
Using a linear transformation, it is not difficult to go from $[0, 1]$ to any finite closed interval $[a, b]$. Indeed, let $\sigma$, $\lambda$ be constructed as above and $\varepsilon$ be an arbitrarily small positive number. The transformed function $\tilde{g}(t) = g(a + (b - a)t)$ is well defined on $[0, 1]$ and we can apply the inequality (2.5) to it. Now using the inverse transformation $t \mapsto (t - a)/(b - a)$, we can write that

$$|g(t) - \sigma(wt - \theta)| < \varepsilon \qquad (2.6)$$

for all $t \in [a, b]$, where $w = \lambda/(b - a)$ and $\theta = \lambda a/(b - a) - s_k$.
We define activation functions $\sigma$ as superactivation functions if they satisfy (2.6) for any $g \in C[a, b]$, any $\varepsilon > 0$, and some $w, \theta \in \mathbb{R}$. These functions demonstrate that shallow networks can approximate univariate continuous functions with the minimal number of hidden neurons; in fact, a single hidden neuron is sufficient. Similar activation functions $\sigma$, with the additional properties of monotonicity and sigmoidality, were algorithmically constructed in [5] and utilized in practical examples. It should be remarked that the existence of activation functions that ensure universal approximation for single and two hidden layer neural networks with a fixed number of hidden units was first established in [17].
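A simplified sketch of the construction above, under two explicit simplifications: only the first few polynomials of the (hypothetical initial segment of the) enumeration are stored, and the gaps between the plateaus are bridged by linear interpolation rather than the $C^{\infty}$ extension the text requires.

```python
import numpy as np
from fractions import Fraction

lam = 1.0
# Hypothetical initial segment of the enumeration of rational-coefficient
# polynomials on [0, 1]; coefficients are listed from low to high degree.
u = [
    [Fraction(0)],                                # u_0(t) = 0
    [Fraction(1)],                                # u_1(t) = 1
    [Fraction(0), Fraction(1)],                   # u_2(t) = t
    [Fraction(-1, 2), Fraction(0), Fraction(2)],  # u_3(t) = 2t^2 - 1/2
]

def poly(j, x):
    """Evaluate u_j at x (0.0 outside the stored initial segment)."""
    if not 0 <= j < len(u):
        return 0.0
    return float(sum(c * x ** p for p, c in enumerate(u[j])))

def sigma(t):
    """sigma(lam*(2k + s)) = u_k(s) for s in [0, 1]; the gaps use linear
    interpolation here, where the real construction is C-infinity."""
    k, s = divmod(t / lam, 2.0)
    k = int(k)
    if s <= 1.0:
        return poly(k, s)
    a, b = poly(k, 1.0), poly(k + 1, 0.0)
    return a + (s - 1.0) * (b - a)

# Single-neuron reproduction of u_3 on [0, 1]: sigma(lam*t + 6*lam) = u_3(t),
# i.e. the shift s_k = 2k*lam with k = 3 selects the stored polynomial.
ts = np.linspace(0.0, 1.0, 101)
err = max(abs(sigma(lam * t + 6 * lam) - (2 * t ** 2 - 0.5)) for t in ts)
```

The shift $s_k = 2k\lambda$ acts as an index into the enumeration, which is exactly the mechanism behind (2.4) and (2.5): choosing the threshold selects which polynomial, and hence which approximant, the single neuron reproduces.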
If in Theorem 2.1 we take $X = X_1 \times \cdots \times X_m$, $\mathcal{F} = \{\sum_{p=1}^{m} \phi_{pq}(x_p) : q = 1, \ldots, n\}$ and any superactivation function $\sigma$, then the number of terms $r$ will be $n = 2(d_1 + \cdots + d_m) + 1$. To see this, it is sufficient to repeat the proof, noting that the representation in Theorem 2.3 is exact and that each outer function $g_q$ is approximated by a single neuron $\sigma(w_q t - \theta_q)$. This observation leads to the following theorem.
Theorem 2.4. For $p = 1, \ldots, m$, let $X_p$ be a compact metric space of finite dimension $d_p$ and let $n = 2(d_1 + \cdots + d_m) + 1$. There exist universal continuous functions $\phi_{pq} : X_p \to [0, 1]$, $p = 1, \ldots, m$, $q = 1, \ldots, n$, and an infinitely differentiable function $\sigma : \mathbb{R} \to \mathbb{R}$ such that for every continuous function $f$ defined on $X_1 \times \cdots \times X_m$ and any $\varepsilon > 0$ there exist $w_q, \theta_q \in \mathbb{R}$, $q = 1, \ldots, n$, such that

$$\left| f(x_1, \ldots, x_m) - \sum_{q=1}^{n} \sigma\left( w_q \sum_{p=1}^{m} \phi_{pq}(x_p) - \theta_q \right) \right| < \varepsilon$$

for all $(x_1, \ldots, x_m) \in X_1 \times \cdots \times X_m$.
Note that in Theorem 2.4 the outer function $\sigma$ does not depend on $f$. The only parameters that depend on $f$ are the numbers $w_q, \theta_q$. The numbers $w_q$ can be taken to be equal and fixed once and for all. This is evident from the construction of $\sigma$ above (see (2.6), where $w$ is fixed for all $g$). For example, if we set $[a, b] = I$, where $I$ is a closed interval containing all the sets $\{\sum_{p=1}^{m} \phi_{pq}(x_p) : x_p \in X_p\}$, $q = 1, \ldots, n$, then $w_q$ can all be taken to be equal to $\lambda/(b - a)$.
References
[1] F. E. Benth, N. Detering and L. Galimberti, Neural networks in Fréchet spaces, Ann. Math. Artif. Intell. 91 (2023), 75-103.
[2] T. Chen and H. Chen, Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems, IEEE Trans. Neural Netw. 6 (1995), no. 4, 911-917.
[3] L. Galimberti, Neural networks in non-metric spaces, arXiv preprint, arXiv:2406.09310 [math.FA], 2024.
[4] I. Goodfellow, Y. Bengio and A. Courville, Deep Learning, MIT Press, Cambridge, MA, 2016.
[5] N. J. Guliyev and V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Netw. 98 (2018), 296-304.
[6] N. J. Guliyev and V. E. Ismailov, Approximation capability of two hidden layer feedforward neural networks with fixed weights, Neurocomputing 316 (2018), 262-269.
[7] V. E. Ismailov, Ridge Functions and Applications in Neural Networks, Mathematical Surveys and Monographs, 263. American Mathematical Society, 2021.
[8] A. Ismayilova and V. E. Ismailov, On the Kolmogorov neural networks, Neural Netw. 176 (2024), Paper No. 106333.
[9] S. Ya. Khavinson, Best Approximation by Linear Superpositions (Approximate Nomography), Translated from the Russian manuscript by D. Khavinson. Translations of Mathematical Monographs, 159. American Mathematical Society, Providence, RI, 1997, 175 pp.
[10] A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition (Russian), Dokl. Akad. Nauk SSSR 114 (1957), 953-956.
[11] Y. Korolev, Two-layer neural networks with values in a Banach space, SIAM J. Math. Anal. 54 (2022), no. 6, 6358-6389.
[12] A. Kratsios, T. Furuya, A. Lara, M. Lassas and M. de Hoop, Mixture of experts soften the curse of dimensionality in operator learning, arXiv preprint, arXiv:2404.09101 [cs.LG], 2024.
[13] S. Lanthaler, S. Mishra and G. E. Karniadakis, Error estimates for DeepONets: a deep learning framework in infinite dimensions, Trans. Math. Appl. 6 (2022), no. 1, tnac001, 141 pp.
[14] M. Leshno, V. Ya. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Netw. 6 (1993), 861-867.
[15] W. Light, Ridge functions, sigmoidal functions and neural networks, Approximation Theory VII (Austin, TX, 1992), 163-206, Academic Press, Boston, MA, 1993.
[16] L. Lu, P. Jin, G. Pang, Z. Zhang and G. E. Karniadakis, Learning nonlinear operators via DeepONet based on the universal approximation theorem of operators, Nat. Mach. Intell. 3 (2021), no. 3, 218-229.
[17] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing 25 (1999), 81-91.
[18] P. A. Ostrand, Dimension of metric spaces and Hilbert's problem 13, Bull. Amer. Math. Soc. 71 (1965), 619-622.
[19] P. Petersen and J. Zech, Mathematical theory of deep learning, arXiv preprint, arXiv:2407.18384 [cs.LG], 2024.
[20] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica 8 (1999), 143-195.
[21] S. Roman, Advanced Linear Algebra, Third edition. Graduate Texts in Mathematics, 135. Springer, New York, 2008.
[22] W. Rudin, Functional Analysis, Second edition. International Series in Pure and Applied Mathematics. McGraw-Hill, Inc., New York, 1991, 424 pp.
[23] M. H. Stone, The generalized Weierstrass approximation theorem, Math. Mag. 21 (1948), 167-184, 237-254.
[24] X. Sun and E. W. Cheney, The fundamentality of sets of ridge functions, Aequationes Math. 44 (1992), no. 2-3, 226-235.
[25] M. E. Valle, W. L. Vital and G. Vieira, Universal approximation theorem for vector- and hypercomplex-valued neural networks, Neural Netw. 180 (2024), 106632.