License: CC BY 4.0
arXiv:2404.00408v1 [cs.LG] 30 Mar 2024

Deep Learning with Parametric Lenses

Geoffrey S. H. Cruttwell (Mount Allison University, Canada), Bruno Gavranović (University of Strathclyde, United Kingdom), Neil Ghani (University of Strathclyde, United Kingdom), Paul Wilson (Independent, United Kingdom), and Fabio Zanasi (University College London, United Kingdom, and University of Bologna, Italy)
Abstract.

We propose a categorical semantics for machine learning algorithms in terms of lenses, parametric maps, and reverse derivative categories. This foundation provides a powerful explanatory and unifying framework: it encompasses a variety of gradient descent algorithms such as ADAM, AdaGrad, and Nesterov momentum, as well as a variety of loss functions such as MSE and Softmax cross-entropy, and different architectures, shedding new light on their similarities and differences. Furthermore, our approach to learning has examples generalising beyond the familiar continuous domains (modelled in categories of smooth maps) and can be realised in the discrete setting of Boolean and polynomial circuits. We demonstrate the practical significance of our framework with an implementation in Python.

Key words and phrases:
Neural network, Deep Learning, String diagram, Symmetric Monoidal Category, Cartesian Differential Category

1. Introduction

The last decade has witnessed a surge of interest in machine learning, fuelled by the numerous successes and applications that these methodologies have found in many fields of science and technology. As machine learning techniques become increasingly pervasive, algorithms and models become more sophisticated, posing a significant challenge both to the software developers and to the users who need to interface with, execute and maintain these systems. In spite of this rapidly evolving picture, the formal analysis of many learning algorithms mostly takes place at a heuristic level [48], or using definitions that fail to provide a general and scalable framework for describing machine learning. Indeed, it is commonly acknowledged across academia, industry, policy makers and funding agencies that there is a pressing need for a unifying perspective, which can make this growing body of work more systematic, rigorous, transparent and accessible both for users and developers [41, 52].

Consider, for example, one of the most common machine learning scenarios: supervised learning with a neural network. This technique trains the model towards a certain task, e.g. the recognition of patterns in a data set (cf. Figure 1). There are several different ways of implementing this scenario. Typically, at their core, there is a gradient update algorithm (often called the “optimiser”), depending on a given loss function, which iteratively updates the parameters of the network, based on some learning rate controlling the “scaling” of the update. All of these components can vary independently in a supervised learning algorithm, and a number of choices are available for loss maps (quadratic error, Softmax cross entropy, dot product, etc.) and optimisers (Adagrad [23], Momentum [44], Adam [36], etc.).

Figure 1. An informal illustration of gradient-based learning. This neural network is trained to distinguish different kinds of animals in the input image. Given an input $X$, the network predicts an output $Y$, which is compared by a ‘loss map’ with what would be the correct answer (‘label’). The loss map returns a real value expressing the error of the prediction; this information, together with the learning rate (a weight controlling how much the model should be changed in response to error) is used by an optimiser, which computes by gradient-descent the update of the parameters of the network, with the aim of improving its accuracy. The neural network, the loss map, the optimiser and the learning rate are all components of a supervised learning system, and can vary independently of one another.

This scenario highlights several questions: is there a uniform mathematical language capturing the different components of the learning process? Can we develop a unifying picture of the various optimisation techniques, allowing for their comparative analysis? Moreover, it should be noted that supervised learning is not limited to neural networks. For example, supervised learning is surprisingly applicable to the discrete setting of boolean circuits [59] where continuous functions are replaced by boolean-valued functions. Can we identify an abstract perspective encompassing both the real-valued and the boolean case? In a nutshell, this paper seeks to answer the question:

what are the fundamental mathematical structures

underpinning gradient-based learning?

Our approach to this question stems from the identification of three fundamental aspects of the gradient-descent learning process:

  (I) computation is parametric, e.g. in the simplest case we are given a function $f : P \times X \to Y$ and learning consists of finding a parameter $p : P$ such that $f(p, -)$ is the best function according to some criteria. Specifically, the weights on the internal nodes of a neural network are a parameter which the learning is seeking to optimise. Parameters also arise elsewhere, e.g. in the loss function (see later).

  (II) information flows bidirectionally: in the forward direction, the computation turns inputs via a sequence of layers into predicted outputs, and then into a loss value; in the reverse direction, backpropagation is used to propagate the changes backwards through the layers, and then turn them into parameter updates.

  (III) the basis of parameter update via gradient descent is differentiation, e.g. in the simple case we differentiate the function mapping a parameter to its associated loss in order to reduce that loss (a minimal numeric sketch of all three aspects follows this list).
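As a minimal numeric illustration of these three aspects (our own sketch, not the paper's implementation), the following Python snippet trains the single parameter of the parametric map $f(p, x) = \sin(px)$ by a hand-derived gradient-descent update on a squared-error loss.

```python
import math

# (I) A parametric map f : P x X -> Y, here f(p, x) = sin(p * x).
def f(p, x):
    return math.sin(p * x)

# (III) Differentiation: the derivative of the squared-error loss
# (f(p, x) - y)^2 with respect to the parameter p, by the chain rule.
def dloss_dp(p, x, y):
    return 2 * (f(p, x) - y) * math.cos(p * x) * x

# (II) Bidirectional flow: the forward pass computes a prediction, the
# backward pass turns the error into an update of the parameter p.
p, lr = 0.5, 0.1          # initial parameter and learning rate
x, y = 1.0, 0.8           # a single training pair
for _ in range(100):
    p -= lr * dloss_dp(p, x, y)

print(f(p, x))            # approaches the target 0.8
```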

We model bidirectionality via lenses [13, 6, 33] and, based upon the above three insights, we propose the notion of parametric lens as the fundamental semantic structure of learning. In a nutshell, a parametric lens is a process with three kinds of interfaces: inputs, outputs, and parameters. On each interface, information flows both ways, i.e. computations are bidirectional. These data are best explained with our graphical representation of parametric lenses, with inputs $A$, $A'$, outputs $B$, $B'$, parameters $P$, $P'$, and arrows indicating information flow (below left). The graphical notation also makes evident that parametric lenses are open systems, which may be composed along their interfaces (below center and right).

[String diagrams omitted: a single parametric lens (left), a composition of parametric lenses (center), and a reparameterisation (right).] (1)

This pictorial formalism is not just an intuitive sketch: as we will show, it can be understood as a completely formal (graphical) syntax using the formalism of string diagrams [43], in a way similar to how other computational phenomena have been recently analysed, e.g. in quantum theory [16], control theory [4, 8], and digital circuit theory [29].

It is intuitively clear how parametric lenses express aspects (I) and (II) above, whereas (III) will be achieved by studying them in a space of ‘differentiable objects’ (in a sense that will be made precise). The main technical contribution of our paper is showing how the various ingredients involved in learning (the model, the optimiser, the error map and the learning rate) can be uniformly understood as being built from parametric lenses.

Figure 2. The parametric lens that captures the learning process informally sketched in Figure 1. Note each component is a lens itself, whose composition yields the interactions described in Figure 1. Defining this picture formally will be the subject of Sections 3-4.

We will use category theory as the formal language to develop our notion of parametric lenses, and make Figure 2 mathematically precise. The categorical perspective brings several advantages, which are well-known, established principles in programming language semantics [58, 1, 47]. Three of them are particularly important to our contribution, as they constitute distinctive advantages of our semantic foundations:

Abstraction:

Our approach studies which categorical structures are sufficient to perform gradient-based learning. This analysis abstracts away from the standard case of neural networks in several different ways: as we will see, it encompasses other models (namely Boolean circuits), different kinds of optimisers (including Adagrad, Adam, Nesterov momentum), and error maps (including quadratic and softmax cross entropy loss). These can be all understood as parametric lenses, and different forms of learning result from their interaction.

Uniformity:

As seen in Figure 1, learning involves ingredients that are seemingly quite different: a model, an optimiser, a loss map, etc. We will show how all these notions may be seen as instances of the categorical definition of a parametric lens, thus yielding a remarkably uniform description of the learning process, and supporting our claim of parametric lenses being a fundamental semantic structure of learning.

Compositionality:

The use of categorical structures to describe computation naturally enables compositional reasoning whereby complex systems are analysed in terms of smaller, and hence easier to understand, components. Compositionality is a fundamental tenet of programming language semantics; in the last few years, it has found application in the study of diverse kinds of computational models, across different fields — see e.g. [53, 28, 16, 8]. As made evident by Figure 2, our approach models a neural network as a parametric lens, resulting from the composition of simpler parametric lenses, capturing the different ingredients involved in the learning process. Moreover, as all the simpler parametric lenses are themselves composable, one may engineer a different learning process by simply plugging a new lens on the left or right of existing ones. This means that one can glue together smaller and relatively simple networks to create larger and more sophisticated neural networks.

We now give a synopsis of our contributions:

  • In Section 2, we introduce the tools necessary to define our notion of parametric lens. First, in Section 2.1, we introduce the notion of a parametric category, which amounts to a functor $\mathbf{Para}(-)$ turning a category $\mathcal{C}$ into one, $\mathbf{Para}(\mathcal{C})$, of ‘parametric $\mathcal{C}$-maps’. Second, we recall lenses (Section 2.2). In a nutshell, a lens is a categorical morphism equipped with operations to view and update values in a certain data structure. Lenses play a prominent role in functional programming [56], as well as in the foundations of database theory [35] and, more recently, game theory [28]. Considering lenses in $\mathcal{C}$ simply amounts to the application of a functorial construction $\mathbf{Lens}(-)$, yielding $\mathbf{Lens}(\mathcal{C})$. Finally, we recall the notion of a Cartesian reverse differential category (CRDC): a categorical structure axiomatising the notion of differentiation [15] (Section 2.4). We wrap up in Section 2.3, by combining these ingredients into the notion of parametric lens, formally defined as a morphism in $\mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ for a CRDC $\mathcal{C}$. In terms of our desiderata (I)-(III) above, note that $\mathbf{Para}(-)$ accounts for (I), $\mathbf{Lens}(-)$ accounts for (II), and the CRDC structure accounts for (III).

  • As seen in Figure 1, in the learning process there are many components at work: the model, the optimiser, the loss map, the learning rate, etc. In Section 3, we show how the notion of parametric lens provides a uniform characterisation for such components. Moreover, for each of them, we show how different variations appearing in the literature become instances of our abstract characterisation. The plan is as follows:

    ∘ In Section 3.1, we show how the combinatorial model that is the subject of training can be seen as a parametric lens. The conditions we provide are met by the ‘standard’ case of neural networks, but also enable the study of learning for other classes of models. In particular, another instance is Boolean circuits: learning of these structures is relevant to binarisation [18], and it has been explored recently using a categorical approach [59], which turns out to be a particular case of our framework. We continue by describing the internals of a model, and translating several examples of models in deep learning to their categorical form. This includes linear layers, biases, activations, and convolutional layers, but also general techniques such as weight tying and batching.

    ∘ In Section 3.2, we show how the loss maps associated with training are also parametric lenses. Our approach covers the cases of quadratic error, Boolean error, and Softmax cross entropy, but also the ‘dot product loss’ associated with the phenomenon of deep dreaming [40, 38, 22, 51].

    ∘ In Section 3.3, we model the learning rate as a parametric lens. This analysis also allows us to contrast how the learning rate is handled in the ‘real-valued’ case of neural networks with the ‘Boolean-valued’ case of Boolean circuits.

    ∘ In Section 3.4, we show how optimisers can be modelled as ‘reparameterisations’ of models viewed as parametric lenses. As case studies, in addition to basic gradient ascent and descent, we consider the stateful variants Momentum [44], Nesterov Momentum [57], Adagrad [23], and Adam (Adaptive Moment Estimation) [36], as well as optimiser composition (Subsection 3.4.1). Moreover, for Boolean circuits, we show how the reverse derivative ascent of [59] can also be regarded in this way.

  • In Section 4, we study how the composition of the lenses defined in Section 3 yields a description of different kinds of learning processes.

    ∘ Section 4.1 is dedicated to modelling supervised learning of parameters, in the way described in Figure 1. This essentially amounts to studying the composite of lenses expressed in Figure 2, for different choices of the various components. In particular, we look at (i) quadratic loss with basic gradient descent, (ii) softmax cross entropy loss with basic gradient descent, (iii) quadratic loss with Nesterov momentum, and (iv) learning in Boolean circuits with XOR loss and basic gradient ascent.

    ∘ In Section 4.2, we describe how a system traditionally considered unsupervised can be recast in supervised form: Generative Adversarial Networks [30, 3]. We define this model abstractly as a parametric lens, and describe how a particular instantiation thereof — the Wasserstein GAN [3] — arises as a supervised learning system with the dot product loss and the gradient descent-ascent optimiser.

    ∘ In order to showcase the flexibility of our approach, in Section 4.3 we depart from our ‘core’ case study of parameter learning and turn our attention to supervised learning of inputs, also called deep dreaming — the idea behind this technique is that, instead of the network parameters, one updates the inputs, in order to elicit a particular interpretation [40, 38, 22, 51]. Deep dreaming can be easily expressed within our approach, with a different rearrangement of the parametric lenses involved in the learning process; see (11) below. The abstract viewpoint of categorical semantics provides a mathematically precise and visually captivating description of the differences between the usual parameter learning process and deep dreaming.

  • In Section 5, we describe a proof-of-concept Python implementation, available at [19], based on the theory developed in this paper. This code is intended to show more concretely the payoff of our approach. Model architectures, as well as the various components participating in the learning process, are now expressed in a uniform, principled mathematical language, in terms of lenses. As a result, computing network gradients is greatly simplified, as it amounts to lens composition. Moreover, the modularity of this approach allows one to more easily tune the various parameters of training. We showcase our library in a number of experiments, and validate its correctness by achieving accuracy on par with an equivalent model in Keras, a mainstream deep learning framework [12]. In particular, we create a working non-trivial neural network model for the MNIST image-classification problem [37].

  • Finally, in Sections 6 and 7, we discuss related and future work.

Note that this paper extends a previous conference version [20]. Section 3.1 has been extended with examples of different architectures and techniques in deep learning, while Section 4.2, which considers unsupervised learning, and Remark 2, followed by a discussion on the axioms of CRDCs needed in our framework, are new. We have also included missing proofs and complementary background material (see in particular Appendices A-B).

2. Categorical Toolkit

In this section we describe the three categorical components of our framework, each corresponding to an aspect of gradient-based learning: (I) the $\mathbf{Para}$ construction (Section 2.1), which builds a category of parametric maps, (II) the $\mathbf{Lens}$ construction, which builds a category of “bidirectional” maps (Section 2.2), and (III) the combination of these two constructions into the notion of “parametric lenses” (Section 2.3). Finally (IV) we recall Cartesian reverse differential categories — categories equipped with an abstract gradient operator.

Notation

We shall use $f;g$ for sequential composition of morphisms $f \colon A \to B$ and $g \colon B \to C$ in a category, $1_A$ for the identity morphism on $A$, and $I$ for the unit object of a symmetric monoidal category.

2.1. Parametric Maps

In supervised learning one is typically interested in approximating a function $g : \mathbb{R}^n \to \mathbb{R}^m$ for some $n$ and $m$. To do this, one begins by building a neural network, which is a smooth map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbb{R}^p$ is the set of possible weights of that neural network. Then one looks for a value $q \in \mathbb{R}^p$ such that the function $f(q, -) : \mathbb{R}^n \to \mathbb{R}^m$ closely approximates $g$. We formalise these maps categorically via the $\mathbf{Para}$ construction [26, 27, 10, 34].

Definition (Parametric category). Let $(\mathcal{C}, \otimes, I)$ be a strict symmetric monoidal category. (One can also define $\mathbf{Para}(\mathcal{C})$ in the case when $\mathcal{C}$ is non-strict; however, the result would not be a category but a bicategory.) We define a category $\mathbf{Para}(\mathcal{C})$ with

  • objects those of $\mathcal{C}$;

  • a map from $A$ to $B$ is a pair $(P, f)$, with $P$ an object of $\mathcal{C}$ and $f : P \otimes A \to B$;

  • the identity on $A$ is the pair $(I, 1_A)$ (since $\otimes$ is strict monoidal, $I \otimes A = A$);

  • the composite of maps $(P, f) : A \to B$ and $(P', f') : B \to C$ is the pair $(P' \otimes P, (1_{P'} \otimes f) ; f')$.

Example. Take the category $\mathbf{Smooth}$ whose objects are natural numbers and whose morphisms $f : n \to m$ are smooth maps from $\mathbb{R}^n$ to $\mathbb{R}^m$. This is a strict symmetric monoidal category with monoidal product given by addition of natural numbers (so that $n \otimes m$ represents $\mathbb{R}^n \times \mathbb{R}^m = \mathbb{R}^{n+m}$). As described above, the category $\mathbf{Para}(\mathbf{Smooth})$ can be thought of as a category of neural networks: a map in this category from $n$ to $m$ consists of a choice of $p$ and a map $f : \mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, with $\mathbb{R}^p$ representing the set of possible weights of the neural network.
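To make the composition rule concrete, the following is a minimal Python sketch (our own naming, unrelated to the library of Section 5) in which a morphism of $\mathbf{Para}(\mathbf{Smooth})$ is represented as a pair of a parameter label and a function taking (parameter, input) to output.

```python
# Sketch: a parametric map A -> B is a pair (P, f) with f(p, a) = b.
def para_compose(Pf, Qg):
    """Compose (P, f) : A -> B with (Q, g) : B -> C into (Q (x) P, h) : A -> C."""
    P, f = Pf
    Q, g = Qg
    def h(qp, a):
        q, p = qp                 # the composite's parameter is a pair (q, p)
        return g(q, f(p, a))      # run f first, then g
    return (Q + " (x) " + P, h)

# Two parametric maps R -> R: f(p, a) = p * a and g(q, b) = q + b.
f = ("R", lambda p, a: p * a)
g = ("R", lambda q, b: q + b)
param_label, h = para_compose(f, g)
print(param_label, h((2.0, 3.0), 5.0))   # "R (x) R" and 2 + 3*5 = 17
```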

As we will see in the next sections, the interplay of the various components at work in the learning process becomes much clearer once the morphisms of $\mathbf{Para}(\mathcal{C})$ are represented using the pictorial formalism of string diagrams, which we now recall. In fact, we will mildly massage the traditional notation for string diagrams (below left), by representing a morphism $f \colon A \to B$ in $\mathbf{Para}(\mathcal{C})$ as below right.

[String diagrams omitted: a morphism in the traditional notation (left), and in the $\mathbf{Para}(\mathcal{C})$ notation with a separate parameter wire $P$ (right).]

This is to emphasise the special role played by $P$, reflecting the fact that in machine learning data and parameters have different semantics. String-diagrammatic notation also allows us to neatly represent the composition of maps $(P, f) : A \to B$ and $(P', f') : B \to C$ (below left), and the “reparameterisation” of $(P, f) : A \to B$ by a map $\alpha : Q \to P$ (below right), yielding a new map $(Q, (\alpha \otimes 1_A) ; f) : A \to B$.

[Diagrams omitted: composition of parametric maps (left) and reparameterisation (right).] (2)

Intuitively, reparameterisation changes the parameter space of $(P, f) : A \to B$ to some other object $Q$, via a map $\alpha : Q \to P$. We shall see later that gradient descent and its many variants can naturally be viewed as reparameterisations.

Note that the coherence rules for combining the two operations in (2) work as expected, as these diagrams can ultimately be ‘compiled’ down to string diagrams for monoidal categories.
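Reparameterisation can be sketched in the same style as the composition snippet above; the helper name `reparameterise` and the example map are ours.

```python
# Sketch: reparameterise (P, f) : A -> B along alpha : Q -> P.
def reparameterise(Pf, Q, alpha):
    P, f = Pf
    return (Q, lambda q, a: f(alpha(q), a))

# Example: constrain the parameter of f(p, a) = p * a to be non-negative
# by reparameterising along alpha(q) = q * q.
f = ("R", lambda p, a: p * a)
g = reparameterise(f, "R", lambda q: q * q)
print(g[1](3.0, 2.0))   # alpha(3) = 9, so the result is 9 * 2 = 18
```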

2.2. Lenses

In machine learning (or even learning in general) it is fundamental that information flows both forwards and backwards: the ‘forward’ flow corresponds to a model’s predictions, and the ‘backwards’ flow to corrections to the model. The category of lenses is the ideal setting to capture this type of structure, as it is a category consisting of maps with both a “forward” and a “backward” part.

Definition. For any Cartesian category $\mathcal{C}$, the category of (bimorphic) lenses in $\mathcal{C}$, $\mathbf{Lens}(\mathcal{C})$, is the category with the following data:

  • Objects are pairs $(A, A')$ of objects in $\mathcal{C}$, written as $\binom{A}{A'}$;

  • A map from $\binom{A}{A'}$ to $\binom{B}{B'}$ consists of a pair $(f, f^*)$ (also written as $\binom{f}{f^*}$), where $f : A \to B$ (called the get or forward part of the lens) and $f^* : A \times B' \to A'$ (called the put or backwards part of the lens);

  • The identity on $\binom{A}{A'}$ is the pair $\binom{1_A}{\pi_1}$;

  • The composite of $\binom{f}{f^*} : \binom{A}{A'} \to \binom{B}{B'}$ and $\binom{g}{g^*} : \binom{B}{B'} \to \binom{C}{C'}$ is given by get $f;g$ and put $\langle \pi_0, \langle \pi_0 ; f, \pi_1 \rangle ; g^* \rangle ; f^*$.
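The composite just defined can be spelled out in code. Below is a minimal sketch (class and method names are ours) in which the put of the composite is computed exactly as in the definition: run the first get forward, ask the second lens for a change, and pull that change back through the first put.

```python
class Lens:
    """A bimorphic lens: get : A -> B and put : A x B' -> A' (sketch)."""
    def __init__(self, get, put):
        self.get = get
        self.put = put

    def then(self, other):
        """Sequential composition self ; other."""
        get = lambda a: other.get(self.get(a))
        put = lambda a, dc: self.put(a, other.put(self.get(a), dc))
        return Lens(get, put)

# The identity lens on (A, A'): get is the identity, put projects the change.
identity = Lens(lambda a: a, lambda a, da: da)

# Example: the lens (f, R[f]) for f(x) = 2x (anticipating Section 2.4),
# composed with itself; the backward pass multiplies changes by 4.
double = Lens(lambda x: 2 * x, lambda x, dy: 2 * dy)
quadruple = double.then(double)
print(quadruple.get(3), quadruple.put(3, 1.0))   # 12 and 4.0
```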

The embedding of $\mathbf{Lens}(\mathcal{C})$ into the category of Tambara modules over $\mathcal{C}$ (see [7, Thm. 23]) provides a rich string-diagrammatic language, in which lenses may be represented with forward/backward wires indicating the information flow. In this language, a morphism $\binom{f}{f^*} : \binom{A}{A'} \to \binom{B}{B'}$ is written as below left, which can be ‘expanded’ as below right.

[Diagrams omitted: a lens drawn as a single box with forward and backward wires (left), and its expansion into separate get and put components (right).]

In this language it is clear how to describe the composite of $\binom{f}{f^*} : \binom{A}{A'} \to \binom{B}{B'}$ and $\binom{g}{g^*} : \binom{B}{B'} \to \binom{C}{C'}$:

[Diagram omitted: the string-diagrammatic composition of two lenses.] (3)
Remark 1.

Note that $\mathbf{Lens}(\mathcal{C})$ is a monoidal category, with $\binom{A}{A'} \otimes \binom{B}{B'}$ defined as $\binom{A \times B}{A' \times B'}$. However, in general $\mathbf{Lens}(\mathcal{C})$ is not itself Cartesian. This is easy to see by looking even at a terminal object: if $T$ is a terminal object in $\mathcal{C}$, then in general $\binom{T}{T}$ will not be a terminal object in $\mathbf{Lens}(\mathcal{C})$ — if it were, there would be a unique lens $\binom{!_A}{!_A^*} : \binom{A}{A'} \to \binom{T}{T}$, whose put part would need to be a (unique) map $A \times T \to A'$, but in general there are many such maps.

2.3. Parametric Lenses

The fundamental category where supervised learning takes place is the composite $\mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ of the two constructions in the previous sections.

Definition. The category $\mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ of parametric lenses on $\mathcal{C}$ is defined as follows.

  • Its objects are pairs $\binom{A}{A'}$ of objects from $\mathcal{C}$;

  • A morphism from $\binom{A}{A'}$ to $\binom{B}{B'}$, called a parametric lens (in [26], these are called learners; however, in this paper we study them in a much broader light, see Section 6), is a choice of parameter pair $\binom{P}{P'}$ and a lens $\binom{f}{f^*} : \binom{P}{P'} \times \binom{A}{A'} \to \binom{B}{B'}$, where $f : P \times A \to B$ and $f^* : P \times A \times B' \to P' \times A'$.

String diagrams for parametric lenses are built by simply composing the graphical languages of the previous two sections — see (1), where respectively a morphism, a composition of morphisms, and a reparameterisation are depicted.

Given a generic morphism in $\mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ as depicted in (1) on the left, one can see how it is possible to “learn” new values from $f$: it takes as input an element of $A$, a parameter $P$, and a change $B'$, and outputs an element of $B$, a change in $A$, and a change $P'$. This last element is the key component for supervised learning: intuitively, it says how to change the parameter values to get the neural network closer to the true value of the desired function.

The question, then, is how one is to define such a parametric lens given nothing more than a neural network, i.e., a parametric map $(P, f) : A \to B$. This is precisely what the gradient operation provides, and its generalisation to categories is explored in the next subsection.
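Before moving to differentiation, here is a minimal sketch of a parametric lens as plain data: a forward map $P \times A \to B$ and a backward map $P \times A \times B' \to P' \times A'$ (our naming; the backward map below anticipates the reverse derivatives of Section 2.4).

```python
class ParaLens:
    """A parametric lens (sketch): fwd(p, a) -> b and bwd(p, a, db) -> (dp, da)."""
    def __init__(self, fwd, bwd):
        self.fwd = fwd
        self.bwd = bwd

# Example: multiplication f(p, a) = p * a; its backward pass returns the
# requested changes in the parameter and in the input.
mul = ParaLens(lambda p, a: p * a,
               lambda p, a, db: (a * db, p * db))

print(mul.fwd(3.0, 2.0))        # 6.0
print(mul.bwd(3.0, 2.0, 1.0))   # (2.0, 3.0): nudge for p and nudge for a
```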

2.4. Cartesian Reverse Differential Categories

Fundamental to all types of gradient-based learning is, of course, the gradient operation. In most cases this gradient operation is performed in the category of smooth maps between Euclidean spaces. However, recent work [59] has shown that gradient-based learning can also work well in other categories; for example, in a category of Boolean circuits. Thus, to encompass these examples in a single framework, we will work in a category with an abstract gradient operation.

Definition. A Cartesian left additive category [15, Defn. 1] consists of a category $\mathcal{C}$ with chosen finite products (including a terminal object), and an addition operation and zero morphism in each homset, satisfying various axioms. A Cartesian reverse differential category (CRDC) [15, Defn. 13] consists of a Cartesian left additive category $\mathcal{C}$, together with an operation which provides, for each map $f : A \to B$ in $\mathcal{C}$, a map $R[f] : A \times B \to A$ satisfying various axioms (see Def. B).

For $f : A \to B$, the pair $\binom{f}{R[f]}$ forms a lens from $\binom{A}{A}$ to $\binom{B}{B}$. We will pursue the idea that $R[f]$ acts as a backwards map, thus giving a means to “learn” $f$.

Note that assigning the type $A \times B \to A$ to $R[f]$ hides some relevant information: $B$-values in the domain and $A$-values in the codomain of $R[f]$ do not play the same role as values of the same types in $f : A \to B$: in $R[f]$, they really take in a tangent vector at $B$ and output a tangent vector at $A$ (cf. the definition of $R[f]$ in $\mathbf{Smooth}$, Example 2.4 below). To emphasise this, we will type $R[f]$ as a map $A \times B' \to A'$ (even though in reality $A = A'$ and $B = B'$), thus meaning that $\binom{f}{R[f]}$ is actually a lens from $\binom{A}{A'}$ to $\binom{B}{B'}$. This typing distinction will be helpful later on, when we want to add additional components to our learning algorithms.

The following two examples of CRDCs will serve as the basis for the learning scenarios of the upcoming sections.

Example. The category $\mathbf{Smooth}$ (Example 2.1) is Cartesian, with product again given by addition of natural numbers, and it is also a Cartesian reverse differential category: given a smooth map $f : \mathbb{R}^n \to \mathbb{R}^m$, the map $R[f] : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ sends a pair $(x, v)$ to $J[f]^T(x) \cdot v$: the transpose of the Jacobian of $f$ at $x$, applied to the vector $v$. For example, if $f : \mathbb{R}^2 \to \mathbb{R}^3$ is defined as $f(x_1, x_2) := (x_1^3 + 2x_1x_2,\ x_2,\ \sin(x_1))$, then $R[f] : \mathbb{R}^2 \times \mathbb{R}^3 \to \mathbb{R}^2$ is given by

\[
(x, v) \mapsto \begin{bmatrix} 3x_1^2 + 2x_2 & 0 & \cos(x_1) \\ 2x_1 & 1 & 0 \end{bmatrix} \cdot \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}.
\]

Using the reverse derivative (as opposed to the forward derivative) is well known to be much more computationally efficient for functions $f : \mathbb{R}^n \to \mathbb{R}^m$ when $m \ll n$ (for example, see [31]), as is the case in most supervised learning situations (where often $m = 1$).
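As a concrete check, the running example and its reverse derivative can be written out directly in Python (hand-coded from the transposed Jacobian above; a sketch, not the paper's library).

```python
import math

# f : R^2 -> R^3 from the example above.
def f(x1, x2):
    return (x1**3 + 2*x1*x2, x2, math.sin(x1))

# R[f] : R^2 x R^3 -> R^2, i.e. (x, v) |-> J[f]^T(x) . v, written out by hand.
def R_f(x, v):
    (x1, x2), (v1, v2, v3) = x, v
    return ((3*x1**2 + 2*x2) * v1 + 0 * v2 + math.cos(x1) * v3,
            2*x1 * v1 + 1 * v2 + 0 * v3)

print(f(1.0, 2.0))                        # (5.0, 2.0, 0.841...)
print(R_f((1.0, 2.0), (1.0, 0.0, 0.0)))   # (7.0, 2.0): gradient of the first output
```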

Example. Another CRDC is the symmetric monoidal category $\mathrm{POLY}_{\mathbb{Z}_2}$ [15, Example 14], with objects the natural numbers and morphisms $f : A \to B$ the $B$-tuples of polynomials in $\mathbb{Z}_2[x_1, \ldots, x_A]$. When presented by generators and relations, these morphisms can be viewed as a syntax for Boolean circuits, with parametric lenses for such circuits (and their reverse derivative) described in [59].
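For intuition, the reverse derivatives of two small maps in $\mathrm{POLY}_{\mathbb{Z}_2}$ can be computed from their formal partial derivatives, reduced mod 2. The snippet below is our own illustrative sketch; see [59] for the general construction.

```python
# XOR: f(x1, x2) = x1 + x2 over Z_2; both partial derivatives are 1.
def R_xor(x, v):
    return (v % 2, v % 2)

# AND: f(x1, x2) = x1 * x2 over Z_2; the partials are x2 and x1.
def R_and(x, v):
    x1, x2 = x
    return ((x2 * v) % 2, (x1 * v) % 2)

print(R_xor((1, 0), 1))   # (1, 1): flipping either input flips the output
print(R_and((1, 0), 1))   # (0, 1): only flipping x2 would change the output
```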

Remark 2.

The definition of a CRDC (see Def. B) comprises seven axioms describing the interaction of the reverse differential operator with the rest of the cartesian left-additive structure. As pointed out in [14, Sec. 2.3], the last two axioms are independent of the others. In this paper we will additionally see that these two axioms are not needed to compositionally model the update step of supervised learning.

The remark above can be stated abstractly in two steps. Firstly, we note that in a CRDC $\mathcal{C}$, each morphism $f$ defines a lens $\binom{f}{R[f]}$ whose backwards map is additive in the second component (see Def. B for the definition of additivity in the second component). This defines a subcategory $\mathbf{Lens}_A(\mathcal{C})$ of $\mathbf{Lens}(\mathcal{C})$, and the following functor.

Theorem. Lenses with backward passes additive in the second component form a functor
\[
\mathbf{Lens}_A : \mathbf{CLACat} \to \mathbf{CLACat}.
\]
Proof 2.1.

See Appendix. Definition B contains the definition of the category $\mathbf{CLACat}$, Definition 5 the definition of $\mathbf{Lens}_A(\mathcal{C})$, Prop. 6 the proof that $\mathbf{Lens}_A(\mathcal{C})$ is a cartesian left-additive category, and Prop. 7 the action of $\mathbf{Lens}_A$ on morphisms.

The second step is the observation that a coalgebra of this functor gives us a choice of a cartesian left-additive category $\mathcal{C}$ and a cartesian left-additive functor $\mathbf{R}_{\mathcal{C}} : \mathcal{C} \to \mathbf{Lens}_A(\mathcal{C})$ such that a number of axioms are satisfied: precisely the first five axioms of a CRDC.

Proposition (compare [15, Prop. 31]). A coalgebra of the copointed $\mathbf{Lens}_A$ endofunctor gives rise to a cartesian left-additive category $\mathcal{C}$ equipped with a reverse differential combinator $R$ which satisfies the first five axioms of a cartesian reverse derivative category.

Proof 2.2.

A coalgebra of $\mathbf{Lens}_A$ consists of a category $\mathcal{C}$ and a cartesian left-additive functor $\mathbf{R}_{\mathcal{C}} : \mathcal{C} \to \mathbf{Lens}_A(\mathcal{C})$ such that the following diagram commutes.

[Commutative diagram omitted: the coalgebra map $\mathbf{R}_{\mathcal{C}}$ followed by the copoint of $\mathbf{Lens}_A$ (which extracts the forward part) is the identity on $\mathcal{C}$.] (4)

The data of the functor $\mathbf{R}_{\mathcal{C}}$ is equivalent to the data of a reverse differential combinator $R$: every map $f : A \to B$ in $\mathcal{C}$ is mapped to a lens whose forward part, by Eq. (4) above, is constrained to be $f$ itself, so that the only choice involved in this functor is that of the backward part. What remains to prove is that this backward part satisfies the first five axioms of a CRDC. We do this in Appendix B.

We will see in the next section how only the data of the functor $\mathbf{R}_{\mathcal{C}} : \mathcal{C} \to \mathbf{Lens}_A(\mathcal{C})$ is used to model supervised learning, justifying our claim that only the first five axioms of a CRDC are used.

3. Components of learning as Parametric Lenses

As seen in the introduction, in the learning process there are many components at work: a model, an optimiser, a loss map, a learning rate, etc. In this section we show how each such component can be understood as a parametric lens. Moreover, for each component, we show how our framework encompasses several variations of the gradient-descent algorithms, thus offering a unifying perspective on many different approaches that appear in the literature.

3.1. Models as Parametric Lenses

We begin by characterising the models used for training as parametric lenses. In essence, our approach identifies a set of abstract requirements necessary to perform training by gradient descent, which covers the case studies that we will consider in the next sections.

The leading intuition is that a suitable model is a parametric map, equipped with a reverse derivative operator. Using the formal developments of Section 2, this amounts to assuming that a model is a morphism in $\mathbf{Para}(\mathcal{C})$, for a CRDC $\mathcal{C}$. In order to visualise such a morphism as a parametric lens, it then suffices to apply under $\mathbf{Para}(-)$ the canonical morphism $\mathbf{R}_{\mathcal{C}} : \mathcal{C} \to \mathbf{Lens}(\mathcal{C})$ (which exists for any CRDC $\mathcal{C}$, see Prop. 2.1; here we treat $\mathbf{R}_{\mathcal{C}}$ as postcomposed with the inclusion $\mathbf{Lens}_A(\mathcal{C}) \hookrightarrow \mathbf{Lens}(\mathcal{C})$), mapping $f$ to $\binom{f}{R[f]}$. This yields a functor $\mathbf{Para}(\mathbf{R}_{\mathcal{C}}) : \mathbf{Para}(\mathcal{C}) \to \mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$, pictorially defined as

[Diagram omitted: a parametric map $f$ is sent to the parametric lens $\binom{f}{R[f]}$.] (5)
Example (Neural networks). As noted previously, to learn a function of type $\mathbb{R}^n \to \mathbb{R}^m$, one constructs a neural network, which can be seen as a function of type $\mathbb{R}^p \times \mathbb{R}^n \to \mathbb{R}^m$, where $\mathbb{R}^p$ is the space of parameters of the neural network. As seen in Example 2.1, this is a map in the category $\mathbf{Para}(\mathbf{Smooth})$ of type $\mathbb{R}^n \to \mathbb{R}^m$ with parameter space $\mathbb{R}^p$. Then one can apply the functor in (5) to present a neural network together with its reverse derivative operator as a parametric lens, i.e. a morphism in $\mathbf{Para}(\mathbf{Lens}(\mathbf{Smooth}))$.
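As a small illustration of this functor at work (a hand-written sketch, not the paper's implementation), consider a one-dimensional ‘neuron’ $f((w, b), x) = \tanh(wx + b)$ together with the backward map given by its reverse derivative.

```python
import math

def neuron_fwd(params, x):
    w, b = params
    return math.tanh(w * x + b)

def neuron_bwd(params, x, dy):
    """Reverse derivative: returns (change in (w, b), change in x)."""
    w, b = params
    t = math.tanh(w * x + b)
    s = (1 - t * t) * dy        # derivative of tanh, scaled by the incoming change
    return (x * s, s), w * s

print(neuron_fwd((0.5, 0.1), 2.0))
print(neuron_bwd((0.5, 0.1), 2.0, 1.0))
```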

Example (Boolean and Polynomial circuits). For learning of Boolean circuits as described in [59], the recipe is the same as in Example 3.1, except that the base category is $\mathrm{POLY}_{\mathbb{Z}_2}$ (see Example 2.4). The important observation here is that $\mathrm{POLY}_{\mathbb{Z}_2}$ is a CRDC, see [15, 59], and thus we can apply the functor in (5). Note that this setting can be generalised to circuits over any polynomial ring, see [61].

Note that a model/parametric lens $f$ takes as inputs an element of $A$, a parameter $P$, and an element of $B'$ (a change in $B$), and outputs an element of $B$, a change in $A$, and a change in $P$. This is not yet sufficient to do machine learning! When we perform learning, we want to input a parameter $P$ and a pair $A \times B$ and receive a new parameter $P$. Instead, $f$ expects a change in $B$ (not an element of $B$) and outputs a change in $P$ (not an element of $P$). Deep dreaming, on the other hand, wants to return an element of $A$ (not a change in $A$). Thus, to do machine learning (or deep dreaming) we need to add additional components to $f$; we will consider these additional components in the next sections.

We now proceed to describe the internals of a model, and translate several examples of models in deep learning to their categorical form. But before doing so, we clarify some terminology. While 'layers' and 'models' are both parametric maps, the former typically refers to components of larger models, while the latter refers to the final model to be learned in the manner described in Section 4.1.

Remark 3.

An unfortunate ambiguity in deep learning terminology is the second meaning of ‘layer’. For example, the ‘hidden layer’ of a model refers to internal values of a neural network, corresponding to the ‘wires’ of a string diagram. We will avoid using the term ‘layer’ in this sense unless explicitly noted.

In deep learning, one often speaks of ‘model architectures’. This can mean either a specific model (e.g., ResNet50 [32]) or a family of models employing a particular technique. For example, one says a model has a ‘convolutional architecture’ when it contains a convolutional layer (Example 3.1.1). Examples of layers, models, and architectures are given in the following sections.

3.1.1. Layers

The dense or fully-connected layer is a component of many neural network architectures. In the categorical viewpoint, a dense layer is the composition of linear, bias, and activation layers, which we describe first. Unless explicitly noted, we will assume for simplicity that most layers are maps in 𝐏𝐚𝐫𝐚(𝐒𝐦𝐨𝐨𝐭𝐡)𝐏𝐚𝐫𝐚𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Para}(\mathbf{Smooth})bold_Para ( bold_Smooth ). However, many of the maps defined here only require a multiplication map as additional structure, and so can be defined in any cartesian distributive category [60] such as POLYSsubscriptPOLY𝑆\mathrm{POLY}_{S}roman_POLY start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT.

{exa}

[Linear layer] A linear or matrix-multiply layer is the parametric map (nm,𝗅𝗂𝗇𝖾𝖺𝗋:nm×nm):superscript𝑛𝑚𝗅𝗂𝗇𝖾𝖺𝗋superscript𝑛𝑚superscript𝑛superscript𝑚(\operatorname{\mathbb{R}}^{nm},\mathsf{linear}:\operatorname{\mathbb{R}}^{nm}% \times\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}^{m})( blackboard_R start_POSTSUPERSCRIPT italic_n italic_m end_POSTSUPERSCRIPT , sansserif_linear : blackboard_R start_POSTSUPERSCRIPT italic_n italic_m end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ), where 𝗅𝗂𝗇𝖾𝖺𝗋(M,x)=Mx𝗅𝗂𝗇𝖾𝖺𝗋𝑀𝑥𝑀𝑥\mathsf{linear}(M,x)=M\cdot xsansserif_linear ( italic_M , italic_x ) = italic_M ⋅ italic_x is the matrix-vector product of M𝑀Mitalic_M interpreted as matrix coefficients and the vector x𝑥xitalic_x.

Note that layers are best thought of as families of maps. For example, there is a linear layer morphism for all choices of dimension m𝑚mitalic_m and n𝑛nitalic_n.
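To make the lens perspective concrete, here is a minimal NumPy sketch of the linear layer together with its reverse derivative, in the spirit of the Python implementation accompanying the paper [19]; the function names `linear_fwd`/`linear_rev` are ours and hypothetical, not taken from that code, and the parameter space $\mathbb{R}^{nm}$ is represented as an $m\times n$ matrix.

```python
import numpy as np

def linear_fwd(M, x):
    # get part: R^{nm} x R^n -> R^m, the matrix-vector product linear(M, x) = M . x
    return M @ x

def linear_rev(M, x, dy):
    # put part: reverse derivative at the point (M, x) applied to a change dy in R^m,
    # returning a change in the parameters and a change in the input
    dM = np.outer(dy, x)   # change in the matrix coefficients
    dx = M.T @ dy          # change in the input vector
    return dM, dx
```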

{exa}

[Bias layer] A bias layer is a parametric map (n,+)superscript𝑛(\operatorname{\mathbb{R}}^{n},+)( blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , + ), where +:n×nn+:\operatorname{\mathbb{R}}^{n}\times\operatorname{\mathbb{R}}^{n}\to% \operatorname{\mathbb{R}}^{n}+ : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is the cartesian left-additive addition map. The reverse derivative of +++ is the copy map, so we may depict the bias layer as a parametric lens as below.

{tikzpicture}{tikzpicture}{\begin{tikzpicture}}
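A corresponding sketch of the bias layer as a lens, under the same conventions as the linear-layer sketch above: since the reverse derivative of $+$ is the copy map, the incoming change is simply propagated to both the parameter and the input.

```python
def bias_fwd(b, x):
    # get part: R^n x R^n -> R^n, pointwise addition of the bias parameter
    return b + x

def bias_rev(b, x, dy):
    # put part: the reverse derivative of + is the copy map, so the change dy
    # is sent unchanged to both the parameter b and the input x
    return dy, dy
```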

An activation layer (I,α:AnAn):𝐼𝛼superscript𝐴𝑛superscript𝐴𝑛(I,\alpha:A^{n}\to A^{n})( italic_I , italic_α : italic_A start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → italic_A start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) is a typically nonlinear, trivially parametric map often applied to the output of another layer. Many are simply the n𝑛nitalic_n-fold tensor product of a univariate function AA𝐴𝐴A\to Aitalic_A → italic_A.

{exa}

[Sigmoid Activation] The sigmoid activation layer (I,𝗌𝗂𝗀𝗆𝗈𝗂𝖽:nn):𝐼𝗌𝗂𝗀𝗆𝗈𝗂𝖽superscript𝑛superscript𝑛(I,\mathsf{sigmoid}:\operatorname{\mathbb{R}}^{n}\to\operatorname{\mathbb{R}}^% {n})( italic_I , sansserif_sigmoid : blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) is the n𝑛nitalic_n-fold tensor product σ×𝑛×σ𝜎𝑛𝜎\sigma\times\overset{n}{\ldots}\times\sigmaitalic_σ × overitalic_n start_ARG … end_ARG × italic_σ of the sigmoid function σ(x)=exp(x)exp(x)+1𝜎𝑥𝑥𝑥1\sigma(x)=\frac{\exp(x)}{\exp(x)+1}italic_σ ( italic_x ) = divide start_ARG roman_exp ( italic_x ) end_ARG start_ARG roman_exp ( italic_x ) + 1 end_ARG.

Note that, unlike the other layers considered so far, sigmoid is a map in $\mathbf{Smooth}$ but not a map in $\mathrm{POLY}_S$. An example of an activation layer which is not in $\mathbf{Smooth}$ is the ReLU map.
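As an illustration, a NumPy sketch of the sigmoid activation and its reverse derivative; since the layer is trivially parametric, no parameter change is produced. The names are ours.

```python
import numpy as np

def sigmoid_fwd(x):
    # the n-fold product of sigma(x) = exp(x) / (exp(x) + 1), applied elementwise
    # (written in the equivalent form 1 / (1 + exp(-x)))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_rev(x, dy):
    # R[sigma](x, dy) = sigma'(x) * dy, elementwise
    s = sigmoid_fwd(x)
    return dy * s * (1.0 - s)
```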

{exa}

[ReLU Activation] The ‘Rectified Linear Unit’ activation function is the map 𝖱𝖾𝖫𝖴(x)=δ>0(x)x𝖱𝖾𝖫𝖴𝑥subscript𝛿absent0𝑥𝑥\mathsf{ReLU}(x)=\delta_{>0}(x)\cdot xsansserif_ReLU ( italic_x ) = italic_δ start_POSTSUBSCRIPT > 0 end_POSTSUBSCRIPT ( italic_x ) ⋅ italic_x where δ>0subscript𝛿absent0\delta_{>0}italic_δ start_POSTSUBSCRIPT > 0 end_POSTSUBSCRIPT is the positive indicator function. The 𝖱𝖾𝖫𝖴𝖱𝖾𝖫𝖴\mathsf{ReLU}sansserif_ReLU activation layer 𝖱𝖾𝖫𝖴:AnAn:𝖱𝖾𝖫𝖴superscript𝐴𝑛superscript𝐴𝑛\mathsf{ReLU}:A^{n}\to A^{n}sansserif_ReLU : italic_A start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → italic_A start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT is again the n𝑛nitalic_n-fold tensor product of this function. Although ReLU is not a smooth map, some presentations of RDCs can be extended via Theorem 3.1 of [60] to include the positive indicator function δ>0:AA:subscript𝛿absent0𝐴𝐴\delta_{>0}:A\to Aitalic_δ start_POSTSUBSCRIPT > 0 end_POSTSUBSCRIPT : italic_A → italic_A. The 𝖱𝖾𝖫𝖴𝖱𝖾𝖫𝖴\mathsf{ReLU}sansserif_ReLU function can then be expressed in terms of this function, and its reverse derivative can be derived as 𝖱[𝖱𝖾𝖫𝖴](x,δy)=δ>0(x)δy𝖱delimited-[]𝖱𝖾𝖫𝖴𝑥subscript𝛿𝑦subscript𝛿absent0𝑥subscript𝛿𝑦\mathsf{R}[\mathsf{ReLU}](x,\delta_{y})=\delta_{>0}(x)\cdot\delta_{y}sansserif_R [ sansserif_ReLU ] ( italic_x , italic_δ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) = italic_δ start_POSTSUBSCRIPT > 0 end_POSTSUBSCRIPT ( italic_x ) ⋅ italic_δ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT.
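A sketch transcribing the ReLU formulas above, $\mathsf{ReLU}(x)=\delta_{>0}(x)\cdot x$ and $R[\mathsf{ReLU}](x,\delta_y)=\delta_{>0}(x)\cdot\delta_y$, into NumPy (names ours):

```python
import numpy as np

def relu_fwd(x):
    # ReLU(x) = delta_{>0}(x) * x, elementwise
    return np.where(x > 0, x, 0.0)

def relu_rev(x, dy):
    # R[ReLU](x, dy) = delta_{>0}(x) * dy
    return np.where(x > 0, dy, 0.0)
```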

The combination of linear, bias, and choice of activation layer gives a dense or fully-connected layer.

{exa}

[Dense Layer] A dense or fully-connected layer is the following composition.

{tikzpicture}
(6)

where some choice of input dimension $m\in\mathbb{N}$, output dimension $n\in\mathbb{N}$, and activation layer $\alpha\colon\mathbb{R}^{n}\to\mathbb{R}^{n}$ is assumed. Note that the activation layer has no parameters.
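The composite (6) can be sketched by chaining the hypothetical lenses above: the forward (get) passes are composed left to right, while the reverse (put) passes are composed right to left, replaying the forward computation to obtain the intermediate values that each reverse derivative needs. This assumes the `linear_*`, `bias_*` and activation sketches given earlier.

```python
def dense_fwd(params, x, act_fwd):
    M, b = params
    return act_fwd(bias_fwd(b, linear_fwd(M, x)))

def dense_rev(params, x, dy, act_fwd, act_rev):
    M, b = params
    z = linear_fwd(M, x)            # forward pass, keeping intermediate values
    u = bias_fwd(b, z)
    du = act_rev(u, dy)             # backwards through the activation
    db, dz = bias_rev(b, z, du)     # backwards through the bias
    dM, dx = linear_rev(M, x, dz)   # backwards through the linear layer
    return (dM, db), dx
```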

The final two examples of layers we cover here are convolutional and max-pooling layers, which are common in models for image-processing tasks.

{exa}

[Convolutional Layer] A convolutional layer is a map (k2,𝖼𝗈𝗇𝗏𝗈𝗅𝗏𝖾𝟤𝖣:k2×m2n2):superscriptsuperscript𝑘2𝖼𝗈𝗇𝗏𝗈𝗅𝗏𝖾𝟤𝖣superscriptsuperscript𝑘2superscriptsuperscript𝑚2superscriptsuperscript𝑛2(\operatorname{\mathbb{R}}^{k^{2}},\mathsf{convolve2D}:\operatorname{\mathbb{R% }}^{k^{2}}\times\operatorname{\mathbb{R}}^{m^{2}}\to\operatorname{\mathbb{R}}^% {n^{2}})( blackboard_R start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , sansserif_convolve2D : blackboard_R start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT → blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) where 𝖼𝗈𝗇𝗏𝗈𝗅𝗏𝖾𝟤𝖣𝖼𝗈𝗇𝗏𝗈𝗅𝗏𝖾𝟤𝖣\mathsf{convolve2D}sansserif_convolve2D denotes the discrete 2D convolution of a k×k𝑘𝑘k\times kitalic_k × italic_k kernel and an m×m𝑚𝑚m\times mitalic_m × italic_m image. The output of a convolutional layer is an n×n𝑛𝑛n\times nitalic_n × italic_n image with n=max(m,k)min(m,k)+1𝑛𝑚𝑘𝑚𝑘1n=\max(m,k)-\min(m,k)+1italic_n = roman_max ( italic_m , italic_k ) - roman_min ( italic_m , italic_k ) + 1.

A number of further variations of the convolutional layer exist in the literature, but the basic idea is to use 2D convolution to map a kernel (the parameters) over the input. Convolutional layers are frequently composed with max-pooling layers.
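As a rough sketch, the forward (get) part of the convolutional layer in the case $m\ge k$, where the output size formula reduces to $n = m - k + 1$; as is conventional in deep learning libraries, the sliding-window operation is written as a cross-correlation, and the reverse derivative is omitted here. The names are ours.

```python
import numpy as np

def convolve2D_fwd(K, X):
    # K: k x k kernel (the parameters); X: m x m image, with m >= k.
    # Output: n x n image with n = m - k + 1.
    k, m = K.shape[0], X.shape[0]
    n = m - k + 1
    Y = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            Y[i, j] = np.sum(K * X[i:i + k, j:j + k])
    return Y
```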

{exa}

[Max-Pooling Layer] A max-pooling layer $(I, \mathsf{maxpool}\colon S^{(kn)^2}\to S^{n^2})$ computes the maximum of each of the $n^2$ size-$k\times k$ subregions of the input image.

However, as with the 𝖱𝖾𝖫𝖴𝖱𝖾𝖫𝖴\mathsf{ReLU}sansserif_ReLU layer, max-pooling layers cannot be thought of as maps in 𝐒𝐦𝐨𝐨𝐭𝐡𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Smooth}bold_Smooth. Nevertheless, by again appealing to [60, Theorem 3.1], one can extend a presentation of RDCs to include a function max:21:21\max:2\to 1roman_max : 2 → 1 from which the max-pooling layer and its reverse derivative can be derived.
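A sketch of the get part of max-pooling over non-overlapping $k\times k$ windows of a $(kn)\times(kn)$ input; as just discussed, its reverse derivative needs the extension of the RDC presentation by $\max$ and is not shown. The name is ours.

```python
import numpy as np

def maxpool_fwd(X, k):
    # X: (k*n) x (k*n) input; output: n x n maxima of the k x k subregions
    n = X.shape[0] // k
    return X.reshape(n, k, n, k).max(axis=(1, 3))
```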

3.1.2. Architectures

We now consider some examples of neural network architectures defined in terms of the layers above. Since both layers and architectures are just parametric maps, we can consider the layers by themselves as architectures already, and in fact the linear and dense layers are sufficient to solve some simple machine learning problems. The first non-trivial architecture we consider is the ‘single hidden layer’ network.

{exa}

[Hidden Layer Network] A neural network with a single ‘hidden layer’ (in the sense of Remark 3) is the composition of two 𝖽𝖾𝗇𝗌𝖾𝖽𝖾𝗇𝗌𝖾\mathsf{dense}sansserif_dense maps.

{tikzpicture}

We emphasize that the term ‘hidden layer’ here ambiguously refers to the central wires labeled bsuperscript𝑏\operatorname{\mathbb{R}}^{b}blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT rather than the dense morphisms.

This architecture is demonstrated in detail in the experiments [19] accompanying this paper. Also included in our experiments is a convolutional model for classifying images of handwritten digits (the MNIST [37] dataset). A simplified version is below.

{exa}

[Convolutional Architecture] First, define a 𝖢𝖯𝖱𝖢𝖯𝖱\mathsf{CPR}sansserif_CPR layer as the composition of a convolution, max-pooling, and 𝖱𝖾𝖫𝖴𝖱𝖾𝖫𝖴\mathsf{ReLU}sansserif_ReLU layer.

CPR={tikzpicture}𝐶𝑃𝑅{tikzpicture}CPR\>=\>\scalebox{0.8}{{ \begin{tikzpicture} }}italic_C italic_P italic_R =

where n=max(m,k)min(m,k)+1𝑛𝑚𝑘𝑚𝑘1n=\max(m,k)-\min(m,k)+1italic_n = roman_max ( italic_m , italic_k ) - roman_min ( italic_m , italic_k ) + 1. One can then define a convolutional architecture for classifying 28×28282828\times 2828 × 28-pixel images of the MNIST dataset into digits 09090-90 - 9 as follows.

{tikzpicture}

Lastly, we mention the architecture of Generative Adversarial Networks, which we define and thoroughly discuss in Section 4.2.

3.1.3. Weight Tying and Batching

A number of general techniques are employed in designing deep learning models. We now describe two examples of these techniques in terms of their categorical interpretations. The first is weight-tying, which is required in Subsection 4.2 to define a more complex architecture for unsupervised learning: the GAN.

{exa}

[Weight Tying] Weight tying is the sharing of parameters between two different components. Categorically speaking, this means using the copy map on parameters as below.

{tikzpicture}

Note that f𝑓fitalic_f and g𝑔gitalic_g have the same parameters, but might be applied to different parts of the input. In this sense, one can think of convolutional layers (Example 3.1.1) as using weight-tying.

A related technique is batching. So far, we have considered learning as updating a model with a single data example (x,y)𝑥𝑦(x,y)( italic_x , italic_y ) at each timestep. However, it is common for efficiency purposes to update the model using a batch of examples: a finite number n𝑛nitalic_n of examples (xi,yi)subscript𝑥𝑖subscript𝑦𝑖(x_{i},y_{i})( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) for i{0n}𝑖0𝑛i\in\{0\ldots n\}italic_i ∈ { 0 … italic_n }.

{exa}

[Batching] Suppose we have a model $(P, f\colon P\times A\to B)$. The batched model with batch size $n$ is a parametric map $(P, f'\colon P\times A^{n}\to B^{n})$, where $f'$ consists of $n$ copies of $f$ applied to each input, but with the same parameters. For example, when $n=2$, the batch update is as follows.

{tikzpicture}

The above diagrams highlight the relationship between weight-tying and batching. However, note that these techniques serve different purposes: while weight-tying can be used to reduce the number of weights in a model, batching is used for efficiency reasons to update a model with multiple examples in parallel.
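Batching can be sketched as weight-tying on the parameter wire. The sketch below assumes a model given by hypothetical maps `f_fwd(p, x)` and `f_rev(p, x, dy)` returning `(dp, dx)`, with the parameters stored in a single array: the same `p` is copied to every example on the way forward, and since the reverse derivative of copying is addition, the per-example parameter changes are summed on the way back.

```python
import numpy as np

def batched_fwd(f_fwd, p, xs):
    # apply f with the same parameters p to each example in the batch
    return np.stack([f_fwd(p, x) for x in xs])

def batched_rev(f_rev, p, xs, dys):
    dps, dxs = zip(*(f_rev(p, x, dy) for x, dy in zip(xs, dys)))
    # copying p forwards means summing the parameter changes backwards
    return sum(dps), np.stack(dxs)
```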

3.2. Loss Maps as Parametric Lenses

Another key component of any learning algorithm is the choice of loss map. This gives a measurement of how far the current output of the model is from the desired output. In standard learning in 𝐒𝐦𝐨𝐨𝐭𝐡𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Smooth}bold_Smooth, this loss map is viewed as a map of type B×B𝐵𝐵B\times B\to\operatorname{\mathbb{R}}italic_B × italic_B → blackboard_R. However, in our setup, this is naturally viewed as a parametric map from B𝐵Bitalic_B to \operatorname{\mathbb{R}}blackboard_R with parameter space B𝐵Bitalic_B.444Here the loss map has its parameter space equal to its input space. However, putting loss maps on the same footing as models lends itself to further generalizations where the parameter space is different, and where the loss map can itself be learned. See Generative Adversarial Networks, [10, Figure 7.]. We also generalize the codomain to an arbitrary object L𝐿Litalic_L.

{defi}

A loss map on B𝐵Bitalic_B consists of a parametric map (B,loss):𝐏𝐚𝐫𝐚(𝒞)(B,L):𝐵loss𝐏𝐚𝐫𝐚𝒞𝐵𝐿(B,\mbox{loss}):\mathbf{Para}(\operatorname{\mathcal{C}})(B,L)( italic_B , loss ) : bold_Para ( caligraphic_C ) ( italic_B , italic_L ) for some object L𝐿Litalic_L.

Note that we can precompose a loss map (B,loss):BL:𝐵loss𝐵𝐿(B,\mbox{loss})\colon B\to L( italic_B , loss ) : italic_B → italic_L with a neural network (P,f):AB:𝑃𝑓𝐴𝐵(P,f):A\to B( italic_P , italic_f ) : italic_A → italic_B (below left), and apply the functor in (5) (with 𝒞=𝐒𝐦𝐨𝐨𝐭𝐡𝒞𝐒𝐦𝐨𝐨𝐭𝐡\operatorname{\mathcal{C}}=\mathbf{Smooth}caligraphic_C = bold_Smooth) to obtain the parametric lens below right.

{tikzpicture}{tikzpicture}{tikzpicture}maps-to{tikzpicture}\begin{gathered}\scalebox{0.8}{{ \begin{tikzpicture} }}\quad\mapsto\quad\scalebox{0.8}{{ \begin{tikzpicture} }}\end{gathered}start_ROW start_CELL ↦ end_CELL end_ROW (7)

This is getting closer to the parametric lens we want: it can now receive inputs of type B𝐵Bitalic_B. However, this is at the cost of now needing an input to Lsuperscript𝐿L^{\prime}italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT; we consider how to handle this in the next section.

{exa}

[Quadratic error] In 𝐒𝐦𝐨𝐨𝐭𝐡𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Smooth}bold_Smooth, the standard loss function on bsuperscript𝑏\operatorname{\mathbb{R}}^{b}blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT is quadratic error: it uses L=𝐿L=\operatorname{\mathbb{R}}italic_L = blackboard_R and has parametric map e:b×b:𝑒superscript𝑏superscript𝑏e:\operatorname{\mathbb{R}}^{b}\times\operatorname{\mathbb{R}}^{b}\to% \operatorname{\mathbb{R}}italic_e : blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT → blackboard_R given by

\[ e(b_t, b_p) = \tfrac{1}{2}\sum_{i=1}^{b} \big((b_p)_i - (b_t)_i\big)^2 \]

where we think of btsubscript𝑏𝑡b_{t}italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as the “true” value and bpsubscript𝑏𝑝b_{p}italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT the predicted value. This has reverse derivative R[e]:b×b×b×b:𝑅delimited-[]𝑒superscript𝑏superscript𝑏superscript𝑏superscript𝑏R[e]:\operatorname{\mathbb{R}}^{b}\times\operatorname{\mathbb{R}}^{b}\times% \operatorname{\mathbb{R}}\to\operatorname{\mathbb{R}}^{b}\times\operatorname{% \mathbb{R}}^{b}italic_R [ italic_e ] : blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT × blackboard_R → blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT × blackboard_R start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT given by R[e](bt,bp,α)=α(bpbt,btbp)𝑅delimited-[]𝑒subscript𝑏𝑡subscript𝑏𝑝𝛼𝛼subscript𝑏𝑝subscript𝑏𝑡subscript𝑏𝑡subscript𝑏𝑝R[e](b_{t},b_{p},\alpha)=\alpha\cdot(b_{p}-b_{t},b_{t}-b_{p})italic_R [ italic_e ] ( italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT , italic_α ) = italic_α ⋅ ( italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - italic_b start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) — note α𝛼\alphaitalic_α suggests the idea of learning rate, which we will explore in Section 3.3.
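A direct transcription of the quadratic error and its reverse derivative into a NumPy sketch (names ours); the scalar `alpha` plays the role of the learning-rate input discussed in Section 3.3.

```python
import numpy as np

def quadratic_loss(b_true, b_pred):
    return 0.5 * np.sum((b_pred - b_true) ** 2)

def quadratic_loss_rev(b_true, b_pred, alpha):
    # R[e](b_t, b_p, alpha) = alpha * (b_p - b_t, b_t - b_p)
    return alpha * (b_pred - b_true), alpha * (b_true - b_pred)
```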

{exa}

[Boolean error] In $\mathrm{POLY}_{\mathbb{Z}_2}$, the loss function on $\mathbb{Z}_2^{b}$ which is implicitly used in [59] is a bit different: it uses $L=\mathbb{Z}_2^{b}$ and has parametric map $e\colon\mathbb{Z}_2^{b}\times\mathbb{Z}_2^{b}\to\mathbb{Z}_2^{b}$ given by

\[ e(b_t, b_p) = b_t + b_p. \]

(Note that this is $+$ in $\mathbb{Z}_2$; equivalently, this is given by $\mathsf{XOR}$.) Its reverse derivative is of type $R[e]\colon\mathbb{Z}_2^{b}\times\mathbb{Z}_2^{b}\times\mathbb{Z}_2^{b}\to\mathbb{Z}_2^{b}\times\mathbb{Z}_2^{b}$, given by $R[e](b_t,b_p,\alpha)=(\alpha,\alpha)$.

{exa}

[Softmax cross entropy] The Softmax cross entropy loss is a $\mathbb{R}^{b}$-parametric map $\mathbb{R}^{b}\to\mathbb{R}$ defined by

\[ e(b_t, b_p) = \sum_{i=1}^{b} (b_t)_i \big((b_p)_i - \log(\mathrm{Softmax}(b_p)_i)\big) \]

where $\mathrm{Softmax}(b_p)_i = \frac{\exp((b_p)_i)}{\sum_{j=1}^{b}\exp((b_p)_j)}$ is defined componentwise for each class $i$.

We note that, although btsubscript𝑏𝑡b_{t}italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT needs to be a probability distribution, at the moment there is no need to ponder the question of interaction of probability distributions with the reverse derivative framework: one can simply consider btsubscript𝑏𝑡b_{t}italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT as the image of some logits under the SoftmaxSoftmax\mathrm{Softmax}roman_Softmax function.

{exa}

[Dot product] In Deep Dreaming (Section 4.3) we often want to focus only on a particular element of the network output $\mathbb{R}^{b}$. This is done by supplying a one-hot vector $b_t$ as the ground truth to the loss function $e(b_t, b_p) = b_t \cdot b_p$, which computes the dot product of the two vectors. If the ground-truth vector $b_t$ is one-hot (active at the $i$-th element), then the dot product masks all but the $i$-th component of $b_p$. Note the reverse derivative $R[e]\colon\mathbb{R}^{b}\times\mathbb{R}^{b}\times\mathbb{R}\to\mathbb{R}^{b}\times\mathbb{R}^{b}$ of the dot product is defined as $R[e](b_t,b_p,\alpha)=(\alpha\cdot b_p,\alpha\cdot b_t)$.
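The dot-product loss is equally short as a NumPy sketch (names ours):

```python
import numpy as np

def dot_loss(b_true, b_pred):
    return np.dot(b_true, b_pred)

def dot_loss_rev(b_true, b_pred, alpha):
    # R[e](b_t, b_p, alpha) = (alpha * b_p, alpha * b_t)
    return alpha * b_pred, alpha * b_true
```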

3.3. Learning Rates as Parametric Lenses

After models and loss maps, another ingredient of the learning process is the learning rate, which we formalise as follows.

{defi}

A learning rate α𝛼\alphaitalic_α on L𝐿Litalic_L consists of a lens from (LL)FRACOP𝐿superscript𝐿\left({{L}\atop{L^{\prime}}}\right)( FRACOP start_ARG italic_L end_ARG start_ARG italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) to (11)FRACOP11\left({{1}\atop{1}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) where 1111 is a terminal object in 𝒞𝒞\operatorname{\mathcal{C}}caligraphic_C. Note that the get component of the learning rate lens must be the unique map to 1111, while the put component is a map L×1L𝐿1superscript𝐿L\times 1\to L^{\prime}italic_L × 1 → italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT; that is, simply a map α*:LL:superscript𝛼𝐿superscript𝐿\alpha^{*}:L\to L^{\prime}italic_α start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT : italic_L → italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. Thus we can view α𝛼\alphaitalic_α as a parametric lens from (LL)(11)FRACOP𝐿superscript𝐿FRACOP11\left({{L}\atop{L^{\prime}}}\right)\to\left({{1}\atop{1}}\right)( FRACOP start_ARG italic_L end_ARG start_ARG italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) → ( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) (with trivial parameter space) and compose it in 𝐏𝐚𝐫𝐚(𝐋𝐞𝐧𝐬(𝒞))𝐏𝐚𝐫𝐚𝐋𝐞𝐧𝐬𝒞\mathbf{Para}(\mathbf{Lens}(\operatorname{\mathcal{C}}))bold_Para ( bold_Lens ( caligraphic_C ) ) with a model and a loss map (cf. (7)) to get

{tikzpicture}
(8)
{exa}

In standard supervised learning in $\mathbf{Smooth}$, one fixes some $\epsilon>0$ as a learning rate, and this is used to define $\alpha$: $\alpha$ is simply the constant $-\epsilon$, i.e., $\alpha(l)=-\epsilon$ for any $l\in L$.

{exa}

In supervised learning in POLY2subscriptPOLYsubscript2\mathrm{POLY}_{\mathbb{Z}_{2}}roman_POLY start_POSTSUBSCRIPT blackboard_Z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, the standard learning rate is quite different: for a given L𝐿Litalic_L it is defined as the identity function, α(l)=l𝛼𝑙𝑙\alpha(l)=litalic_α ( italic_l ) = italic_l.

Other learning rate morphisms are possible as well: for example, one could fix some ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0 and define a learning rate in 𝐒𝐦𝐨𝐨𝐭𝐡𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Smooth}bold_Smooth by α(l)=ϵl𝛼𝑙italic-ϵ𝑙\alpha(l)=-\epsilon\cdot litalic_α ( italic_l ) = - italic_ϵ ⋅ italic_l. Such a choice would take into account how far away the network is from its desired goal and adjust the learning rate accordingly.

3.4. Optimisers as Reparameterisations

In this section we consider how to implement gradient descent, gradient ascent, and other gradient updates in our framework. To this aim, note that the parametric lens $\left({{f}\atop{R[f]}}\right)$ representing our model (see (5)) outputs a $P'$, which represents a change in the parameter space. Now, we would like to receive not just the requested change in the parameter, but the new parameter itself. This is precisely what a gradient update accomplishes, when formalised as a lens. We start by describing gradient ascent and gradient descent.

{defi}

[Gradient ascent] Let 𝒞𝒞\operatorname{\mathcal{C}}caligraphic_C be a CRDC. Gradient ascent on P:𝒞:𝑃𝒞P:\operatorname{\mathcal{C}}italic_P : caligraphic_C is a lens

\[ \left({{\mathrm{id}_P}\atop{+_P}}\right) \colon \left({{P}\atop{P}}\right) \to \left({{P}\atop{P'}}\right) \]

where $+_P\colon P\times P'\to P$ is the monoid structure of $P$. (Note that, as in the discussion in Section 2.4, we are implicitly assuming that $P=P'$; we have merely notated them differently to emphasise the different “roles” they play: the first $P$ can be thought of as “points”, the second as “vectors”.)

{tikzpicture}
Figure 3. Gradient Ascent

Intuitively, such a lens allows one to receive the requested change in parameter and implement that change by adding that value to the current parameter. By its type, we can now “plug” the gradient descent lens G:(PP)(PP):𝐺FRACOP𝑃𝑃FRACOP𝑃superscript𝑃G\colon\left({{P}\atop{P}}\right)\to\left({{P}\atop{P^{\prime}}}\right)italic_G : ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) above the model (fR[f])FRACOP𝑓𝑅delimited-[]𝑓\left({{f}\atop{R[f]}}\right)( FRACOP start_ARG italic_f end_ARG start_ARG italic_R [ italic_f ] end_ARG ) in (5) — formally, this is accomplished as a reparameterisation of the parametric morphism (fR[f])FRACOP𝑓𝑅delimited-[]𝑓\left({{f}\atop{R[f]}}\right)( FRACOP start_ARG italic_f end_ARG start_ARG italic_R [ italic_f ] end_ARG ), cf. Section 2.1. This gives us Figure 4 (left).

{tikzpicture}
{tikzpicture}
Figure 4. Model reparameterised by basic gradient descent (left) and a generic stateful optimiser (right).
{exa}

[Gradient ascent in Smooth] In 𝐒𝐦𝐨𝐨𝐭𝐡𝐒𝐦𝐨𝐨𝐭𝐡\mathbf{Smooth}bold_Smooth, the gradient ascent reparameterisation will take the output from Psuperscript𝑃P^{\prime}italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and add it to the current value of P𝑃Pitalic_P to get a new value of P𝑃Pitalic_P.

{exa}

[Gradient ascent in Boolean circuits] In the CRDC $\mathrm{POLY}_{\mathbb{Z}_2}$, the gradient ascent reparameterisation will again take the output from $P'$ and add it to the current value of $P$ to get a new value of $P$; however, since $+$ in $\mathbb{Z}_2$ is the same as $\mathsf{XOR}$, this can also be seen as taking the XOR of the current parameter and the requested change; this is exactly how this algorithm is implemented in [59].

{defi}

[Gradient descent] Let 𝒞𝒞\operatorname{\mathcal{C}}caligraphic_C be a CRDC where every monoid is additionally a commutative group.666Since a homomorphism between groups needs to satisfy less equations than a monoid homomorphism, this means that every monoid homomorphism is also a group homomorphism. This in turn means there are no extra conditions we need to impose on such a CRDC equipped with group objects. Gradient descent on P𝑃Pitalic_P is a lens

\[ \left({{\mathrm{id}_P}\atop{-_P}}\right) \colon \left({{P}\atop{P}}\right) \to \left({{P}\atop{P'}}\right) \]

where $-_P\colon P\times P'\to P$ is given by $(p, p')\mapsto p - p'$.

{tikzpicture}
Figure 5. Gradient Descent

In $\mathbf{Smooth}$ this instantiates to the usual gradient descent. Gradient ascent in $\mathrm{POLY}_{\mathbb{Z}_2}$ is equal to gradient descent, because $\mathsf{XOR}$ is its own inverse: intuitively, in $\mathrm{POLY}_{\mathbb{Z}_2}$ there is only one direction to move in (other than staying still), namely flipping the bit. Gradient descent and ascent are not usually seen as lenses, but they fit precisely into the picture we are creating.

Other variants of gradient descent also fit naturally into this framework by allowing for additional input/output data with P𝑃Pitalic_P. In particular, many of them keep track of the history of previous updates and use that to inform the next one. This is easy to model in our setup: instead of asking for a lens (PP)(PP)FRACOP𝑃𝑃FRACOP𝑃superscript𝑃\left({{P}\atop{P}}\right)\to\left({{P}\atop{P^{\prime}}}\right)( FRACOP start_ARG italic_P end_ARG start_ARG italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ), we ask instead for a lens (S×PS×P)(PP)FRACOP𝑆𝑃𝑆𝑃FRACOP𝑃superscript𝑃\left({{S\times P}\atop{S\times P}}\right)\to\left({{P}\atop{P^{\prime}}}\right)( FRACOP start_ARG italic_S × italic_P end_ARG start_ARG italic_S × italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) where S𝑆Sitalic_S is some “state” object.

{defi}

A stateful parameter update consists of a choice of object S𝑆Sitalic_S (the state object) and a lens U:(S×PS×P)(PP):𝑈FRACOP𝑆𝑃𝑆𝑃FRACOP𝑃superscript𝑃U:\left({{S\times P}\atop{S\times P}}\right)\to\left({{P}\atop{P^{\prime}}}\right)italic_U : ( FRACOP start_ARG italic_S × italic_P end_ARG start_ARG italic_S × italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ).

Again, we view this optimiser as a reparameterisation which may be “plugged in” a model as in Figure  4 (right). Let us now consider how several well-known optimisers can be implemented in this way.

{exa}

[Momentum] In the momentum variant of gradient descent, one keeps track of the previous change and uses this to inform how the current parameter should be changed. Thus, in this case, we set $S=P$, fix some $\gamma>0$, and define the momentum lens $\left({{U}\atop{U^{*}}}\right)\colon\left({{P\times P}\atop{P\times P}}\right)\to\left({{P}\atop{P'}}\right)$ by $U(s,p)=p$ and $U^{*}(s,p,p')=(s', p+s')$, where $s'=-\gamma s+p'$. Note momentum recovers gradient descent when $\gamma=0$.
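A sketch of the momentum lens, transcribing the definition above; the state `s` stores the previous update, and the names and default value of `gamma` are ours.

```python
def momentum_get(s, p):
    # get: the network still receives the plain parameter p
    return p

def momentum_put(s, p, dp, gamma=0.9):
    # put: s' = -gamma * s + dp, and the new parameter is p + s'
    s_new = -gamma * s + dp
    return s_new, p + s_new
```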

In both standard gradient descent and momentum, our lens representation has trivial get part. However, as soon as we move to more complicated variants, this is not anymore the case, as for instance in Nesterov momentum below.

{exa}

[Nesterov momentum] In Nesterov momentum, one uses the momentum from previous updates to tweak the input parameter supplied to the network. We can precisely capture this by using a small variation of the lens in the previous example. Again, we set S=P𝑆𝑃S=Pitalic_S = italic_P, fix some γ>0𝛾0\gamma>0italic_γ > 0, and define the Nesterov momentum lens (UU*):(P×PP×P)(PP):FRACOP𝑈superscript𝑈FRACOP𝑃𝑃𝑃𝑃FRACOP𝑃superscript𝑃\left({{U}\atop{U^{*}}}\right):\left({{P\times P}\atop{P\times P}}\right)\to% \left({{P}\atop{P^{\prime}}}\right)( FRACOP start_ARG italic_U end_ARG start_ARG italic_U start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_ARG ) : ( FRACOP start_ARG italic_P × italic_P end_ARG start_ARG italic_P × italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) by U(s,p)=p+γs𝑈𝑠𝑝𝑝𝛾𝑠U(s,p)=p+\gamma sitalic_U ( italic_s , italic_p ) = italic_p + italic_γ italic_s and U*superscript𝑈U^{*}italic_U start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT as in the previous example.
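Nesterov momentum only changes the get part: the network is evaluated at the "looked-ahead" parameter, while the put part coincides with the momentum sketch above (again a hypothetical sketch with our names).

```python
def nesterov_get(s, p, gamma=0.9):
    # get: U(s, p) = p + gamma * s
    return p + gamma * s

nesterov_put = momentum_put  # the put part is the same as for momentum
```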

{exa}

[Adagrad] Given any fixed $\epsilon>0$ and $\delta\sim 10^{-7}$, Adagrad [23] is given by $S=P$, with the lens whose $\mathsf{get}$ part is $(g,p)\mapsto p$. The $\mathsf{put}$ is $(g,p,p')\mapsto(g', p+\frac{\epsilon}{\delta+\sqrt{g'}}\odot p')$, where $g'=g+p'\odot p'$ and $\odot$ is the elementwise (Hadamard) product. Unlike in other optimisation algorithms, where the learning rate is the same for all parameters, Adagrad divides the learning rate of each individual parameter by the square root of the past accumulated gradients.
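A sketch of the Adagrad lens as defined above; the state `g` accumulates squared gradients elementwise, and the names and default hyperparameters are ours.

```python
import numpy as np

def adagrad_get(g, p):
    return p

def adagrad_put(g, p, dp, eps=0.01, delta=1e-7):
    g_new = g + dp * dp   # accumulate squared gradients
    return g_new, p + (eps / (delta + np.sqrt(g_new))) * dp
```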

{exa}

[Adam] Adaptive Moment Estimation (Adam) [36] is another method that computes adaptive learning rates for each parameter by storing exponentially decaying average of past gradients (m𝑚mitalic_m) and past squared gradients (v𝑣vitalic_v). For fixed β1,β2[0,1)subscript𝛽1subscript𝛽201\beta_{1},\beta_{2}\in[0,1)italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ [ 0 , 1 ), ϵ>0italic-ϵ0\epsilon>0italic_ϵ > 0, and δ108similar-to𝛿superscript108\delta\sim 10^{-8}italic_δ ∼ 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT, Adam is given by S=P×P𝑆𝑃𝑃S=P\times Pitalic_S = italic_P × italic_P, with the lens whose 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}sansserif_get part is (m,v,p)pmaps-to𝑚𝑣𝑝𝑝(m,v,p)\mapsto p( italic_m , italic_v , italic_p ) ↦ italic_p and whose 𝗉𝗎𝗍𝗉𝗎𝗍\mathsf{put}sansserif_put part is 𝗉𝗎𝗍(m,v,p,p)=(m^,v^,p+ϵδ+v^m^)𝗉𝗎𝗍𝑚𝑣𝑝superscript𝑝superscript^𝑚superscript^𝑣𝑝direct-productitalic-ϵ𝛿superscript^𝑣superscript^𝑚\mathsf{put}{}(m,v,p,p^{\prime})=(\widehat{m}^{\prime},\widehat{v}^{\prime},p+% \frac{\epsilon}{\delta+\sqrt{\widehat{v}^{\prime}}}\odot\widehat{m}^{\prime})sansserif_put ( italic_m , italic_v , italic_p , italic_p start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = ( over^ start_ARG italic_m end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , over^ start_ARG italic_v end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_p + divide start_ARG italic_ϵ end_ARG start_ARG italic_δ + square-root start_ARG over^ start_ARG italic_v end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG end_ARG ⊙ over^ start_ARG italic_m end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) where m=β1m+(1β1)psuperscript𝑚subscript𝛽1𝑚1subscript𝛽1superscript𝑝m^{\prime}=\beta_{1}m+(1-\beta_{1})p^{\prime}italic_m start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_m + ( 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) italic_p start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, v=β2v+(1β2)p2superscript𝑣subscript𝛽2𝑣1subscript𝛽2superscript𝑝2v^{\prime}=\beta_{2}v+(1-\beta_{2})p^{\prime 2}italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_v + ( 1 - italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_p start_POSTSUPERSCRIPT ′ 2 end_POSTSUPERSCRIPT, and m^=m1β1t,v^=v1β2tformulae-sequencesuperscript^𝑚superscript𝑚1superscriptsubscript𝛽1𝑡superscript^𝑣superscript𝑣1superscriptsubscript𝛽2𝑡\widehat{m}^{\prime}=\frac{m^{\prime}}{1-\beta_{1}^{t}},\widehat{v}^{\prime}=% \frac{v^{\prime}}{1-\beta_{2}^{t}}over^ start_ARG italic_m end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = divide start_ARG italic_m start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_β start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG , over^ start_ARG italic_v end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = divide start_ARG italic_v start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_β start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_ARG.
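A sketch of the Adam lens following the formulas above, with the state $S = P\times P$ storing the running averages $m$ and $v$. As written in the definition, the bias-corrected values are what get returned as the new state; the timestep `t` (starting from 1) is treated as an extra argument in this sketch, and the names and default hyperparameters are ours.

```python
import numpy as np

def adam_get(m, v, p):
    return p

def adam_put(m, v, p, dp, t, beta1=0.9, beta2=0.999, eps=0.01, delta=1e-8):
    m_new = beta1 * m + (1 - beta1) * dp
    v_new = beta2 * v + (1 - beta2) * dp * dp
    m_hat = m_new / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v_new / (1 - beta2 ** t)   # bias-corrected second moment
    return m_hat, v_hat, p + (eps / (delta + np.sqrt(v_hat))) * m_hat
```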

Note that, so far, optimisers/reparameterisations have been added to the $P/P'$ wires in order to change the model's parameters (Fig. 4). In Section 4.3 we will study them on the $A/A'$ wires instead, giving deep dreaming.

3.4.1. Can we compose optimisers?

Even though not explicitly acknowledged in the literature, optimisers can be composed, and this composition plays an important role in settings where deep learning intersects with multivariable optimisation. In such settings we’re interested in their parallel composition, therefore giving a positive answer to the above question.777One might wonder whether optimisers can be composed in sequence as well. The apparent sequential composability of optimisers is unfortunately an artefact of our limited view without dependent types. Parallel composition of optimisers arises out of the fact that optimisers are lenses, and lenses are a monoidal category (Remark 1). In such settings we might have an optimiser of two variables which descends in one direction, and ascends in the other one, for instance.

{defi}

[Gradient descent-ascent (GDA)] Given objects P𝑃Pitalic_P and Q𝑄Qitalic_Q, gradient descent-ascent on P×Q𝑃𝑄P\times Qitalic_P × italic_Q is a lens

\[ \left({{P\times Q}\atop{\mathsf{gda}}}\right) \colon \left({{P\times Q}\atop{P\times Q}}\right) \to \left({{P\times Q}\atop{P'\times Q'}}\right) \]

where $\mathsf{gda}(p,q,p',q') = (p-p',\, q+q')$.

In $\mathbf{Smooth}$ this gives an optimiser which descends on $P$ and ascends on $Q$. In $\mathrm{POLY}_{\mathbb{Z}_2}$ this map ends up computing the same update function on both parameter spaces: the one that just flips the underlying bit. This is something that ends up preventing us from modelling GANs in this setting (compare with Ex. 7, where both positive and negative polarities of the optimiser map are needed).
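As a sketch, the put part of gradient descent-ascent in $\mathbf{Smooth}$ is just the pairing of the two update rules (name ours):

```python
def gda_put(p, q, dp, dq):
    # descend on P, ascend on Q
    return p - dp, q + dq
```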

When it comes to optimisers of two parameters, gradient descent-ascent is a particular kind of optimiser that is a product of two optimisers. But not all optimisers can be factored in such a way, much as a general monoidal product does not have to be cartesian. A good example is an optimiser on two parameters called competitive gradient descent [46]. We do not explicitly define or use it in this paper, and instead refer the reader to the cited work for more information.

4. Learning with Parametric Lenses

In the previous section we have seen how all the components of learning can be modeled as parametric lenses. We now study how all these components can be put together to form learning systems. We cover the most common examples of supervised learning, also discussing different kinds of layers, architectures, and techniques such as weight tying and batching. We also consider unsupervised learning, in the form of Generative Adversarial Networks. Finally, in addition to systems that learn parameters, we study systems that learn their inputs. This is a technique commonly known as deep dreaming, and we present it as a natural counterpart of supervised learning of parameters.

Before we describe these systems, it will be convenient to represent all the inputs and outputs of our parametric lenses as parameters. In (8), we see the P/P𝑃superscript𝑃P/P^{\prime}italic_P / italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT and B/B𝐵superscript𝐵B/B^{\prime}italic_B / italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT inputs and outputs as parameters; however, the A/A𝐴superscript𝐴A/A^{\prime}italic_A / italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT wires are not. To view the A/A𝐴superscript𝐴A/A^{\prime}italic_A / italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT inputs as parameters, we compose that system with the parametric lens η𝜂\etaitalic_η we now define. The parametric lens η𝜂\etaitalic_η has the type (11)(AA)FRACOP11FRACOP𝐴superscript𝐴\left({{1}\atop{1}}\right)\to\left({{A}\atop{A^{\prime}}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) → ( FRACOP start_ARG italic_A end_ARG start_ARG italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) with parameter space (AA)FRACOP𝐴superscript𝐴\left({{A}\atop{A^{\prime}}}\right)( FRACOP start_ARG italic_A end_ARG start_ARG italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) defined by (𝗀𝖾𝗍=η1A,𝗉𝗎𝗍=ηπ1)(\mathsf{get}{}_{\eta}=1_{A},\mathsf{put}{}_{\eta}=\pi_{1})( sansserif_get start_FLOATSUBSCRIPT italic_η end_FLOATSUBSCRIPT = 1 start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT , sansserif_put start_FLOATSUBSCRIPT italic_η end_FLOATSUBSCRIPT = italic_π start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) and can be depicted graphically as {tikzpicture} . Composing η𝜂\etaitalic_η with the rest of the learning system in  (8) gives us the closed parametric lens

{tikzpicture}
(9)

This composite is now a map in 𝐏𝐚𝐫𝐚(𝐋𝐞𝐧𝐬(𝒞))𝐏𝐚𝐫𝐚𝐋𝐞𝐧𝐬𝒞\mathbf{Para}(\mathbf{Lens}(\operatorname{\mathcal{C}}))bold_Para ( bold_Lens ( caligraphic_C ) ) from (11)FRACOP11\left({{1}\atop{1}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) to (11)FRACOP11\left({{1}\atop{1}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ); all its inputs and outputs are now vertical wires, ie., parameters. Unpacking it further, this is a lens of type (A×P×BA×P×B)(11)FRACOP𝐴𝑃𝐵superscript𝐴superscript𝑃superscript𝐵FRACOP11\left({{A\times P\times B}\atop{A^{\prime}\times P^{\prime}\times B^{\prime}}}% \right)\to\left({{1}\atop{1}}\right)( FRACOP start_ARG italic_A × italic_P × italic_B end_ARG start_ARG italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) → ( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) whose 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}{}sansserif_get map is the terminal map, and whose 𝗉𝗎𝗍𝗉𝗎𝗍\mathsf{put}{}sansserif_put map is of the type A×P×BA×P×B𝐴𝑃𝐵superscript𝐴superscript𝑃superscript𝐵A\times P\times B\to A^{\prime}\times P^{\prime}\times B^{\prime}italic_A × italic_P × italic_B → italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT. It can be unpacked as the composite

\[ \begin{aligned} \mathsf{put}(a, p, b_t) = (a', p', b_t') \quad\text{where}\quad b_p &= f(p, a) \\ (b_t', b_p') &= R[\mathsf{loss}](b_t, b_p, \alpha(\mathsf{loss}(b_t, b_p))) \\ (p', a') &= R[f](p, a, b_p') \end{aligned} \]

In the next two sections we consider further additions to the image above which correspond to different types of supervised learning.

4.1. Supervised Learning of Parameters

The most common type of learning performed on (9) is supervised learning of parameters. This is done by reparameterising (cf. Section 2.1) the image in the following manner. The parameter ports are reparameterised by one of the (possibly stateful) optimisers described in the previous section, while the backward wires Asuperscript𝐴A^{\prime}italic_A start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT of inputs and Bsuperscript𝐵B^{\prime}italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT of outputs are discarded. This finally yields the complete picture of a system which learns the parameters in a supervised manner:

{tikzpicture}{tikzpicture}{{\begin{tikzpicture}}}

Fixing a particular optimiser (UU*):(S×PS×P)(PP):FRACOP𝑈superscript𝑈FRACOP𝑆𝑃𝑆𝑃FRACOP𝑃superscript𝑃\left({{U}\atop{U^{*}}}\right):\left({{S\times P}\atop{S\times P}}\right)\to% \left({{P}\atop{P^{\prime}}}\right)( FRACOP start_ARG italic_U end_ARG start_ARG italic_U start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT end_ARG ) : ( FRACOP start_ARG italic_S × italic_P end_ARG start_ARG italic_S × italic_P end_ARG ) → ( FRACOP start_ARG italic_P end_ARG start_ARG italic_P start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_ARG ) we again unpack the entire construction. This is a map in 𝐏𝐚𝐫𝐚(𝐋𝐞𝐧𝐬(𝒞))𝐏𝐚𝐫𝐚𝐋𝐞𝐧𝐬𝒞\mathbf{Para}(\mathbf{Lens}(\operatorname{\mathcal{C}}))bold_Para ( bold_Lens ( caligraphic_C ) ) from (11)FRACOP11\left({{1}\atop{1}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) to (11)FRACOP11\left({{1}\atop{1}}\right)( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) whose parameter space is (A×S×P×BS×P)FRACOP𝐴𝑆𝑃𝐵𝑆𝑃\left({{A\times S\times P\times B}\atop{S\times P}}\right)( FRACOP start_ARG italic_A × italic_S × italic_P × italic_B end_ARG start_ARG italic_S × italic_P end_ARG ). In other words, this is a lens of type (A×S×P×BS×P)(11)FRACOP𝐴𝑆𝑃𝐵𝑆𝑃FRACOP11\left({{A\times S\times P\times B}\atop{S\times P}}\right)\to\left({{1}\atop{1% }}\right)( FRACOP start_ARG italic_A × italic_S × italic_P × italic_B end_ARG start_ARG italic_S × italic_P end_ARG ) → ( FRACOP start_ARG 1 end_ARG start_ARG 1 end_ARG ) whose 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}{}sansserif_get component is the terminal map. Its 𝗉𝗎𝗍𝗉𝗎𝗍\mathsf{put}{}sansserif_put map has the type A×S×P×BS×P𝐴𝑆𝑃𝐵𝑆𝑃A\times S\times P\times B\to S\times Pitalic_A × italic_S × italic_P × italic_B → italic_S × italic_P and unpacks to 𝗉𝗎𝗍(a,s,p,bt)=U*(s,p,p)𝗉𝗎𝗍𝑎𝑠𝑝subscript𝑏𝑡superscript𝑈𝑠𝑝superscript𝑝\mathsf{put}{}(a,s,p,b_{t})=U^{*}(s,p,p^{\prime})sansserif_put ( italic_a , italic_s , italic_p , italic_b start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = italic_U start_POSTSUPERSCRIPT * end_POSTSUPERSCRIPT ( italic_s , italic_p , italic_p start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), where

\[ \begin{aligned} \mathsf{put}(a, s, p, b_t) = U^{*}(s, p, p') \quad\text{where}\quad \overline{p} &= U(s, p) \\ b_p &= f(\overline{p}, a) \\ (b_t', b_p') &= R[\mathsf{loss}](b_t, b_p, \alpha(\mathsf{loss}(b_t, b_p))) \\ (p', a') &= R[f](\overline{p}, a, b_p') \end{aligned} \]

While this formulation might seem daunting, we note that it just explicitly specifies the computation performed by a supervised learning system. The variable $\overline{p}$ represents the parameter supplied to the network by the stateful gradient update rule (in many cases this is equal to $p$); $b_p$ represents the prediction of the network (contrast this with $b_t$, which represents the ground truth from the dataset). Variables with a tick $'$ represent changes: $b'_p$ and $b'_t$ are the changes on predictions and true values respectively, while $p'$ and $a'$ are changes on the parameters and inputs. Furthermore, this arises automatically out of the rule for lens composition (3); all we needed to specify are the lenses themselves.
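To make this concrete, the following is a minimal Python sketch of the $\mathsf{put}$ map above, with the model $f$, its reverse derivative $R[f]$, the loss map and its reverse derivative, the optimiser pair $(U, U^*)$ and the learning rate $\alpha$ all supplied as plain functions. The names are illustrative and are not part of the library described in Section 5.

def make_put(f, Rf, loss, Rloss, U, U_star, alpha):
    """Assemble the put map of a supervised learning system.

    f      : (p_bar, a) -> b_p            -- the parametric model
    Rf     : (p_bar, a, b') -> (p', a')   -- its reverse derivative
    loss   : (b_t, b_p) -> loss value
    Rloss  : (b_t, b_p, l') -> (b_t', b_p')
    U, U_star : the optimiser's get / put maps on the state-parameter space
    alpha  : the learning rate, acting on the loss value
    """
    def put(a, s, p, b_t):
        p_bar = U(s, p)                                   # parameter seen by the network
        b_p = f(p_bar, a)                                 # forward pass / prediction
        _, b_p_change = Rloss(b_t, b_p, alpha(loss(b_t, b_p)))
        p_change, _ = Rf(p_bar, a, b_p_change)            # backward pass through the model
        return U_star(s, p, p_change)                     # new (state, parameter)
    return put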

We justify and illustrate our approach on a series of case studies drawn from the literature. This presentation has the advantage of treating all these instances uniformly in terms of basic constructs, highlighting their similarities and differences. First, we fix some parametric map $(\mathbb{R}^p, f) : \mathbf{Para}(\mathbf{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$ in $\mathbf{Smooth}$ and the constant negative learning rate $\alpha : \mathbb{R}$ (Example 3.3). We then vary the loss function and the gradient update, seeing how the $\mathsf{put}$ map above reduces to many of the known cases in the literature.

Example (Quadratic error, basic gradient descent). Fix the quadratic error (Example 3.2) as the loss map and basic gradient update (Example 4). Then the aforementioned $\mathsf{put}$ map simplifies. Since there is no state, its type reduces to $A \times P \times B \to P$, and we have $\mathsf{put}(a, p, b_t) = p + p'$, where $(p', a') = R[f](p, a, \alpha \cdot (f(p, a) - b_t))$.

Note that $\alpha$ here is simply a constant, and due to the linearity of the reverse derivative (Def. 2.4), we can slide $\alpha$ from the costate into the basic gradient update lens. Rewriting this update and performing this sliding, we obtain a closed-form update step

\[
\mathsf{put}(a, p, b_t) = p + \alpha \cdot (R[f](p, a, f(p, a) - b_t); \pi_0)
\]

where the negative descent component of gradient descent is here contained in the choice of the negative constant α𝛼\alphaitalic_α.

This example gives us a variety of regression algorithms solved iteratively by gradient descent: it embeds some parametric map $(\mathbb{R}^p, f) : \mathbb{R}^a \to \mathbb{R}^b$ into a system which performs regression on input data, where $a$ denotes the input to the model and $b_t$ denotes the ground truth. If the corresponding $f$ is linear and $b = 1$, we recover simple linear regression with gradient descent. If the codomain is multi-dimensional, i.e. we are predicting multiple scalars, then we recover multivariate linear regression. Likewise, we can model a multi-layer perceptron or even more complex neural network architectures performing supervised learning of parameters simply by changing the underlying parametric map.
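As an illustration, here is a hypothetical instance of the closed-form update above for the linear model $f(p, a) = \langle p, a \rangle$ (so $b = 1$), written with numpy. The choice of model, learning rate and toy data are ours, purely for exposition.

import numpy as np

def f(p, a):
    return p @ a                          # linear model: inner product

def Rf(p, a, b_change):
    return b_change * a, b_change * p     # reverse derivative: changes to (parameters, inputs)

def put(a, p, b_t, alpha=-0.01):          # alpha is the negative learning rate constant
    p_change, _ = Rf(p, a, f(p, a) - b_t)
    return p + alpha * p_change

rng = np.random.default_rng(0)
true_p = np.array([2.0, -1.0, 0.5])
p = np.zeros(3)
for _ in range(2000):                     # iterate the update over a stream of examples
    a = rng.normal(size=3)
    p = put(a, p, f(true_p, a))
print(p)                                  # approaches true_p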

Example (Softmax cross entropy, basic gradient descent). Fix Softmax cross entropy (Example 3.2) as the loss map and basic gradient update (Example 4). Again the $\mathsf{put}$ map simplifies. The type reduces to $A \times P \times B \to P$ and we have

\[
\mathsf{put}(a, p, b_t) = p + p'
\]

where $(p', a') = R[f](p, a, \alpha \cdot (\mathrm{Softmax}(f(p, a)) - b_t))$. The same rewriting performed in the previous example can be done here.

This example recovers logistic regression, i.e. classification.
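A minimal sketch of this update for a hypothetical linear model $f(W, a) = Wa$ with one-hot labels follows; only the parameter component of $R[f]$ is shown, since that is all the update uses, and all names are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max()                        # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def f(W, a):
    return W @ a                           # class scores (logits)

def RW(W, a, b_change):
    return np.outer(b_change, a)           # parameter component of R[f] for the linear model

def put(a, W, b_t, alpha=-0.1):            # alpha negative: gradient descent
    return W + alpha * RW(W, a, softmax(f(W, a)) - b_t)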

Example (Mean squared error, Nesterov momentum). Fix the quadratic error (Example 3.2) as the loss map and Nesterov momentum (Example 5) as the gradient update. This time the $\mathsf{put}$ map $A \times S \times P \times B \to S \times P$ does not have a simplified type. The implementation of $\mathsf{put}$ reduces to

\begin{align*}
\mathsf{put}(a, s, p, b_t) = (s', p + s') \quad \text{where} \quad
\overline{p} &= p + \gamma s \\
(p', a') &= R[f](\overline{p}, a, \alpha \cdot (f(\overline{p}, a) - b_t)) \\
s' &= -\gamma s + p'
\end{align*}

This example with Nesterov momentum differs in two key points from all the other ones: i) the optimiser is stateful, and ii) its 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}{}sansserif_get map is not trivial. While many other optimisers are stateful, the non-triviality of the 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}{}sansserif_get map here showcases the importance of lenses. They allow us to make precise the notion of computing a “lookahead” value for Nesterov momentum, something that is in practice usually handled in ad-hoc ways. Here, the algebra of lens composition handles this case naturally by using the 𝗀𝖾𝗍𝗀𝖾𝗍\mathsf{get}{}sansserif_get map, a seemingly trivial, unused piece of data for previous optimisers.
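The following sketch transcribes the equations above into Python, separating the non-trivial $\mathsf{get}$ (the lookahead) from the $\mathsf{put}$. Here $f$ and $R[f]$ are assumed to be supplied as in the earlier examples; the values of $\gamma$ and $\alpha$ are illustrative.

def nesterov_get(s, p, gamma=0.9):
    return p + gamma * s                       # lookahead parameter fed to the network

def nesterov_put(a, s, p, b_t, f, Rf, gamma=0.9, alpha=-0.01):
    p_bar = nesterov_get(s, p, gamma)
    p_change, _ = Rf(p_bar, a, alpha * (f(p_bar, a) - b_t))
    s_new = -gamma * s + p_change              # new optimiser state, as in the equations above
    return s_new, p + s_new                    # (new state, new parameter)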

Our last example, using a different base category $\mathrm{POLY}_{\mathbb{Z}_2}$, shows that our framework captures learning not just in continuous settings, but in discrete ones too. Again, we fix a parametric map $(\mathbb{Z}^p, f) : \mathrm{POLY}_{\mathbb{Z}_2}(\mathbb{Z}^a, \mathbb{Z}^b)$, but this time we fix the identity learning rate (Example 3.3) instead of a constant one.

Example (Basic learning in Boolean circuits). Fix XOR as the loss map (Example 3.2) and the basic gradient update (Example 4). The $\mathsf{put}$ map again simplifies. The type reduces to $A \times P \times B \to P$ and the implementation to $\mathsf{put}(a, p, b_t) = p + p'$, where $(p', a') = R[f](p, a, f(p, a) + b_t)$.
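As a toy illustration, consider the hypothetical one-parameter circuit $f(p, a) = p \oplus a$ (XOR), whose reverse derivative sends a change $b'$ to $(b', b')$. The sketch below applies the update above, with all arithmetic over $\mathbb{Z}_2$ (booleans as 0/1 integers).

def f(p, a):
    return (p + a) % 2                         # XOR

def Rf(p, a, b_change):
    return b_change, b_change                  # changes to (parameter, input)

def put(a, p, b_t):
    p_change, _ = Rf(p, a, (f(p, a) + b_t) % 2)   # XOR loss: f(p, a) + b_t over Z_2
    return (p + p_change) % 2

# After one update the circuit satisfies f(p, a) = b_t on the seen example:
print(put(a=1, p=0, b_t=0))                    # prints 1, and indeed 1 XOR 1 = 0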

A sketch of learning iteration. Having described a number of examples in supervised learning, we outline how to model learning iteration in our framework. Recall the aforementioned $\mathsf{put}$ map whose type is $A \times P \times B \to P$ (for simplicity here modelled without state $S$). This map takes an input-output pair $(a_0, b_0)$ and the current parameter $p_i$, and produces an updated parameter $p_{i+1}$. At the next time step, it takes a potentially different input-output pair $(a_1, b_1)$ and the updated parameter $p_{i+1}$, and produces $p_{i+2}$. This process is then repeated. We can model this iteration as a composition of the $\mathsf{put}$ map with itself, i.e. as the composite $(A \times \mathsf{put} \times B); \mathsf{put}$ of type $A \times A \times P \times B \times B \to P$. This map takes two input-output pairs $A \times B$ and a parameter, and produces a new parameter by processing these datapoints in sequence. One can see how this process can be iterated any number of times, and even represented as a string diagram.

But we note that with a slight reformulation of the $\mathsf{put}$ map, it is possible to obtain a conceptually much simpler definition. The key insight lies in seeing that the map $\mathsf{put} : A \times P \times B \to P$ is essentially an endo-map $P \to P$ with some extra inputs $A \times B$; it is a parametric map!

In other words, we can recast the $\mathsf{put}$ map as a parametric map $(A \times B, \mathsf{put}) : \mathbf{Para}(\mathcal{C})(P, P)$. Being an endo-map, it can be composed with itself. The resulting composite is an endo-map taking two "parameters": the input-output pair at time step $0$ and the one at time step $1$. This process can then be repeated, with $\mathbf{Para}$ composition automatically taking care of the algebra of iteration.


This reformulation captures the essence of parameter iteration: one can think of it as a trajectory $p_i, p_{i+1}, p_{i+2}, \ldots$ through the parameter space; but it is a trajectory parameterised by the dataset. With different datasets the algorithm will take a different path through this space and learn different things.
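A minimal sketch of this iteration, assuming a $\mathsf{put}$ map as in the previous examples, is simply a fold of $\mathsf{put}$ over the dataset; the function name is ours.

from functools import reduce

def iterate_put(put, p0, dataset):
    """Fold put over a sequence of input-output pairs (a_i, b_i),
    tracing a trajectory p_0, p_1, p_2, ... through parameter space."""
    return reduce(lambda p, ab: put(ab[0], p, ab[1]), dataset, p0)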

4.2. Unsupervised Learning

Many kinds of systems that are traditionally considered unsupervised can be recast in supervised form. One example is a Generative Adversarial Network [30, 3]. This is a neural network architecture that lies at the intersection of deep learning and game theory. It is a system of two neural networks trained with "competing" optimisers. One network, called the generator, has an optimiser which is, as usual, tasked with moving in the direction of the negative gradient of the loss. The other network, called the discriminator, has an optimiser which is tasked with moving in the positive, i.e. ascending, direction of the gradient of the total loss, maximising the loss. The networks are wired in such a way (Fig. 6) that the discriminator effectively serves as a loss function to the generator, i.e. it is the generator's only source of information on how to update. Dually, from the vantage point of the discriminator, the generator serves as an ever-changing source of training data.

Definition (GAN). Fix three objects $Z, X$ and $L$ in $\mathcal{C}$ (respectively called "the latent space", "the data space" and "the payoff space"). Then given two parametric morphisms

\[
(P, g) : \mathbf{Para}(\mathcal{C})(Z, X) \quad \text{and} \quad (Q, d) : \mathbf{Para}(\mathcal{C})(X, L)
\]

a generative adversarial network is a morphism $(P \times Q, \mathsf{GAN}_{g,d}) : \mathbf{Para}(\mathcal{C})(Z \times X, L \times L)$, where $\mathsf{GAN}_{g,d}$ is defined as the composite

\[
Z \times X \times P \times Q \cong Z \times P \times X \times Q \xrightarrow{\; g \times X \times \Delta_Q \;} X \times X \times Q \times Q \cong X \times Q \times X \times Q \xrightarrow{\; d \times d \;} L \times L
\]
Figure 6. A generative adversarial network as a parametric morphism.

Its string diagram representation is shown in Fig. 6, where we see that a GAN consists of two parallel tracks. We will see in Ex. 7 how the first one is used to process latent vectors, and the second one to process samples from a chosen dataset. Despite the fact that there are two boxes labelled $d$, they are weight-tied, making them behave like a single unit.

We can easily state what the reverse derivative of $\mathsf{GAN}_{g,d}$ is in terms of its components:

\begin{align}
R[\mathsf{GAN}_{g,d}](z, x_r, p, q, \alpha_g, \alpha_r) &= (z', x_r', p', q_g' + q_r') \tag{10} \\
\text{where} \quad (x_g', q_g') &= R[d](g(z, p), q, \alpha_g) \nonumber \\
(x_r', q_r') &= R[d](x_r, q, \alpha_r) \nonumber \\
(z', p') &= R[g](z, p, x_g') \nonumber
\end{align}

The pair $(\mathsf{GAN}_{g,d}, R[\mathsf{GAN}_{g,d}])$ yields a parametric lens of type $\binom{Z \times X}{Z' \times X'} \to \binom{L \times L}{L' \times L'}$ (Fig. 7), which we interpret as follows.

It consumes two pieces of data, a "latent vector" $z : Z$ and a "real" sample from the dataset $x_r : X$, in addition to a parameter $p : P$ for the generator and a parameter $q : Q$ for the discriminator. What happens then are two independent evaluations by the discriminator. The first one uses the generator's attempt at producing a sample from the dataset (obtained by feeding the latent vector into the generator, producing $g(z, p) : X$) as input to the discriminator, yielding a payoff $d(g(z, p), q) : L$ for this particular sample. The second one uses the actual sample from the dataset $x_r$, producing the payoff $d(x_r, q) : L$.
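The following is a minimal Python sketch of equation (10), assembling $R[\mathsf{GAN}_{g,d}]$ from user-supplied $g$, $R[g]$, $d$ and $R[d]$; all names are illustrative.

def R_GAN(g, Rg, d, Rd):
    def rev(z, x_r, p, q, alpha_g, alpha_r):
        x_g_change, q_g_change = Rd(g(z, p), q, alpha_g)   # backward pass on the generated sample
        x_r_change, q_r_change = Rd(x_r, q, alpha_r)       # backward pass on the real sample
        z_change, p_change = Rg(z, p, x_g_change)          # propagate through the generator
        return z_change, x_r_change, p_change, q_g_change + q_r_change
    return rev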

By choosing $\mathsf{GAN}_{g,d}$ as the parametric map representing our supervised learning model, we can differentiate it as in Fig. 7 and, with the appropriate choice of a loss function, recover the learning system known in the literature as the Wasserstein GAN [3].

Figure 7. A generative adversarial network under the image of $\mathbf{Para}(\mathbf{R}_{\mathcal{C}})$.
Example (GANs, dot product, GDA). Fix $\mathsf{GAN}_{g,d}$ as the parametric map (Def. 4.2), gradient descent-ascent (Ex. 3.4.1) as the optimiser and the dot product (Ex. 3.2) as the loss function. Then $\mathsf{put}$ becomes a map of type $Z \times X \times P \times Q \times L \times L \to P \times Q$ and its implementation reduces to

\[
\mathsf{put}(z, x, p, q, b_t) = (p - p', q + q') \quad \text{where} \quad (z', x', p', q') = R[\mathsf{GAN}_{g,d}](z, x, p, q, \alpha \cdot b_t)
\]

We can further unpack the label $b_t$ into its components $(b_{t_g}, b_{t_r})$:

\begin{align*}
\mathsf{put}(z, x_r, p, q, b_{t_g}, b_{t_r}) = (p - p', q + q_g' + q_r') \quad \text{where} \quad
(x_g', q_g') &= R[d](g(z, p), q, \alpha \cdot b_{t_g}) \\
(z', p') &= R[g](z, p, x_g') \\
(x_r', q_r') &= R[d](x_r, q, \alpha \cdot b_{t_r})
\end{align*}

This brings us to the last step, where by linearity of the backward pass we can extract $\alpha$ and the components of $b_t$:

\begin{align*}
\mathsf{put}(z, x_r, p, q, b_{t_g}, b_{t_r}) = (p - \alpha b_{t_g} p', \; q + \alpha(b_{t_g} q_g' + b_{t_r} q_r')) \quad \text{where} \quad
(x_g', q_g') &= R[d](g(z, p), q, 1) \\
(z', p') &= R[g](z, p, x_g') \\
(x_r', q_r') &= R[d](x_r, q, 1)
\end{align*}

This final form makes it possible to see how the update recovers that of Wasserstein GANs. The last missing piece is to note that the supervision labels $b_t$ here are effectively "masks". Just as in standard supervised learning an input-output pair $(x_i, y_i)$ consists of an input value and a label which guides the direction in which the output $f(x_i, p)$ should be improved, the situation here is the same. Given any latent vector $z$, its corresponding "label" is the learning signal $b_{t_g} = 1$, which does not change anything in the update; it effectively signals to the generator's and discriminator's optimisers that they should, respectively, descend (minimising the assigned loss, making the generated image more realistic next time) and ascend (maximising the loss, becoming better at detecting when its input is a sample produced by the generator). On the other hand, given any real sample $x_r$, its corresponding "label" is the learning signal $b_{t_r} = -1$, which signals to the discriminator's optimiser that it should do the opposite of what it usually does: it should descend, causing it to assign a lower loss value to actual samples from the dataset. In other words, the input-output pairs here are always of the form $((z, x)_i, (1, -1)_i)$, making this GAN in many ways a constantly supervised model. Nonetheless, these different "forces" that pull the discriminator in different directions depending on the source of the input, coupled with the ever-changing generated inputs, give GANs intrinsically complex dynamics that are still being studied.
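Putting the pieces together, a hedged sketch of the resulting gradient descent-ascent step, with the constant labels $(1, -1)$ hard-coded, reads as follows; $g$, $R[g]$, $d$, $R[d]$ and $\alpha$ are assumed as before, and parameters are assumed to be numpy arrays so that the arithmetic is pointwise.

def wgan_put(z, x_r, p, q, g, Rg, d, Rd, alpha=0.01):
    x_g_change, q_g_change = Rd(g(z, p), q, 1)
    _, p_change = Rg(z, p, x_g_change)
    _, q_r_change = Rd(x_r, q, 1)
    b_tg, b_tr = 1, -1                                # the "mask" labels discussed above
    new_p = p - alpha * b_tg * p_change               # generator descends
    new_q = q + alpha * (b_tg * q_g_change + b_tr * q_r_change)   # discriminator ascends
    return new_p, new_q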

The fact that we were able to encode the Wasserstein GAN in this form in our framework is a consequence of the simple formulation of its loss function, which is effectively given by subtraction [3, Theorem 3].

4.3. Deep Dreaming: Supervised Learning of Inputs

We have seen that reparameterising the parameter port with gradient descent allows us to capture supervised parameter learning. In this section we describe how reparameterising the input port provides us with a way to enhance an input image to elicit a particular interpretation. This is the idea behind the technique called Deep Dreaming, appearing in the literature in many forms [40, 38, 22, 51].

(11) [String diagram omitted: the deep dreaming system as a parametric lens.]

Deep dreaming is a technique which uses the parameters $p$ of some trained classifier network to iteratively "dream up", or amplify, some features of a class $b$ on a chosen input $a$. For example, if we start with an image of a landscape $a_0$, a label $b$ of a "cat" and a parameter $p$ of a sufficiently well-trained classifier, we can start performing "learning" as usual: computing the predicted class for the landscape $a_0$ for the network with parameters $p$, and then computing the distance between the prediction and our label of a cat $b$. When performing backpropagation, the respective changes computed for each layer tell us how the activations of that layer should have been changed to be more "cat"-like. This includes the first (input) layer of the landscape $a_0$. Usually, we discard these changes and apply the gradient update to the parameters. In deep dreaming we instead discard the parameters and apply the gradient update to the input (see (11)). The gradient update here takes these changes and computes a new image $a_1$ which is the same landscape, but changed slightly so as to look more like whatever the network thinks a cat looks like. This is the essence of deep dreaming, where iteration of this process allows networks to dream up features and shapes on a particular chosen image [39].

Just like in the previous subsection, we can write this deep dreaming system as a map in $\mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ from $\binom{1}{1}$ to $\binom{1}{1}$ whose parameter space is $\binom{S \times A \times P \times B}{S \times A}$. In other words, this is a lens of type $\binom{S \times A \times P \times B}{S \times A} \to \binom{1}{1}$ whose $\mathsf{get}$ map is trivial. Its $\mathsf{put}$ map has the type $S \times A \times P \times B \to S \times A$ and unpacks to

\begin{align*}
\mathsf{put}(s, a, p, b_t) = U^*(s, a, a') \quad \text{where} \quad
\overline{a} &= U(s, a) \\
b_p &= f(p, \overline{a}) \\
(b'_t, b'_p) &= R[\mathsf{loss}](b_t, b_p, \alpha(\mathsf{loss}(b_t, b_p))) \\
(p', a') &= R[f](p, \overline{a}, b'_p)
\end{align*}

We note that deep dreaming is usually presented without any loss function, as a maximisation of a particular activation in the last layer of the network output [51, Section 2]. This maximisation is done with gradient ascent, as opposed to gradient descent. However, this is just a special case of our framework where the loss function is the dot product (Example 3.2). The choice of the particular activation is encoded as a one-hot vector, and the loss function in that case essentially masks the network output, leaving active only the particular chosen activation. The final component is the gradient ascent: this is simply recovered by choosing a positive, instead of a negative, learning rate [51]. We explicitly unpack this in the following example.

Example (Deep dreaming, dot product loss, basic gradient update). Fix $\mathbf{Smooth}$ as the base category, a parametric map $(\mathbb{R}^p, f) : \mathbf{Para}(\mathbf{Smooth})(\mathbb{R}^a, \mathbb{R}^b)$, the dot product loss (Example 3.2), basic gradient update (Example 4), and a positive learning rate $\alpha : \mathbb{R}$. Then the above $\mathsf{put}$ map simplifies. Since there is no state, its type reduces to $A \times P \times B \to A$ and its implementation to

\[
\mathsf{put}(a, p, b_t) = a + a' \quad \text{where} \quad (p', a') = R[f](p, a, \alpha \cdot b_t).
\]

Like in Example 4.1, this update can be rewritten as

\[
\mathsf{put}(a, p, b_t) = a + \alpha \cdot (R[f](p, a, b_t); \pi_1)
\]

making a few things apparent. This update does not depend on the prediction $f(p, a)$: no matter what the network has predicted, the goal is always to maximise particular activations. Which activations? The ones chosen by $b_t$. When $b_t$ is a one-hot vector, this picks out the activation of just one class to maximise, which is often done in practice.
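A minimal sketch of this deep dreaming step, iterated a fixed number of times, might look as follows; $f$, $R[f]$ and the (positive) learning rate are assumed as in the previous examples, and the function names are ours.

def dream_step(a, p, b_t, f, Rf, alpha=0.1):
    _, a_change = Rf(p, a, b_t)            # only the input component of R[f] is used
    return a + alpha * a_change            # positive alpha: gradient ascent on the input

def dream(a0, p, b_t, f, Rf, steps=100, alpha=0.1):
    a = a0
    for _ in range(steps):                 # iterate the update on the image, not the parameters
        a = dream_step(a, p, b_t, f, Rf, alpha)
    return a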

While we present only the most basic example, there is plenty of room left for exploration. The work of [51, Section 2] adds an extra regularisation term to the image. In general, the neural network $f$ is sometimes changed to copy a number of internal activations which are then exposed on the output layer. Maximising all these activations often produces more visually appealing results. In the literature we did not find an example which uses Softmax cross entropy (Example 3.2) as a loss function in deep dreaming, which seems like the more natural choice in this setting. Furthermore, while deep dreaming commonly uses basic gradient descent, there is nothing preventing the use of any of the optimiser lenses discussed in the previous section, or even doing deep dreaming in the context of Boolean circuits. Lastly, the learning iteration described at the end of the previous subsection can be modelled here in an analogous way.

5. Implementation

We provide a proof-of-concept implementation as a Python library; full usage examples, source code, and experiments can be found at [19]. We demonstrate the correctness of our library empirically using a number of experiments implemented both in our library and in Keras [12], a popular framework for deep learning. For example, one experiment is a model for the MNIST image classification problem [37]: we implement the same model in both frameworks and achieve comparable accuracy. Note that despite similarities between the user interfaces of our library and of Keras, a model in our framework is constructed as a composition of parametric lenses. This is fundamentally different to the approach taken by Keras and other existing libraries, and highlights how our proposed algebraic structures naturally guide programming practice.

In summary, our implementation demonstrates the advantages of our approach. Firstly, computing the gradients of the network is greatly simplified through the use of lens composition. Secondly, model architectures can be expressed in a principled, mathematical language: as morphisms of a monoidal category. Finally, the modularity of our approach makes it easy to see how various aspects of training can be modified: for example, one can define a new optimisation algorithm simply by defining an appropriate lens. We now give a brief sketch of our implementation.

5.1. Constructing a Model with Lens and Para

We model a lens $\binom{f}{f^*}$ in our library with the Lens class, which consists of a pair of maps fwd and rev corresponding to $f$ and $f^*$, respectively. For example, we write the identity lens $\binom{1_A}{\pi_2}$ as follows:

identity = Lens(lambda x: x, lambda x_dy: x_dy[1])

The composition (in diagrammatic order) of Lens values f and g is written f >> g, and monoidal composition as f @ g. Similarly, the type of 𝐏𝐚𝐫𝐚𝐏𝐚𝐫𝐚\mathbf{Para}bold_Para maps is modeled by the Para class, with composition and monoidal product written the same way. Our library provides several primitive Lens and Para values.
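For illustration, one possible way such a Lens class could be realised is sketched below. This is our own minimal reconstruction for exposition, not the library's actual source; it covers only Lens (not Para) and assumes, as in the identity lens above, that rev takes an (input, change) pair.

class Lens:
    """A lens is a pair of maps: fwd : A -> B and rev : (A, B') -> A'."""
    def __init__(self, fwd, rev):
        self.fwd, self.rev = fwd, rev

    def __rshift__(self, other):          # sequential composition, written f >> g
        return Lens(
            lambda x: other.fwd(self.fwd(x)),
            lambda x_dz: self.rev((x_dz[0],
                                   other.rev((self.fwd(x_dz[0]), x_dz[1])))),
        )

    def __matmul__(self, other):          # monoidal (parallel) product, written f @ g
        return Lens(
            lambda xy: (self.fwd(xy[0]), other.fwd(xy[1])),
            lambda xy_dz: (self.rev((xy_dz[0][0], xy_dz[1][0])),
                           other.rev((xy_dz[0][1], xy_dz[1][1]))),
        )

# Sanity check against the identity lens above:
identity = Lens(lambda x: x, lambda x_dy: x_dy[1])
assert (identity >> identity).rev((3, 1.0)) == 1.0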

Let us now see how to construct a single layer neural network from the composition of such primitives. Diagrammatically, we will construct a model consisting of a single dense layer, as in Example 3.1.1.


Recall that the parameters of linear are the coefficients of a $b \times a$ matrix, and the underlying lens has as its forward map the function $(M, x) \mapsto M \cdot x$, where $M$ is the $b \times a$ matrix whose coefficients are the $\mathbb{R}^{b \times a}$ parameters, and $x \in \mathbb{R}^a$ is the input vector. The bias map is even simpler: the forward map of the underlying lens is simply pointwise addition of inputs and parameters, $(b, x) \mapsto b + x$. Finally, the activation map simply applies a nonlinear function (e.g., sigmoid) to the input, and thus has the trivial (unit) parameter space. The representation of this composition in code is straightforward: we can simply compose the three primitive Para maps as in (6):

def dense(a, b, activation):
  return linear(a, b) >> bias(b) >> activation

Note that by constructing model architectures in this way, the computation of reverse derivatives is greatly simplified: we obtain the reverse derivative ‘for free’ as the 𝗉𝗎𝗍𝗉𝗎𝗍\mathsf{put}sansserif_put map of the model. Furthermore, adding new primitives is also simplified: the user need simply provide a function and its reverse derivative in the form of a Para map. Finally, notice also that our approach is truly compositional: we can define a hidden layer neural network with n𝑛nitalic_n hidden units simply by composing two dense layers, as follows:

dense(a, n, activation) >> dense(n, b, activation)

5.2. Learning

Now that we have constructed a model, we also need to use it to learn from data. Concretely, we will construct a full parametric lens as in Figure 2, then extract its $\mathsf{put}$ map to iterate over the dataset.

By way of example, let us see how to construct the following parametric lens, representing basic gradient descent over a single layer neural network with a fixed learning rate:

(12) [String diagram omitted: the parametric lens composing the dense model, basic gradient update, loss map, and fixed learning rate.]

This morphism is constructed essentially as below, where apply_update($\alpha$, $f$) represents the 'vertical stacking' of $\alpha$ atop $f$:

apply_update(basic_update, dense) >> loss >> learning_rate(ε)

Now, given the parametric lens of (12), one can construct a morphism $\mathsf{step} : B \times P \times A \to P$ which is simply the $\mathsf{put}$ map of the lens. Training the model then consists of iterating the $\mathsf{step}$ function over dataset examples $(x, y) \in A \times B$ to optimise some initial choice of parameters $\theta_0 \in P$, by letting $\theta_{i+1} = \mathsf{step}(y_i, \theta_i, x_i)$.
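A hedged sketch of this training loop, assuming step and a dataset of (x, y) pairs are given, could read:

def train(step, theta0, dataset, epochs=1):
    theta = theta0
    for _ in range(epochs):
        for x, y in dataset:
            theta = step(y, theta, x)     # theta_{i+1} = step(y_i, theta_i, x_i)
    return theta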

Note that our library also provides a utility function to construct $\mathsf{step}$ from its various pieces:

step = supervised_step(model, update, loss, learning_rate)

For an end-to-end example of model training and iteration, we refer the interested reader to the experiments accompanying the code [19].

6. Related Work

The work [26] is closely related to ours, in that it provides an abstract categorical model of backpropagation. However, it differs in a number of key aspects. We give a complete lens-theoretic explanation of what is back-propagated via (i) the use of CRDCs to model gradients; and (ii) the $\mathbf{Para}$ construction to model parametric functions and parameter update. We can thus go well beyond [26] in terms of examples: their example of smooth functions and basic gradient descent is covered in our Subsection 4.1.

We also explain some of the constructions of [26] in a more structured way. For example, rather than considering the category $\mathbf{Learn}$ of [26] as primitive, here we construct it as a composite of two more basic constructions (the $\mathbf{Para}$ and $\mathbf{Lens}$ constructions). This flexibility could be used, for example, to compositionally replace $\mathbf{Para}$ with a variant allowing parameters to come from a different category, or to replace lenses with the category of optics [45], enabling us to model things such as control flow using prisms.

One more relevant aspect is functoriality. We use a functor to augment a parametric map with its backward pass, just like [26]. However, they additionally augmented this map with a loss map and gradient descent using a functor as well. This added an extra condition on the partial derivatives of the loss function: it needed to be invertible in the second variable. This constraint was not justified in [26], nor is it a constraint that appears in machine learning practice. This led us to reexamine their constructions, coming up with our reformulation that does not require it. While loss maps and optimisers are mentioned in [26] as parts of the aforementioned functor, here they are extracted out and play a key role: loss maps are parametric lenses and optimisers are reparameterisations. Thus, in this paper we instead use $\mathbf{Para}$-composition to add the loss map to the model, and $\mathbf{Para}$ 2-cells to add optimisers. The mentioned inverse of the partial derivative of the loss map in the second variable was also hypothesised to be relevant to deep dreaming. We have investigated this possibility thoroughly in our paper, showing that it is the gradient update which is used to dream up pictures. We also correct a small issue in Theorem III.2 of [26]. There, the morphisms of $\mathbf{Learn}$ were defined up to an equivalence (pg. 4 of [26]) but, unfortunately, the functor defined in Theorem III.2 does not respect this equivalence relation. Our approach instead uses 2-cells which come from the universal property of $\mathbf{Para}$: a 2-cell from $(P, f) : A \to B$ to $(Q, g) : A \to B$ is a lens, and hence has two components, a map $\alpha : Q \to P$ and a map $\alpha^* : Q \times P \to Q$. By comparison, we can see the equivalence relation of [26] as being induced by the map $\alpha : Q \to P$ alone, and not by a lens. Our approach highlights the importance of the 2-categorical structure of learners. In addition, it does not treat the functor $\mathbf{Para}(\mathcal{C}) \to \mathbf{Learn}$ as primitive. In our case, this functor has the type $\mathbf{Para}(\mathcal{C}) \to \mathbf{Para}(\mathbf{Lens}(\mathcal{C}))$ and arises from applying $\mathbf{Para}$ to a canonical functor $\mathcal{C} \to \mathbf{Lens}(\mathcal{C})$ existing for any reverse derivative category, not just $\mathbf{Smooth}$. Lastly, in our paper we took advantage of the graphical calculus for $\mathbf{Para}$, redrawing many diagrams appearing in [26] in a structured way.

Other than [26], there are a few more relevant papers. The work of [21] contains a sketch of some of the ideas this paper evolved from. They are based on the interplay of optics with parameterisation, albeit framed in the setting of diffeological spaces, and requiring cartesian and local cartesian closed structure on the base category. Lenses and learners are studied in the eponymous work of [25], which observes that learners are parametric lenses. They do not explore any of the relevant $\mathbf{Para}$ or CRDC structure, but make the distinction between symmetric and asymmetric lenses, studying how they are related to the learners defined in [26]. A lens-like implementation of automatic differentiation is the focus of [24], but learning algorithms are not studied there. A relationship between the category-theoretic perspective on probabilistic modelling and gradient-based optimisation is studied in [49], which also studies a variant of the $\mathbf{Para}$ construction. Usage of Cartesian differential categories to study learning is found in [55]. They extend the differential operator to work on stateful maps, but do not study lenses, parameterisation nor update maps. The work of [27] studies deep learning in the context of Cycle-consistent Generative Adversarial Networks [62] and formalises it via free and quotient categories, making parallels to the categorical formulations of database theory [53]. They do use the $\mathbf{Para}$ construction, but do not relate it to lenses nor to reverse derivative categories. A general survey of category-theoretic approaches to machine learning, covering many of the above papers, can be found in [50]. Lastly, the concept of parametric lenses has started appearing in recent formulations of categorical game theory and cybernetics [10, 11]. The work of [10] generalises the study of parametric lenses into parametric optics and connects it to game-theoretic concepts such as Nash equilibria.

7. Conclusions and Future Directions

We have given a categorical foundation of gradient-based learning algorithms which achieves a number of important goals. The foundation is principled and mathematically clean, based on the fundamental idea of a parametric lens. The foundation covers a wide variety of examples: different optimisers and loss maps in gradient-based learning, different architectures and layer structures, different settings where gradient-based learning happens (smooth functions vs. boolean circuits), adversarial unsupervised learning, and both learning of parameters and learning of inputs (deep dreaming). Finally, the foundation is more than a mere abstraction: we have also shown how it can be used to give a practical implementation of learning, as discussed in Section 5.

There are a number of important directions which are possible to explore because of this work. One of the most exciting ones is a more comprehensive study of neural network architectures through the category-theoretic perspective. Neural network architectures have begun to be studied using category-theory-adjacent machinery in the context of Geometric Deep Learning [9] and Topological Deep Learning [42]. Recurrent neural networks, in particular, have been studied in [55], in the context of differential categories and the concept of delayed trace introduced in the same paper. Despite this, a comprehensive categorical study of architectures is still missing in the literature. As first noticed in [41], many architectures such as recurrent and recursive neural networks have close parallels to concepts in functional programming such as folds, unfolds and accumulating maps. As these functional concepts have clear categorical semantics, it is natural to ask whether those semantics can be used to study neural network architectures. We believe the categorical framework presented in this paper can serve as a natural starting point for such a study. Future work also includes modelling some classical systems, such as Support Vector Machines [17], which should be possible with the use of loss maps such as the Hinge loss.

In all our settings we have fixed an optimiser beforehand. The work of [2] describes a meta-learning approach which sees the optimiser as a neural network whose parameters and gradient update rule can themselves be learned. This is an exciting prospect: optimisers can be modelled as parametric lenses, and our framework covers learning with parametric lenses.

Future work also includes using the full power of the CRDC axioms. In particular, axioms [RD.6] and [RD.7], which deal with the behaviour of higher-order derivatives, were not exploited in our work, but they should play a role in modelling supervised learning algorithms that use higher-order derivatives (for example, the Hessian) for additional optimisations. Taking this idea in a different direction, one can see that much of our work applies to any functor of the form $F : \mathcal{C} \to \mathbf{Lens}(\mathcal{C})$: $F$ does not necessarily have to be of the form $f \mapsto (f, R[f])$ for a CRDC $R$. Moreover, by working with more general forms of the lens category (such as dependent lenses), we may be able to capture ideas related to supervised learning on manifolds. And, of course, we can vary the parameter space to endow it with different structure from the functions we wish to learn. In this vein, we wish to use fibrations/dependent types to model the use of tangent bundles: this would foster the extension of the correct-by-construction paradigm to machine learning, thereby addressing the widely acknowledged problem of trusted machine learning. These possibilities are made much easier by the compositional nature of our framework. Another key topic for future work is to link gradient-based learning with game theory. At a high level, the former takes little incremental steps to achieve an equilibrium, while the latter aims to do so in one fell swoop. Formalising this intuition is possible with our lens-based framework and the lens-based framework for game theory [28]. Finally, because our framework is quite general, in future work we plan to consider further modifications and additions to encompass probabilistic, non-gradient-based, and other forms of non-supervised learning, including genetic algorithms and reinforcement learning.


Acknowledgements Fabio Zanasi acknowledges support from EPSRC grant EP/V002376/1. Geoff Cruttwell acknowledges support from NSERC.

References

  • [1] S. Abramsky and B. Coecke. A categorical semantics of quantum protocols. In Proceedings of the 19th Annual IEEE Symposium on Logic in Computer Science, 2004., pages 415–425, 2004.
  • [2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In 30th Conference on Neural Information Processing Systems (NIPS), 2016.
  • [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
  • [4] John C. Baez and Jason Erbele. Categories in Control. Theory and Applications of Categories, 30(24):836–881, 2015.
  • [5] R. Blute, R. Cockett, and R. Seely. Cartesian Differential Categories. Theory and Applications of Categories, 22(23):622–672, 2009.
  • [6] Aaron Bohannon, J. Nathan Foster, Benjamin C. Pierce, Alexandre Pilkiewicz, and Alan Schmitt. Boomerang: Resourceful lenses for string data. SIGPLAN Not., 43(1):407–419, 2008.
  • [7] Guillaume Boisseau. String Diagrams for Optics. arXiv:2002.11480, 2020.
  • [8] Filippo Bonchi, Pawel Sobocinski, and Fabio Zanasi. The calculus of signal flow diagrams I: linear relations on streams. Inf. Comput., 252:2–29, 2017.
  • [9] Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Velickovic. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv:2104.13478, 2021.
  • [10] Matteo Capucci, Bruno Gavranovic, Jules Hedges, and E. F. Rischel. Towards foundations of categorical cybernetics. arXiv:2105.06332, 2021.
  • [11] Matteo Capucci, Neil Ghani, Jérémy Ledent, and Fredrik Nordvall Forsberg. Translating Extensive Form Games to Open Games with Agency. arXiv:2105.06763, 2021.
  • [12] François Chollet et al. Keras. https://keras.io, 2015.
  • [13] Bryce Clarke, Derek Elkins, Jeremy Gibbons, Fosco Loregian, Bartosz Milewski, Emily Pillmore, and Mario Román. Profunctor optics, a categorical update. arXiv:2001.07488, 2020.
  • [14] J. R. B. Cockett and G. S. H. Cruttwell. Differential Structure, Tangent Structure, and SDG. Applied Categorical Structures, 22(2):331–417, 2014.
  • [15] J. Robin B. Cockett, Geoff S. H. Cruttwell, Jonathan Gallagher, Jean-Simon Pacaud Lemay, Benjamin MacAdam, Gordon D. Plotkin, and Dorette Pronk. Reverse derivative categories. In Proceedings of the 28th Computer Science Logic (CSL) conference, 2020.
  • [16] Bob Coecke and Aleks Kissinger. Picturing Quantum Processes: A First Course in Quantum Theory and Diagrammatic Reasoning. Cambridge University Press, 2017.
  • [17] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [18] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. arXiv:1511.00363, 2016.
  • [19] Anonymous CRCoauthors. Numeric Optics: A python library for constructing and training neural networks based on lenses and reverse derivatives. https://github.com/anonymous-c0de/esop-2022, 2022.
  • [20] Geoffrey S. H. Cruttwell, Bruno Gavranovic, Neil Ghani, Paul W. Wilson, and Fabio Zanasi. Categorical foundations of gradient-based learning. In ESOP, volume 13240 of Lecture Notes in Computer Science, pages 1–28. Springer, 2022.
  • [21] David Dalrymple. Dioptics: a common generalization of open games and gradient-based learners. SYCO7, 2019.
  • [22] Alexey Dosovitskiy and Thomas Brox. Inverting convolutional networks with convolutional networks. arXiv:1506.02753, 2015.
  • [23] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • [24] Conal Elliott. The simple essence of automatic differentiation (differentiable functional programming made easy). arXiv:1804.00746, 2018.
  • [25] Brendan Fong and Michael Johnson. Lenses and learners. In Proceedings of the 8th International Workshop on Bidirectional transformations (Bx@PLW), 2019.
  • [26] Brendan Fong, David I. Spivak, and Rémy Tuyéras. Backprop as functor: A compositional perspective on supervised learning. In Proceedings of the Thirty fourth Annual IEEE Symposium on Logic in Computer Science (LICS 2019), pages 1–13. IEEE Computer Society Press, 2019.
  • [27] Bruno Gavranovic. Compositional deep learning. arXiv:1907.08292, 2019.
  • [28] Neil Ghani, Jules Hedges, Viktor Winschel, and Philipp Zahn. Compositional game theory. In Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’18, pages 472–481, 2018.
  • [29] Dan R. Ghica, Achim Jung, and Aliaume Lopez. Diagrammatic Semantics for Digital Circuits. arXiv:1703.10247, 2017.
  • [30] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. arXiv:1406.2661, 2014.
  • [31] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation. Society for Industrial and Applied Mathematics, 2008.
  • [32] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • [33] Jules Hedges. Limits of bimorphic lenses. arXiv:1808.05545, 2018.
  • [34] C. Hermida and R. D. Tennent. Monoidal indeterminates and categories of possible worlds. Theor. Comput. Sci., 430:3–22, 2012.
  • [35] Michael Johnson, Robert Rosebrugh, and R.J. Wood. Lenses, fibrations and universal translations. Mathematical structures in computer science, 22:25–42, 2012.
  • [36] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [37] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
  • [38] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. arXiv:1412.0035, 2014.
  • [39] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going Deeper into Neural Networks, 2015.
  • [40] Anh Mai Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. arXiv:1412.1897, 2014.
  • [41] Christopher Olah. Neural Networks, Types, and Functional Programming – colah’s blog, 2015.
  • [42] Mathilde Papillon, Sophia Sanborn, Mustafa Hajij, and Nina Miolane. Architectures of Topological Deep Learning: A Survey on Topological Neural Networks. arXiv:2304.10031, 2023.
  • [43] Robin Piedeleu and Fabio Zanasi. An introduction to string diagrams for computer scientists. arXiv:2305.08768, 2023.
  • [44] B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1 – 17, 1964.
  • [45] Mitchell Riley. Categories of optics. arXiv:1809.00738, 2018.
  • [46] Florian Schäfer and Anima Anandkumar. Competitive Gradient Descent. arXiv:1905.12103, 2020.
  • [47] Peter Selinger. Control categories and duality: on the categorical semantics of the lambda-mu calculus. Mathematical Structures in Computer Science, 11(2):207–260, 2001.
  • [48] Sanjit A. Seshia and Dorsa Sadigh. Towards verified artificial intelligence. arXiv:1606.08514, 2016.
  • [49] Dan Shiebler. Categorical Stochastic Processes and Likelihood. Compositionality, 3(1), 2021.
  • [50] Dan Shiebler, Bruno Gavranović, and Paul Wilson. Category Theory in Machine Learning. arXiv:2106.07032, 2021.
  • [51] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034, 2014.
  • [52] The Royal Society. Explainable AI: the basics - policy briefing, 2019.
  • [53] David I. Spivak. Functorial data migration. arXiv:1009.1166, 2010.
  • [54] David I. Spivak. Generalized Lens Categories via functors $\mathcal{C}^{\rm op}\to\mathsf{Cat}$. arXiv:1908.02202, 2022.
  • [55] David Sprunger and Shin-ya Katsumata. Differentiable causal computations via delayed trace. In Proceedings of the 34th Annual ACM/IEEE Symposium on Logic in Computer Science, LICS ’19. IEEE Press, 2019.
  • [56] Albert Steckermeier. Lenses in functional programming. Preprint, available at https://sinusoid.es/misc/lager/lenses.pdf, 2015.
  • [57] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147, Atlanta, Georgia, USA, 2013.
  • [58] D. Turi and G. Plotkin. Towards a mathematical operational semantics. In Proceedings of Twelfth Annual IEEE Symposium on Logic in Computer Science, pages 280–291, 1997.
  • [59] Paul Wilson and Fabio Zanasi. Reverse derivative ascent: A categorical approach to learning boolean circuits. In Proceedings of Applied Category Theory (ACT), 2020.
  • [60] Paul Wilson and Fabio Zanasi. An axiomatic approach to differentiation of polynomial circuits. Journal of Logical and Algebraic Methods in Programming, 135:100892, 2023.
  • [61] Paul W. Wilson and Fabio Zanasi. Categories of differentiable polynomial circuits for machine learning. In ICGT, volume 13349 of Lecture Notes in Computer Science, pages 77–93. Springer, 2022.
  • [62] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv:1703.10593, 2017.

Appendix A More details on Parametric Categories

As mentioned in the main text, the coherence rules for combining the two operations in (2) work as expected, in the sense that these diagrams can ultimately be ‘compiled’ down to string diagrams for monoidal categories. For example, given maps $(P, f) : A \to B$ and $(Q, g) : B \to C$ with reparametrisations $\alpha : P' \to P$ and $\beta : Q' \to Q$, one could either first reparametrise $f$ and $g$ separately and then compose the results (below left), or compose first and then reparametrise jointly (below right):

[Diagrams of the two composites described above.] (13)

As expected, translating these two operations into string diagrams for monoidal categories yields equivalent representations of the same morphism.

[String-diagram equation showing that the two representations are equal.] (14)
Remark 4.

There is a 2-categorical perspective on $\mathbf{Para}(\mathcal{C})$, which we glossed over in this paper for the sake of simplicity. In particular, the reparametrisations described above can also be seen as equipping $\mathbf{Para}(\mathcal{C})$ with 2-cells, giving a 2-categorical structure on $\mathbf{Para}(\mathcal{C})$. This is also coherent with respect to base change: if $\mathcal{C}$ and $\mathcal{D}$ are strict symmetric monoidal categories, and $F \colon \mathcal{C} \to \mathcal{D}$ is a lax symmetric monoidal functor, then there is an induced 2-functor $\mathbf{Para}(F) \colon \mathbf{Para}(\mathcal{C}) \to \mathbf{Para}(\mathcal{D})$ which agrees with $F$ on objects. This 2-functor is straightforward: for a 1-cell $(P, f) : A \to B$, it applies $F$ to $P$ and $f$ and uses the (lax) comparison to get a map of the correct type. We will see how this base change becomes important when performing backpropagation on parametric maps (Eq. 5).

Lastly, we mention that $\mathbf{Para}(\mathcal{C})$ inherits the symmetric monoidal structure from $\mathcal{C}$ and that the induced 2-functor $\mathbf{Para}(F)$ respects that structure. This will allow us to compose neural networks not only in series, but also in parallel. For more detail on alternative viewpoints on the $\mathbf{Para}$ construction, including how it can be viewed as the Grothendieck construction of a certain indexed category, see [10].
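To make the composition and reparametrisation operations concrete, here is a minimal Python sketch of parametric maps. It is illustrative only (it is not the implementation of [19], and all names are made up for this example); it treats the parameter space of a composite simply as a pair, mirroring the product of parameter objects.

    # A parametric map A -> B is a pair (P, f) with f : P x A -> B; here we keep
    # only the function and let Python tuples play the role of parameter objects.
    class Para:
        def __init__(self, f):
            self.f = f  # f(p, a) -> b

        def __rshift__(self, other):
            # Sequential composition: the composite has parameter space P x Q.
            def composed(params, a):
                p, q = params
                return other.f(q, self.f(p, a))
            return Para(composed)

        def reparam(self, alpha):
            # Reparametrisation along alpha : P' -> P.
            return Para(lambda p_new, a: self.f(alpha(p_new), a))

    affine = Para(lambda wb, x: wb[0] * x + wb[1])   # parameters (w, b)
    square = Para(lambda _unit, y: y * y)            # trivial parameter

    model = affine >> square
    print(model.f(((2.0, 1.0), ()), 3.0))  # (2*3 + 1)^2 = 49.0

    # Tie the weights: a single scalar t determines (w, b) = (t, t).
    tied = affine.reparam(lambda t: (t, t))
    print(tied.f(2.0, 3.0))                # 2*3 + 2 = 8.0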

Appendix B Background on Cartesian Reverse Differential Categories

Here we briefly review the definitions of Cartesian left additive category (CLAC), Cartesian reverse differential category (CRDC) and additive and linear maps in these categories. Note that in this appendix we follow the convention of [15] and write composition in diagrammatic order by juxtaposition of terms (rather than a semicolon) to shorten the form of many of the expressions.

Definition.

A category $\mathcal{C}$ is said to be Cartesian when there are chosen binary products $\times$, with projection maps $\pi_i$ and pairing operation $\langle -, - \rangle$, and a chosen terminal object $T$, with unique maps $!$ to the terminal object.

Definition.

A left additive category [5, Definition 1.1.1] is a category $\mathcal{C}$ in which each hom-set carries a commutative monoid structure, with addition operation $+$ and zero maps $0$, such that composition on the left preserves the additive structure: for any appropriate $f, g, h$, $f(g+h) = fg + fh$ and $f0 = 0$.
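For example, smooth maps form a left additive category: the sum of $f, g : \mathbb{R}^n \to \mathbb{R}^m$ is defined pointwise, the zero maps are the constantly-zero functions, and precomposition distributes over this sum, since $f(g+h)$ sends $x$ to $g(f(x)) + h(f(x)) = (fg)(x) + (fh)(x)$ in diagrammatic notation.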

Definition.

A map $h : X \to Y$ in a CLAC is additive if it preserves the additive structure under composition on the right: for any maps $x, y : Z \to X$, $(x+y);h = x;h + y;h$ and $0;h = 0$.

Definition.

[Additive in the second component (compare [5, Lemma 1.2.3])] A morphism $f : X \times A \to B$ is additive in the variable $A$ if it is an additive morphism of type $A \to B$ in the cartesian left-additive category $\mathbf{CoKl}(X \times -)$, the coKleisli category of the coreader comonad. (There are a few other names for this comonad. One of them is the ‘writer comonad’, though this is often confused with the writer monad, which additionally requires a monoid structure on $X$. It is also called the reader comonad, by duality with the reader monad, as well as the ‘product comonad’ or ‘environment comonad’.)
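As a quick illustration of this coKleisli category (a hedged sketch; the names are ours, not from any library), a morphism $A \to B$ of $\mathbf{CoKl}(X \times -)$ can be represented in Python as a function of two arguments, with composition feeding the same $X$ to both maps; additivity in the variable $A$ then means additivity in the second argument for each fixed first argument.

    # Morphisms A -> B in CoKl(X x -) are plain functions taking (x, a).
    def cokleisli_compose(f, g):
        # f : X x A -> B, g : X x B -> C; the composite is X x A -> C.
        return lambda x, a: g(x, f(x, a))

    f_additive = lambda x, a: 3.0 * x * a        # additive in a for each fixed x
    f_not_additive = lambda x, a: x * a * a      # not additive in a

    x, a1, a2 = 2.0, 1.5, -0.5
    print(f_additive(x, a1 + a2) == f_additive(x, a1) + f_additive(x, a2))              # True
    print(f_not_additive(x, a1 + a2) == f_not_additive(x, a1) + f_not_additive(x, a2))  # False

    h = cokleisli_compose(f_additive, f_not_additive)
    print(h(x, a1))   # f_not_additive(x, f_additive(x, a1)) = 2 * 9^2 = 162.0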

Definition.

A Cartesian left additive category [5, Definition 1.2.1] is a left additive category $\mathcal{C}$ which is Cartesian and such that all projection maps $\pi_i$ are additive.

Definition.

We call $\mathbf{CLACat}$ the category whose objects are cartesian left-additive categories and whose morphisms are cartesian left-additive functors (functors which preserve products and the commutative monoid structure on objects ([5, Definition 1.3.1])).

Lemma 5.

Let $\mathcal{C}$ and $\mathcal{D}$ be cartesian left-additive categories, and $F : \mathcal{C} \to \mathcal{D}$ a cartesian left-additive functor. Let $f : A \to B$ be an additive morphism in $\mathcal{C}$. Then $F(f) : F(A) \to F(B)$ is also additive.

The central definition of [15] is the following:

Definition.

A Cartesian reverse differential category (CRDC) is a Cartesian left additive category $\mathcal{C}$ which has, for each map $f : A \to B$ in $\mathcal{C}$, a map

$$R[f] : A \times B \to A$$

satisfying seven axioms:
[RD.1] $R[f+g] = R[f] + R[g]$ and $R[0] = 0$.
[RD.2] $\langle a, b+c\rangle R[f] = \langle a, b\rangle R[f] + \langle a, c\rangle R[f]$ and $\langle a, 0\rangle R[f] = 0$.
[RD.3] $R[1] = \pi_1$, $R[\pi_0] = \pi_1\iota_0$, and $R[\pi_1] = \pi_1\iota_1$.
[RD.4] For a tupling of maps $f$ and $g$, the following equality holds:

$$R[\langle f, g\rangle] = (1 \times \pi_0);R[f] + (1 \times \pi_1);R[g]$$

and if $!_A : A \to T$ is the unique map to the terminal object, $R[!_A] = 0$.

[RD.5] For composable maps $f$ and $g$,

$$R[fg] = \langle \pi_0, \langle \pi_0 f, \pi_1 \rangle\rangle (1 \times R[g])\, R[f]$$

[RD.6]

$$\langle 1 \times \pi_0, 0 \times \pi_1\rangle (\iota_0 \times 1)\, R[R[R[f]]]\, \pi_1 = (1 \times \pi_1)\, R[f].$$

[RD.7]

$$(\iota_0 \times 1);R[R[(\iota_0 \times 1)R[R[f]]\pi_1]];\pi_1 = \mathsf{ex};(\iota_0 \times 1)R[R[(\iota_0 \times 1)R[R[f]]\pi_1]]\pi_1$$

(where $\mathsf{ex}$ is the map that exchanges the middle two variables).

As discussed in [15], these axioms correspond to familiar properties of the reverse derivative:

  • [RD.1] says that differentiation preserves addition of maps, while [RD.2] says that differentiation is additive in its vector variable.

  • [RD.3] and [RD.4] handle the derivatives of identities, projections, and tuples.

  • [RD.5] is the (reverse) chain rule.

  • [RD.6] says that the reverse derivative is linear in its vector variable.

  • [RD.7] expresses the independence of order of mixed partial derivatives.
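For intuition, in the CRDC of smooth maps the reverse derivative is the vector-Jacobian product, $R[f](x, y) = J_f(x)^{\top} y$. The following numpy sketch (illustrative only; the Jacobians are written out by hand) computes $R[f]$ and $R[g]$ for two concrete maps and numerically checks the reverse chain rule [RD.5], which unpacks pointwise to $R[fg](x, w) = R[f](x, R[g](f(x), w))$.

    import numpy as np

    # Reverse derivatives as vector-Jacobian products: R[f](x, y) = J_f(x)^T y.

    def f(x):                      # f : R^2 -> R^2, f(x) = (x0*x1, sin x0)
        return np.array([x[0] * x[1], np.sin(x[0])])

    def R_f(x, y):                 # R[f] : R^2 x R^2 -> R^2
        J = np.array([[x[1],         x[0]],
                      [np.cos(x[0]), 0.0]])
        return J.T @ y

    def g(z):                      # g : R^2 -> R, g(z) = z0^2 + z1
        return np.array([z[0] ** 2 + z[1]])

    def R_g(z, w):                 # R[g] : R^2 x R -> R^2
        J = np.array([[2 * z[0], 1.0]])
        return J.T @ w

    def R_comp(x, w):              # R[fg], computed from the composite's Jacobian
        J = np.array([[2 * x[0] * x[1] ** 2 + np.cos(x[0]),
                       2 * x[0] ** 2 * x[1]]])
        return J.T @ w

    x, w = np.array([0.7, -1.3]), np.array([2.0])
    print(np.allclose(R_comp(x, w), R_f(x, R_g(f(x), w))))   # True: [RD.5]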

We proceed to prove Theorem 2, restated from the main text, in three steps.

The first step is formally defining the category $\mathbf{Lens}_A(\mathcal{C})$.

Definition.

Let $\mathcal{C}$ be a cartesian left-additive category. Then $\mathbf{Lens}_A(\mathcal{C})$ is a wide subcategory of $\mathbf{Lens}(\mathcal{C})$ where

$$\mathbf{Lens}_A(\mathcal{C})\left(\binom{A}{A'}, \binom{B}{B'}\right) \coloneqq \mathcal{C}(A, B) \times \mathbf{CoKl}(A \times -)_A(B', A')$$

(the subscript on the right restricting to additive morphisms, in the sense defined above). Compare this with the definition of $\mathbf{Lens}(\mathcal{C})$ via the Grothendieck construction ([54, Prop. 3.10]), where

$$\mathbf{Lens}(\mathcal{C})\left(\binom{A}{A'}, \binom{B}{B'}\right) \coloneqq \mathcal{C}(A, B) \times \mathbf{CoKl}(A \times -)(B', A')$$
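As a small worked example in the smooth setting (consistent with axiom [RD.2] above): for $f(x) = x^2$ we have $R[f] : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ given by $R[f](x, y) = 2xy$, and $R[f](x, y_1 + y_2) = 2x(y_1 + y_2) = R[f](x, y_1) + R[f](x, y_2)$ with $R[f](x, 0) = 0$; so the lens $(f, R[f])$ is a morphism of $\mathbf{Lens}_A$ of the category of smooth maps, not merely of $\mathbf{Lens}$.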

The second step is showing this category is cartesian left-additive.

Proposition 6.

The category $\mathbf{Lens}_A(\mathcal{C})$ is cartesian left-additive.

Proof B.1.

We need to equip $\mathbf{Lens}_A(\mathcal{C})$ with a commutative monoid structure on every object, in a way that is compatible with the cartesian structure. (We do not need to show that this monoid is unique, just that it exists and can be canonically defined.) That is, for every object $\binom{A}{A'}$ we need to provide two morphisms:

  • Unit $0_{\binom{A}{A'}} : \binom{1}{1} \to \binom{A}{A'}$. This is a lens whose forward map is the zero map $0_A$ and whose backward map is the delete map $!_{1 \times A'}$.

  • Multiplication $+_{\binom{A}{A'}} : \binom{A \times A}{A' \times A'} \to \binom{A}{A'}$. This is a lens whose forward map is the sum $+_A$ and whose backward map is copy, i.e. $(A \times A) \times A' \xrightarrow{\pi_2} A' \xrightarrow{\Delta_{A'}} A' \times A'$.

Additionally, these morphisms need to obey the monoid laws; this can be verified by routine calculation.
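For intuition, a small Python sketch of these two lenses, with a lens represented as a (forward, backward) pair of functions; this is illustrative only and the names are ours.

    # The monoid structure on a Lens_A object (A, A'), taking A = A' = float.
    zero_lens = (
        lambda _unit: 0.0,               # forward: 0_A : 1 -> A
        lambda _unit, _da: (),           # backward: delete, 1 x A' -> 1
    )

    add_lens = (
        lambda pair: pair[0] + pair[1],  # forward: + : A x A -> A
        lambda pair, da: (da, da),       # backward: project the A' component and copy it
    )

    print(add_lens[0]((2.0, 3.0)))       # 5.0
    print(add_lens[1]((2.0, 3.0), 0.1))  # (0.1, 0.1)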

This defines the action of $\mathbf{Lens}_A$ on the objects of $\mathbf{CLACat}$. Its action on morphisms is defined below.

Proposition 7.

Let $F : \mathcal{C} \to \mathcal{D}$ be a cartesian left-additive functor. This induces a cartesian left-additive functor $\mathbf{Lens}_A(F)$ between the corresponding categories of lenses:

[Diagram (15): $\mathbf{Lens}_A(F) : \mathbf{Lens}_A(\mathcal{C}) \to \mathbf{Lens}_A(\mathcal{D})$ sends an object $\binom{A}{A'}$ to $\binom{F(A)}{F(A')}$ and a lens $\binom{f}{f^*}$ to $\binom{F(f)}{\overline{f^*}}$.]

where $\overline{f^*} \coloneqq F(A) \times F(B') \cong F(A \times B') \xrightarrow{F(f^*)} F(A')$.

Proof B.2.

We need to prove that $\mathbf{Lens}_A(F)$ is a cartesian left-additive functor. To prove it is a functor, we need to:

  • Define its action on objects and morphisms. We have done this in Prop. 7 itself;

  • Prove additivity of $\overline{f^*}$. This follows from Lemma 5;

  • Prove identities are preserved. The identity $\binom{\text{id}_A}{\pi_2} : \binom{A}{A'} \to \binom{A}{A'}$ in the domain gets mapped to $\binom{F(\text{id}_A)}{F(\pi_2)}$. Since $F$ preserves identities and products, this is equal to the identity map on $\binom{F(A)}{F(A')}$.

  • Prove composition is preserved. This can be done by routine, albeit tedious, calculation.

To prove that it is additionally cartesian, we need to show that the image of every comonoid $(\binom{A}{A'}, !_{\binom{A}{A'}}, \Delta_{\binom{A}{A'}})$ is also a comonoid, and that all maps preserve comonoids. We can understand the first part in terms of the action on the counit and comultiplication of the comonoid.

  • Counit. The action on the counit unpacks to the pair $\binom{F(!_A)}{F(!_{A \times 1}; 0_{A'})}$. Since $F$ preserves terminal and additive maps, this morphism is equal to the counit of $\binom{F(A)}{F(A')}$.

  • Comultiplication. The action on the comultiplication unpacks to $\binom{F(\Delta_A)}{F(\pi_{2,3}; +_{A'})}$. Since $F$ preserves products and additive morphisms, this morphism is equal to the comultiplication of $\binom{F(A)}{F(A')}$.

It is routine to show that these obey the corresponding laws and form a comonoid.

For the second part we need to show that the image of every lens $\binom{f}{f^\sharp} : \binom{A}{A'} \to \binom{B}{B'}$ preserves these comonoids. For the forward part this is true because $F$ preserves products. For the backward part this is true because $F$ is left-additive.

Lastly, we need to prove that this functor is additionally left-additive. This means that it preserves the monoid $(\binom{A}{A'}, 0_{\binom{A}{A'}}, +_{\binom{A}{A'}})$ of every object. We unpack the action of $\mathbf{Lens}_A(F)$ on the unit $0_{\binom{A}{A'}}$ and the multiplication $+_{\binom{A}{A'}}$ below.

  • Unit. The action on the unit unpacks to the pair $\binom{F(0_A)}{F(!_{1 \times A'})}$. Since $F$ preserves additive and terminal maps, this morphism is equal to the unit of $\binom{F(A)}{F(A')}$;

  • Multiplication. The action on the multiplication unpacks to the pair $\binom{F(+_A)}{F(\pi_3; \Delta_{A'})}$. Since $F$ preserves the additive structure and products, this morphism is equal to the multiplication of $\binom{F(A)}{F(A')}$.

Since these monoids in the codomain are of the same form as those in the domain, it is routine to show that they obey the monoid laws. This concludes the proof that $\mathbf{Lens}_A(F)$ is a cartesian left-additive functor.

What remains to show is that $\mathbf{Lens}_A$ preserves identities and composition, which follows routinely, concluding the proof of Theorem 2.

This functor has additional structure: it is copointed. (Despite this, the functor $\mathbf{Lens}_A$ does not carry a comonad structure, for reasons similar to why tangent categories do not.)

Proposition 8 (Copointed structure of $\mathbf{Lens}_A$).

There is a natural transformation $\epsilon : \mathbf{Lens}_A \Rightarrow \text{id}_{\mathbf{CLACat}}$ whose component at a cartesian left-additive category $\mathcal{C}$ is the cartesian left-additive functor $\epsilon_{\mathcal{C}} : \mathbf{Lens}_A(\mathcal{C}) \to \mathcal{C}$ which forgets the backward part of the lens.
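In the same toy encoding of lenses as (forward, backward) pairs used in the sketches above (illustrative only), the component $\epsilon_{\mathcal{C}}$ simply discards the backward function.

    # Forget the backward part of a lens represented as a (forward, backward) pair.
    def forget(lens):
        forward, _backward = lens
        return forward

    double_lens = (lambda a: 2.0 * a, lambda a, db: 2.0 * db)
    print(forget(double_lens)(3.0))   # 6.0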

The last result, restated from the main text, states that the reverse derivative combinator induced by the functor $\mathbf{R}_{\mathcal{C}}$ satisfies the first five of the seven CRDC axioms.

Proof B.3.

We have shown how a putative reverse derivative combinator arises out of the functor $\mathbf{R}_{\mathcal{C}} : \mathcal{C} \to \mathbf{Lens}_A(\mathcal{C})$. What remains to prove is that this combinator satisfies the first five axioms of a CRDC.

  1. Additivity of reverse differentiation. This is recovered by $\mathbf{R}_{\mathcal{C}}$ preserving the left-additive structure.

  2. Additivity of the reverse derivative in the second variable. This is recovered by the definition of $\mathbf{Lens}_A$: the backward maps are additive in the second component.

  3. Coherence with identities and projections. Coherence with identities is recovered by $\mathbf{R}_{\mathcal{C}}$ preserving identities, where for every $X$ in $\mathcal{C}$, $\mathbf{R}_{\mathcal{C}}(\text{id}_X) = \text{id}_{\mathbf{R}_{\mathcal{C}}(X)} = (\text{id}_X, \pi_2 : X \times X \to X)$. Coherence with projections is recovered by $\mathbf{R}_{\mathcal{C}}$ preserving the cartesian structure.

  4. Coherence with pairings. Recovered by $\mathbf{R}_{\mathcal{C}}$ preserving the cartesian structure.

  5. Reverse chain rule. Recovered by functoriality of $\mathbf{R}_{\mathcal{C}}$.