University of St. Gallen

Addressing the Subsumption Thesis: A Formal Bridge between Microeconomics and Active Inference

Noé Kuhn
Abstract

As a unified theory of sentient behaviour, active inference is formally intertwined with multiple normative theories of optimal behaviour. Specifically, we address what we call the subsumption thesis: the claim that expected utility from economics, as an account of agency, is subsumed by active inference. To investigate this claim, we present multiple examples that challenge the subsumption thesis. To formally compare these two accounts of agency, we analyze their objective functions for MDPs and POMDPs. By imposing information-theoretic bounded rationality (ITBR) constraints on the expected utility agent, we find that the resultant agency is equivalent to that of active inference in MDPs, but slightly different in POMDPs. Rather than being strictly resolved, the subsumption thesis motivates the construction of a formal bridge between active inference and expected utility. This highlights the formal assumptions and frameworks necessary to make these disparate accounts of agency commensurable.

Keywords:
Active Inference · Expected Utility · Information-Theoretic Bounded Rationality · Microeconomics

1 Introduction

Since the middle of the previous century, expected utility has formed the bedrock of the account of agency underwriting microeconomics. With early implementations dating back to the Bernoullis in the early 18th century [37], expected utility has undergone many augmentations in order to reflect realistic deliberate decision processes. The comprehensive start of this lineage can be traced to the classic utility theorem [24]; [8], with earlier applications found in [31]. Subsequent accounts include Bayesian Decision Theory [34]; [6], Bounded Rationality [36]; [26], Prospect Theory [19], and many more flavours. The algorithmic implementation of expected utility theory is found in the Reinforcement Learning literature [3]. While seemingly disparate, practically all expected utility accounts of agency depict an agent making decisions in a probabilistic setting to attain optimal reward – to pursue utility [5].
Coming from the completely different background of neuroscience, Active Inference is a comparatively new account of agency [12], positioning itself as “a unifying perspective on action and perception […] richer than the common optimization objectives used in other formal frameworks (e.g., economic theory and reinforcement learning)” [29, pg. 1;4]. Here, the agent seeks to minimize information-theoretic surprisal expressed as free energy (see Definitions 3 and 4). Active inference allows for a realistic modeling of the very neuronal processes underwriting biological agency [27].
Given the breadth of successful applications [9] combined with its strong fundamental first principles [13], some proponents of active inference posited what we call the subsumption thesis: expected utility theory as seen in economics is subsumed by active inference – it is an edge case. A formulation in the same vein posits: “Active inference […] englobes the principles of expected utility theory […] it is theoretically possible to rewrite any RL algorithm […] as an active inference algorithm” [11]. So how does the subsumption thesis hold up against concrete examples? Is it possible to formally delineate how expected utility and active inference differ? This paper establishes a firm connection between microeconomics and active inference, a connection which has scarcely been explored before [17].
To formally compare the two accounts of agency, we require a commensurable space for agent-environment interactions: MDPs and POMDPs (Definitions 1 and 2). These agent-environment frameworks are the bread and butter of expected utility applications [4], [3], [20]. Active inference agency has more recently also been specified for the same frameworks [28], [9], [10], [11]. As such, (PO)MDPs provide a theoretical arena for the subsumption thesis to be evaluated.
What exactly is at stake that motivates this inquiry into the subsumption thesis? Firstly, expected utility and active inference rest upon different first principles to substantiate their respective accounts of agency [11]. Analysis of the formal relationship between these two accounts could provide insights into how the first principles of one account might be a specification of the other's first principles. Secondly, this inquiry will shed light on how each account handles the exploration-exploitation dilemma [7]: how should an agent balance exploring an environment against exploiting what it already knows about the environment for utility? Finally, if active inference truly subsumed expected utility, then the ramifications for welfarist economics would be enormous: currently, the formal mainstream understanding of welfare which informs economic policy [30] is based on aggregating individual agents acting according to expected utility [23, pg. 45] [32]. The subsumption thesis challenges the foundations of ‘optimal’ economic policy if expected utility only captures a sliver of ‘optimal’ behaviour.
To investigate the subsumption thesis, the rest of the paper is structured as follows. In section 2, the agent-environment frameworks are defined alongside the relevant accounts of agency and basic concepts in microeconomics. In section 3, some examples are investigated which challenge the subsumption thesis. In section 4, the formal bridge between expected utility and active inference is established via Information-Theoretic Bounded Rationality (ITBR) [26]. Finally, section 5 provides some concluding and summarizing remarks.

2 Preliminary Definitions and Microeconomics

2.1 Agent-Environment Frameworks

A finite Markov Decision Process (MDP) is a mathematical model that specifies the elements involved in agent-environment interaction and development [3]. This formalization of sequential decision making towards reward maximization originates in dynamic programming, and currently enjoys much popularity in model-based Reinforcement Learning (RL). Although potentially reductive, employing MDPs and POMDPs allows for formal commensurability between different accounts of agency.

Definition 1 (Finite Horizon MDP). An MDP is defined by the following tuple: $(\mathbb{S},\mathbb{A},P(s'|a,s),R(s',a),\gamma=1,\mathbb{T})$

  • $\mathbb{S}$ is a finite set of states.

  • $\mathbb{A}$ is a finite set of actions.

  • $P(s'|a,s)$ is the transition probability of posterior state $s'$ occurring upon the agent's selection of action $a$ in the prior state $s$.

  • $R(s',a)\in\mathbb{R}^{+}$ is the reward function taking as arguments the agent's action and resulting state. For our purposes, the action taken will be irrelevant to the resulting reward: $R(s',a)=R(s')$.

  • $\gamma$ denotes the discount factor of future rewards. This is set to $1$ as this parameter is not commonly used in the cited active inference literature.

  • $\mathbb{T}=\{1,2,\ldots,t,\ldots,\tau,\ldots,T\}$ is a finite set of discrete time periods, whereby $t<\tau$ and the horizon is $T$.

Note that time period subscripts, e.g., $s_{\tau}$, are sometimes omitted when unnecessary.

In a single-step decision problem, an expected reward-maximizing agent would evaluate the optimal action $a^{*}$ as follows:

a_{t}^{*}=\underset{a\in\mathbb{A}}{\arg\max}\ E_{P(s_{\tau}|a_{t},s_{t})}R(s_{\tau})   (1)
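To make (1) concrete, the following minimal Python sketch evaluates the expected reward of each action in a small MDP; the transition matrix and rewards are illustrative values of our own, not taken from any cited model.

```python
import numpy as np

# Hypothetical two-action, three-state MDP (illustrative values only).
# P[a, s'] holds P(s' | a, s_t) for a fixed current state s_t.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.0, 2.0])           # R(s') for each posterior state

expected_reward = P @ R                 # E_{P(s'|a,s_t)}[R(s')] per action
a_star = int(np.argmax(expected_reward))

print(expected_reward)                  # [0.9 1.3]
print("optimal action:", a_star)        # 1
```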

Further, a Partially Observable Markov Decision Process (POMDP) generalizes an MDP by introducing observations $o$ that contain incomplete information about the latent state $s$ of the environment [20, 3]. The agent can only infer latent states via observations. Thus, POMDPs are ideal for modeling action-perception cycles [13] with the cyclical causal graphical model $a\rightarrow s\rightarrow o\rightarrow\ldots$

Definition 2 (Finite Horizon POMDP). A finite horizon POMDP adds two elements to the previously given MDP tuple: $(\mathbb{O},P(o|s))$

  • $\mathbb{O}$ is a finite set of observations.

  • $P(o|s)$ is the probability of observation $o$ occurring to the agent given the state $s$.

2.2 Active Inference Agency

With the agent-environment frameworks established, we can proceed to define how an active inference agent approaches a (PO)MDP. Although fundamental and interesting, the Variational Free Energy objective crucial to perception in active inference will not be examined here; inference on latent states is assumed to occur through exact Bayesian inference [10, pg. 16]. The central objective function for agency in active inference is the Expected Free Energy (EFE), the formulation of which for (PO)MDPs we take from [9], [11], [10], [28]. Essentially, the agent takes the action trajectory $\pi=\{a_{\tau},\ldots,a_{T}\}$ that minimizes the cumulative expected free energy $G$, which is roughly the sum of the single-step EFEs $G_{\tau}$. By inferring the resultant EFE of policies through $Q(\cdot)$, the optimal trajectory $\pi^{*}$ corresponds to the most likely trajectory – the path of least action [13]. Formally:

\pi^{*}=\underset{\pi}{\arg\min}\ G(\pi)   (2)
G(\pi)\approx\sum\limits_{\tau}^{T}G_{\tau}(\pi)
G_{\tau}(\pi)=G_{\tau}(a_{\tau}),\quad a_{\tau}\in\pi

We can then define the single-step EFE for MDPs and POMDPs. Note that this could also be scaled up to trajectories/vectors of the relevant elements, e.g., $s_{t:T}$. For simplicity, we will look at single-step formulations for the remainder of the paper.

Definition 3 (EFE on MDPs). For an agent in an MDP with preference distribution $P(s|C)$, the Expected Free Energy of an action for some given current state $s_{t}$ is defined as follows:

G_{\tau}(a_{t})=D_{KL}[P(s_{\tau}|a_{t},s_{t})\,||\,P(s|C)]   (3)
=-\underbrace{\mathfrak{H}[P(s_{\tau}|a_{t},s_{t})]}_{\text{Entropy of future states}}\ \underbrace{-\,E_{P(s_{\tau}|a_{t})}[\log P(s|C)]}_{\text{Expected Surprise}}

As seen in the rearranged objective function in the second line, the agent seeks to keep future options open while meeting preferences; the entropy of future possible states is to be maximized, while the information-theoretic surprisal under the preference distribution is to be minimized. The conditionalisation on $C$ specifies a parameterized preference distribution [28].
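As a minimal numerical sketch (with an illustrative predictive distribution and preference distribution chosen by us), the two lines of (3) can be checked against each other:

```python
import numpy as np

def efe_mdp(p_next, p_pref):
    """Single-step EFE (3): KL[P(s_tau|a_t,s_t) || P(s|C)]."""
    return np.sum(p_next * (np.log(p_next) - np.log(p_pref)))

p_next = np.array([0.6, 0.3, 0.1])      # P(s_tau | a_t, s_t), illustrative
p_pref = np.array([0.2, 0.5, 0.3])      # P(s | C), illustrative

kl_form = efe_mdp(p_next, p_pref)
entropy = -np.sum(p_next * np.log(p_next))            # H[P(s_tau|a_t,s_t)]
expected_surprise = -np.sum(p_next * np.log(p_pref))  # -E[log P(s|C)]

print(np.isclose(kl_form, -entropy + expected_surprise))  # True
```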

Definition 4 (EFE in POMDPs). For an agent in a POMDP with preference distributions $P(s|C),P(o|C)$, the Expected Free Energy of an action for some given current state $s_{t}$ is defined as follows:

G_{\tau}(a_{t})=\underbrace{E_{P(s_{\tau}|o_{t},a_{t})}\mathfrak{H}[P(o_{\tau}|s_{\tau})]}_{\text{Ambiguity}}+\underbrace{D_{KL}[P(s_{\tau}|a_{t},o_{t})\,||\,P(s|C)]}_{\text{Risk}}   (4)
=-\underbrace{E_{P(o_{\tau}|a_{t})}\big[D_{KL}[P(s_{\tau}|o_{\tau})\,||\,P(s_{\tau}|a_{t})]\big]}_{\text{Intrinsic Value}}-\underbrace{E_{P(o_{\tau},s_{\tau}|a_{t})}[\log P(o|C)]}_{\text{Extrinsic Value}}

With some auxiliary assumptions [11, pg. 10], which are admissible for our purposes, the two formulations of the EFE in a POMDP are equivalent, and both contain a curiosity-inducing term and an exploitation term [15]. The first formulation motivates the agent to minimize the expected entropy of observations given unknown states and to minimize the divergence between actual states and preferred states. The second formulation motivates the agent to maximize the expected informational value of observations while also maximizing the expected log probability of preferred observations – note that the leading minus sign is not included under the brace.
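A sketch of the risk–ambiguity form in the first line of (4), for an illustrative likelihood, predictive state distribution, and preference distribution of our own choosing (computing the intrinsic–extrinsic form would additionally require the posteriors $Q(s|o)$):

```python
import numpy as np

def efe_pomdp(p_s_next, A, p_pref_s):
    """Single-step EFE, risk + ambiguity form of (4).
    p_s_next: predicted states P(s_tau | a_t, o_t)
    A:        likelihood matrix, A[o, s] = P(o | s)
    p_pref_s: preference over states P(s | C)
    """
    H_o_given_s = -np.sum(A * np.log(A), axis=0)   # H[P(o|s)] for each state
    ambiguity = np.sum(p_s_next * H_o_given_s)
    risk = np.sum(p_s_next * (np.log(p_s_next) - np.log(p_pref_s)))
    return ambiguity + risk

A = np.array([[0.9, 0.2],                          # illustrative P(o|s)
              [0.1, 0.8]])                         # columns sum to one
p_pref_s = np.array([0.8, 0.2])

for p_s_next in (np.array([0.5, 0.5]), np.array([0.9, 0.1])):
    print(efe_pomdp(p_s_next, A, p_pref_s))        # lower is better
```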

2.3 Microeconomics

As this paper investigates an intersection between fields which are generally not in direct contact, a brief introduction to risk attitudes and lotteries in microeconomics is provided. The origin of these studies can be traced back to the gambling houses of the 18th century: the famous St. Petersburg Paradox, posed as early as 1713, was resolved by Bernoulli with a marginally decreasing utility function [37]. This paradox asks what amount a rational agent would be willing to pay to enter a lottery with an infinite expected value. To answer this question, we utilize lotteries [8] and risk attitudes [1] from microeconomics:
Definition 5 (Lottery). A (monetary) lottery is a probability distribution over outcomes $x$ that are the argument of the utility function. Therefore, a lottery $L$ can be modelled as an integrable random variable defined by the probability space triplet consisting of a sample space, sigma-algebra, and probability measure: $(\Omega,\mathfrak{F},\mu)$

A decision maker then evaluates their preference over a set of lotteries according to their utility function $U(x)\in\mathbb{R}^{+}$ where $x\in\mathfrak{F}$. The expected utility of each lottery then induces a preference ordering over lotteries. For example, the strict preference relation $L_{1}\succ L_{2}$ means that lottery $L_{1}$ is preferred to lottery $L_{2}$. Classically, this ordering is in line with the von Neumann-Morgenstern axioms of completeness, transitivity, continuity, and independence [24]. Any such preference ordering is also maintained under any positive affine transformation of $U(x)$ [8]; [24]. By juxtaposing the expected utility $E[U(L)]$ of a lottery against the utility of the expectation of the same lottery, $U(E[L])$, risk aversion can be defined.
Definition 6 (Risk Aversion). An agent with some utility function $U(\cdot)$ is considered risk averse if for some lottery $L$ the following holds: $U(E[L])>E[U(L)]$

This relation occurs if an agent's utility function is concave, i.e., marginal utility is decreasing. A risk-loving agent conversely acts according to a convex utility function, and risk neutrality is associated with a linear utility function. Accordingly, Bernoulli used lotteries and a log-utility function to resolve the St. Petersburg paradox, the solution of which is relegated to Appendix A for readers unfamiliar with the problem – the pertinent point is that concave utility functions on set rewards are extensively studied in economics.
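A minimal sketch of Definition 6 with an arbitrarily chosen binary lottery: under a concave utility such as the square root, $U(E[L])$ exceeds $E[U(L)]$.

```python
import numpy as np

outcomes = np.array([0.0, 100.0])     # illustrative binary lottery
probs = np.array([0.5, 0.5])
u = np.sqrt                           # concave utility => risk aversion

u_of_expectation = u(probs @ outcomes)      # U(E[L]) = sqrt(50) ~ 7.07
expected_utility = probs @ u(outcomes)      # E[U(L)] = 5.0

print(u_of_expectation > expected_utility)  # True: the sure expectation is preferred
```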

3 Subsumption Examples

Equipped with an understanding of marginal utility and lotteries, we can now tackle two manifest exhibits of the subsumption thesis put forward by proponents of active inference. Further, an illustrative MDP demonstrates the divergence in behaviour between active inference and expected utility. The results of the simulated behaviour are taken directly from the discussed papers. These exhibits then motivate the bridge constructed in section 4.
The first [33] and second [11] exhibits both concern agency in a classical T-maze: a simple forked pathway in which the agent can either go left or right (see Figure 1 below). This environment is also called “Light” in the POMDP literature [20]. The agent-environment dynamics are modeled using a POMDP; unbeknownst to the agent, the reward is either in the left or the right arm. The agent can also go down to observe a cue indicating the definite location of the reward. Going down the ‘wrong’ arm of the fork leads to a punishment equal to the negative reward, say $-1$. The performance of the agency is evaluated by the reward attainment of the agent within a two-period horizon. At this point, however, the setups of the first and second exhibit diverge crucially.

Figure 1: An agent in a T-maze with unknown context. Illustration from [33]

In the first exhibit [33], the right and left forks are absorbing states – the agent cannot leave them upon entry. As such, the agent cannot correct going down the wrong arm in the first period by the second period. Given this setup, the expected utility agent performs very poorly, while the active inference agent is cue-seeking and therefore performs optimally [33, pg. 138]. The expected utility agent performs so poorly because supposedly “the agent does not care about the information inferred and is indifferent about going to the cue location or remaining at the central location” [33, pg. 137]. This appears reductive, as an expected utility agent facing two lotteries will behave the same as the active inference agent. Consider the risky lottery $L_{1}$, which is the result of a gambling and non-information-seeking strategy. Contrast this lottery with $L_{2}$, the degenerate lottery of investigating the cue first and going to the reward in the second period. Assuming even just a linear utility function $U(R)=R(s)$, then $U(L_{1})=0.5\cdot 1+0.5\cdot(-1)=0$ and $U(L_{2})=1\cdot 1=1$. Clearly, the expected utility agent holds a preference which motivates cue-seeking behaviour: $L_{2}\succ L_{1}$.
Regarding the second exhibit [11], there is a slight difference in the setup. The arms of the fork are no longer absorbing states, which allows for mistake correction and a cumulative reward of $2$ over two periods. The focus of [11] is no longer on performance comparison but on achieving the desiderata of risk aversion and information sensitivity [11, pg. 10]. While the agency according to active inference meets the desiderata, the expected utility agent does not. However, risk aversion and the resulting information sensitivity can easily be induced by using a concave utility function. Consider again the risky lottery $L_{1}$ and a cue-seeking lottery $L_{2}$. Assuming a utility function taking the reward as argument, $U(R)=R(s)^{c}$ where $c\in\mathbb{R}^{+}$, then $U(L_{1})=0.5\cdot 0+0.5\cdot 2^{c}$ and $U(L_{2})=1^{c}$. Accordingly, if $c<1$ then $L_{2}\succ L_{1}$, and only if $c=1$ is the agent indeed indifferent, $L_{1}\sim L_{2}$. As is evident, it is the risk-neutral agent who does not meet the desiderata.
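The lottery comparison of the second exhibit can be sketched as follows, using the power utility $U(R)=R^{c}$ on the cumulative two-period rewards described above; the helper function and labels are our own.

```python
import numpy as np

def expected_utility(outcomes, probs, c):
    """Power utility U(R) = R**c applied to cumulative rewards."""
    return float(np.sum(np.asarray(probs) * np.asarray(outcomes) ** c))

# L1: gamble immediately -> cumulative reward 2 (right arm) or 0 (wrong arm, then corrected).
# L2: check the cue first, then collect the reward -> cumulative reward 1 for sure.
L1 = ([0.0, 2.0], [0.5, 0.5])
L2 = ([1.0], [1.0])

for c in (0.5, 1.0, 2.0):
    eu1, eu2 = expected_utility(*L1, c), expected_utility(*L2, c)
    if np.isclose(eu1, eu2):
        label = "indifferent"          # risk-neutral agent, c = 1
    else:
        label = "cue-seeking" if eu2 > eu1 else "gambling"
    print(c, round(eu1, 3), round(eu2, 3), label)
```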
Finally, consider the following single-step MDP created for illustrative purposes. A paraglider stands at the foot of two steep mountains $s_{1},s_{2}$ separated by a chasm $s_{3}$ and must decide which one to climb. While still risky, the path up mountain $1$ is far more secure than the path up mountain $2$. However, mountain $2$ is taller than mountain $1$ and therefore allows for a more enjoyable flight. This decision process can aptly be modeled as an MDP (see Figure 2 below). Note that the subscripts here do not relate to the period. Taking $a_{1}$ gives $\{P(s_{1}|a_{1}),P(s_{2}|a_{1}),P(s_{3}|a_{1})\}=\{0.6,0,0.4\}$, and $a_{2}$ gives $\{P(s_{1}|a_{2}),P(s_{2}|a_{2}),P(s_{3}|a_{2})\}=\{0,0.4,0.6\}$. The height in kilometers gives the reward function $\{R(s_{1}),R(s_{2}),R(s_{3})\}=\{1,1.5,0\}$. With the MDP sufficiently specified, we can compare the agency of an active inference agent and an expected utility agent. See Appendix B for details on the resulting expected utility and free energy.

Figure 2: ‘Paraglider’ MDP with states, actions, transition probabilities, and rewards

The active inference agent is indifferent between the two actions, as both actions result in the same EFE – equation (3). For expected utility, however, only the agent with a linear utility function is indifferent; the risk-averse agent prefers the safer mountain and the risk-loving agent prefers the riskier mountain, due to the concavity or convexity of the utility function respectively. As such, this simple but valid MDP provides a setup in which suitably specified expected utility may meet the desiderata even better than the active inference agent.
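A sketch of this comparison (see Appendix B for the full details): the EFE is computed as in (3) with the preference distribution taken as a softmax over rewards, as in (17), and the expected utility agent uses $U(R)=R^{c}$; these modelling choices are ours and are meant only to be illustrative.

```python
import numpy as np

def kl(p, q):
    mask = p > 0                               # convention: 0 * log 0 = 0
    return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

# 'Paraglider' MDP of Figure 2.
P = np.array([[0.6, 0.0, 0.4],                 # P(s'|a_1)
              [0.0, 0.4, 0.6]])                # P(s'|a_2)
R = np.array([1.0, 1.5, 0.0])                  # height in km = reward

# Active inference: EFE (3) with P(s|C) = softmax(R(s)), cf. (17).
p_pref = np.exp(R) / np.exp(R).sum()
print([round(kl(P[a], p_pref), 4) for a in range(2)])   # [0.8311, 0.8311]: indifferent

# Expected utility with U(R) = R**c: concave, linear, convex.
for c in (0.5, 1.0, 2.0):
    print(c, P @ (R ** c))   # a_1 preferred for c < 1, equal at c = 1, a_2 for c > 1
```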

It should be clear by now that wrapping a utility function around the rewards is a well-studied and principled approach which differs from simply including “ad-hoc exploration bonuses in the reward function” [11, pg. 2]. Introducing non-linearity over the rewards, however, seems to lead to an impasse in the comparison between expected utility and active inference. The most direct case for comparing and subsuming expected utility [10] only considers a linear utility function ($U(\cdot)=R(\cdot)$) for the expected utility agent. Even if non-linearity for expected utility were considered in [10], it remains unclear to us how the resulting agency – more specifically the induced cautious and explorative aspects – could be compared in a generalized manner.
To resolve this issue of incommensurability, we would like to draw attention to the physical and biological constraints on agents which have motivated active inference. For example, tractability is a central concern for active inference, as evidenced by the appeal to variational Bayes. Luckily, there already exists an account of agency which imbues expected utility with such constraints: ITBR [26], [16]. The connection between ITBR and active inference in an MDP has briefly been explored before [25]. We now seek to clearly establish this conceptual bridge between microeconomics and active inference for both MDPs and POMDPs.

4 From expected utility to active inference via ITBR

4.1 In MDPs

Let us first establish the bridge between expected utility and active inference in an MDP. Essentially, both objective functions can be transformed into the “Divergence Objective” [21]:

a^{*}=\underset{a\in\mathbb{A}}{\arg\min}\ D_{KL}[P(s_{\tau}|a_{t})\,||\,P^{*}(s)]   (5)

Where $a^{*}$ is the optimal action and $P^{*}(s)$ is a preference distribution over states, for example a softmax or Gibbs distribution. Note the immediate similarity to the EFE objective function for MDPs (3) – here, conditionalisation on the current state $s_{t}$ is omitted for brevity as we consider a single step.
To get there from expected utility, we can consider the following Lagrangian constraint on the utility objective function [16, pg. 3]. Let $P(\cdot)$ be the prior distribution over the relevant elements of the MDP, and $Q(\cdot)$ the posterior distribution after a limited search or ‘bounded deliberation’; see [26] for details. The deliberation bound is given as an information-theoretic quantity, e.g., nats or bits – hence the name information-theoretic bounded rationality. Let $K\in\mathbb{R}^{+}$ nats, although the choice of information-theoretic unit is arbitrary:

D_{KL}[Q(s_{\tau}|a_{t})\,||\,P(s_{\tau}|a_{t})]\leq K   (6)

The constraint in (6) can be interpreted as a bound on the search for the optimal action: the agent is uncertain about the ‘true’ transition probabilities in the MDP and can only refine its beliefs within the deliberation bound $K$. This constraint gives us the following ITBR free energy objective function [25]:

F_{ITBR}(Q)=\sum\limits_{s}Q(s|a)\left(U(s,a)-\frac{1}{\beta}\log\frac{Q(s|a)}{P(s|a)}\right)   (7)

This functional is to be maximized with respect to $Q$, yielding $Q^{*}(s|a)$ for a given parameter $\beta\in\mathbb{R}^{+}$. See Appendix C for how the maximizing solution is derived. We can now use the maximizing argument of the objective function (7) as a ‘goal’ for the agent, i.e., a preference distribution over states $P^{*}(s)$. As in active inference, we assume that the preference distribution over states is independent of the action taken to get there. This preference is given by the Gibbs distribution:

P^{*}(s|a)=\frac{P(s|a)\cdot e^{\beta U(s,a)}}{Z_{\beta}}\rightarrow P^{*}(s)   (8)

We can now solve (8) for $U(s,a)$ and substitute this into (7) to obtain the divergence objective (5):

a^{*}=\underset{a\in\mathbb{A}}{\arg\max}\left(-D_{KL}[Q(s_{\tau}|a_{t})\,||\,P^{*}(s)]+\mathrm{constant}\right)=\underset{a\in\mathbb{A}}{\arg\min}\ D_{KL}[Q(s_{\tau}|a_{t})\,||\,P^{*}(s)]   (9)

Where the constant is irrelevant for optimization purposes. The details of this derivation are relegated to Appendix D. Evidently, the same optimal agency arises in an MDP for an active inference agent and an ITBR agent; the sketch below illustrates the key step numerically before we bridge expected utility to active inference in a POMDP.
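A minimal numerical sketch of this step, with an illustrative prior, utility function, and $\beta$ of our own choosing: the Gibbs distribution (8) maximizes the ITBR free energy (7), and substituting it back turns (7) into a negative KL divergence to $P^{*}$ plus a constant, which is the identity behind (9).

```python
import numpy as np

rng = np.random.default_rng(0)

def f_itbr(Q, P, U, beta):
    """ITBR free energy (7): E_Q[U] - (1/beta) * KL[Q || P]."""
    return np.sum(Q * (U - (1.0 / beta) * np.log(Q / P)))

P = np.array([0.5, 0.3, 0.2])      # prior P(s|a) for one fixed action, illustrative
U = np.array([1.0, 2.0, 0.5])      # utilities U(s,a), illustrative
beta = 1.0

Z = np.sum(P * np.exp(beta * U))   # partition function
P_star = P * np.exp(beta * U) / Z  # Gibbs solution (8)

for _ in range(3):
    Q = rng.dirichlet(np.ones(3))                      # random candidate posterior
    kl = np.sum(Q * np.log(Q / P_star))
    # F_ITBR(Q) = (1/beta) log Z_beta - (1/beta) KL[Q || P*]  -> True
    print(np.isclose(f_itbr(Q, P, U, beta), np.log(Z) / beta - kl / beta))
    # The Gibbs distribution attains the maximum of (7)       -> True
    print(f_itbr(P_star, P, U, beta) >= f_itbr(Q, P, U, beta))
```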

4.2 In POMDPs

Analogously to the MDP setting, we can transform the ITBR objective to arrive at the divergence objective function for POMDPs. Fortunately, this divergence objective has previously been formulated as the “Free Energy of the Expected Future” (FEEF) [22, pg. 10]. Again, this objective function motivates a minimal posterior divergence from a preference distribution, now jointly over states and observations:

a^{*}=\underset{a\in\mathbb{A}}{\arg\min}\ D_{KL}[P(o_{\tau},s_{\tau}|a_{t})\,||\,P^{*}(o,s)]   (10)

To attain this expression, we can formulate a new ITBR objective in the POMDP framework [16] and transform it analogously to the MDP case before. We again consider an information-theoretic bound $V\in\mathbb{R}^{+}$ nats:

D_{KL}[Q(s_{\tau},o_{\tau}|a_{t})\,||\,P(s_{\tau},o_{\tau}|a_{t})]\leq V   (11)

Considering this constraint, we can again express the ITBR free energy objective function:

F_{ITBR}(Q)=\sum\limits_{o,s}Q(o,s|a)\left(U(o,s,a)-\frac{1}{\beta}\log\frac{Q(o,s|a)}{P(o,s|a)}\right)   (12)

Where the solution is again the Gibbs distribution:

P^{*}(o,s|a)=\frac{P(o,s|a)\,e^{\beta U(o,s,a)}}{Z_{\beta}}   (13)

By combining (13) and (12) we get the following minimization objective, whose optimal agency is of course the same as that of the divergence minimization objective (10):

a^{*}=\underset{a\in\mathbb{A}}{\arg\max}\left(-D_{KL}[Q(o_{\tau},s_{\tau}|a_{t})\,||\,P^{*}(o,s)]+\mathrm{constant}\right)=\underset{a\in\mathbb{A}}{\arg\min}\ D_{KL}[Q(o_{\tau},s_{\tau}|a_{t})\,||\,P^{*}(o,s)]   (14)

This again intuitively motivates the agent to bring the inferred posterior distribution given the action as close as possible to the prior preference distribution over states and observations. It is crucial to note, however, that this is not the same objective function as the EFE in POMDPs (4)! To get from the divergence objective (10) for POMDPs to the EFE (4), we can follow the steps taken in [22]; for a detailed discussion of the relationship between the divergence objective and the EFE, the reader should also consult [21], [22]. Essentially, the divergence objective can also be decomposed into an exploitative and an explorative term. However, while the explorative term is equal to that of active inference, the divergence objective additionally encourages the agent to increase the posterior entropy of observations given latent states – to keep options open. Note that both objective functions below, (15) and (4), are to be minimized.

-F_{ITBR}=D_{KL}[Q(o,s|a)\,||\,P^{*}(o,s)]   (15)
=\underbrace{E_{Q(s|a)}\big[D_{KL}[Q(o|s)\,||\,P^{*}(o)]\big]}_{\text{Extrinsic Value}}-\underbrace{E_{Q(o|a)}\big[D_{KL}[Q(s|o)\,||\,Q(s|a)]\big]}_{\text{Intrinsic Value}}

G=-\underbrace{E_{Q(o,s|a)}[\log P(o|C)]}_{\text{Extrinsic Value}}-\underbrace{E_{Q(o|a)}\big[D_{KL}[Q(s|o)\,||\,Q(s|a)]\big]}_{\text{Intrinsic Value}}   (4)

Whereby the relationship between $G$ and $-F_{ITBR}$ is as follows:

G-E_{Q(s|a)}\mathfrak{H}[Q(o|s)]=-F_{ITBR}   (16)

Comparing the decomposed divergence objective to active inference, in the pursuit of extrinsic value the boundedly rational utility agent additionally seeks to keep posterior options open relative to the active inference agent – similar to the agency in an MDP.
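Relation (16) can be checked numerically: the sketch below draws an arbitrary predictive state distribution $Q(s|a)$, likelihood $Q(o|s)$, and observation preference $P(o|C)$, forms the exact Bayesian posteriors $Q(s|o)$, and compares the FEEF (15) with the EFE (4); all names and values are our own.

```python
import numpy as np

rng = np.random.default_rng(1)
kl = lambda p, q: np.sum(p * (np.log(p) - np.log(q)))

# Arbitrary POMDP pieces: 3 latent states, 4 observations.
Qs = rng.dirichlet(np.ones(3))              # Q(s|a)
A = rng.dirichlet(np.ones(4), size=3).T     # A[o, s] = Q(o|s), columns sum to one
Po_pref = rng.dirichlet(np.ones(4))         # P(o|C) = P*(o)

Qos = A * Qs                                # Q(o, s | a)
Qo = Qos.sum(axis=1)                        # Q(o | a)
Qs_given_o = Qos / Qo[:, None]              # exact Bayesian posterior Q(s | o)

intrinsic = np.sum(Qo * [kl(Qs_given_o[o], Qs) for o in range(4)])
G = -np.sum(Qos * np.log(Po_pref)[:, None]) - intrinsic                    # EFE (4)
feef = np.sum(Qs * [kl(A[:, s], Po_pref) for s in range(3)]) - intrinsic   # FEEF (15)

expected_H = np.sum(Qs * [-np.sum(A[:, s] * np.log(A[:, s])) for s in range(3)])
print(np.isclose(feef, G - expected_H))     # relation (16) holds
```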

4.3 The Bridge summarized

Let us reconsider the entire journey from expected utility to active inference so as not to lose sight of the forest for the trees. First, simply incorporate a utility function into the reward-maximizing objective function (1) to get an expected utility agent. Then, impose information-theoretic deliberation constraints on the optimization process (6). Consequently, the agent faces a Lagrangian optimization problem (7). The solution to this optimization problem is taken as a preference distribution for the agent. Combining the preference distribution and the objective function results in the divergence objective [21], which can then be compared with the active inference objective function. In an MDP, the resultant agency is exactly the same (5). However, in a POMDP, the objective functions differ (16).
Bar this difference, one key aspect must be elucidated for the extrinsic value terms in both MDPs and POMDPs. Although the intrinsic value term is the same for the different objective functions, the two prior preference distributions, $P^{*}(s)$ of ITBR (8) and $P(s|C)$ of active inference (3), are not necessarily the same. For $\beta=1$, if we consider $P(s|C)$ as a Gibbs distribution as per [10, pg. 9]; [28, pg. 134], then the two preference distributions are only equal if either the utility function is linear, or if active inference admits agent-specific utility functions (17) – an admission which prima facie seems irreconcilable with the physicalist/nonsubjectivist philosophy behind active inference. This larger discussion is, however, relegated to a later paper.

P^{*}(s)=\frac{e^{U(s)}}{\sum\limits_{s}e^{U(s)}}\quad\mathrm{and}\quad P(s|C)=\frac{e^{R(s)}}{\sum\limits_{s}e^{R(s)}}   (17)

Where optimal behaviour in an MDP, i.e., $a^{*}$, is the same for both accounts of agency only if $U(s)$ is a positive affine transformation of $R(s)$.

5 Conclusion

Having formalized the bridge from expected utility to active inference, we can re-evaluate the subsumption thesis. Simple reward-oriented agency ($U(\cdot)=R(\cdot)$) can be effectively subsumed by active inference in MDPs and, if exact Bayesian inference is used, also in POMDPs [10]. However, as shown in section 2, expected utility in microeconomics uses utility functions that take rewards as arguments. As seen in section 3, there are then various examples where the subsumption argument does not hold up; expected utility acts the same as active inference or, under specific circumstances, may meet the desiderata of agency even better. In section 4, we establish the formal bridge between expected utility and active inference. By using ITBR [26], we can directly compare the objective functions of (bounded) expected utility and active inference. Upon considering agent-environment assumptions, the divergence objective [21] is used as a reference point to compare the two accounts of agency. It is demonstrated that in an MDP, ITBR and active inference lead to the same agency [25]. In a POMDP, ITBR is equivalent to the divergence objective, which however differs from the active inference objective function [22]. While the explorative/information-seeking terms are equal, the exploitative/reward-oriented term differs: $E_{Q(s|a)}\mathfrak{H}[Q(o|s)]$ must be subtracted from the active inference objective function, and the preference distributions are not necessarily equal.
An area where expected utility cannot compete, however, is in the first principles which motivate agency [12], [13], [14], [2], [11]. Still, the debate on which objective function follows from these first principles is not yet settled in this flourishing field [22]. Perhaps more intriguing links between brain function and the physical interpretations of information theory lurk underneath the bridge established here. Furthermore, computational simulations [16] and empirical studies [35] might flesh out the practical comparison between bounded expected utility and active inference; computational efficiency has not been addressed in this paper. Finally, it would be especially interesting for economics to understand how an economy could develop from multiple ITBR or active inference agents [18]. By integrating interdisciplinary approaches to agency, we aim to foster a holistic understanding of agency that enriches the roles of both human and artificial agents in society.


5.0.1 Acknowledgements

I would like to express my immense gratitude to my supervisor for allowing me to delve into this topic and lending his support along the way. Further, I want to thank the various researchers willing to so openly discuss the contents and concepts of the paper. Only thanks to those fruitful exchanges could these connections across varying fields even be grasped.

5.0.2 Disclosure of Interests

The author has no competing interests to declare that are relevant to the content of this article.

References

  • [1] Arrow, K.J.: Essays in the Theory of Risk Bearing. Markham Publishing Co, Chicago (1971)
  • [2] Barp, A., Da Costa, L., França, G., Friston, K., Girolami, M., Jordan, M.I., Pavliotis, G.A.: Geometric methods for sampling, optimisation, inference and adaptive agents. Handbook of Statistics 46, 21–78 (2022). https://doi.org/10.48550/arXiv.2203.10592, https://doi.org/10.48550/arXiv.2203.10592, arXiv:2203.10592v3 [stat.ML]
  • [3] Barto, A., Sutton, R.S.: Reinforcement Learning: An Introduction. The MIT Press, 2nd edn. (2018)
  • [4] Bellman, R.: A markovian decision process. Journal of Mathematics and Mechanics 6, 679–684 (1957). https://doi.org/10.1512/iumj.1957.6.56038, https://doi.org/10.1512/iumj.1957.6.56038
  • [5] Bentham, J.: An Introduction to the Principles of Morals and Legislation. Batoche Books, Kitchener, 2000 edn. (1781)
  • [6] Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics, Springer-Verlag, New York, 2nd edn. (1985). https://doi.org/10.1007/978-1-4757-4286-2
  • [7] Berger-Tal, O., Nathan, J., Meron, E., Saltz, D.: The exploration-exploitation dilemma: A multidisciplinary framework. PLOS ONE 9(4), e95693 (April 2014). https://doi.org/10.1371/journal.pone.0095693
  • [8] Bonanno, G.: Decision making (2017), https://faculty.econ.ucdavis.edu/faculty/bonanno/PDF/DM_book.pdf
  • [9] Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., Friston, K.: Active inference on discrete state-spaces: a synthesis. Journal of Mathematical Psychology 102447,  36 (2021). https://doi.org/10.1016/j.jmp.2020.102447, https://doi.org/10.48550/arXiv.2001.07203, submitted on 20 Jan 2020 (v1), last revised 28 Mar 2020 (this version, v2)
  • [10] Da Costa, L., Sajid, N., Parr, T., Friston, K., Smith, R.: Reward maximisation through discrete active inference. arXiv preprint arXiv:2009.08111 v4, 18 pages (2022), https://doi.org/10.48550/arXiv.2009.08111
  • [11] Da Costa, L., Tenka, S., Zhao, D., Sajid, N.: Active inference as a model of agency. arXiv preprint arXiv:2401.12917 (2024), https://doi.org/10.48550/arXiv.2401.12917, accepted in RLDM2022 for the workshop ’RL as a model of agency’
  • [12] Friston, K.: The free-energy principle: A rough guide to the brain? Trends in Cognitive Sciences 13(7), 293–301 (July 2009). https://doi.org/10.1016/j.tics.2009.04.005
  • [13] Friston, K., Da Costa, L., Sajid, N., Heins, C., Ueltzhöffer, K., Pavliotis, G.A., Parr, T.: The free energy principle made simpler but not too simple. Physics Reports 1024, 1–29 (June 2023). https://doi.org/10.1016/j.physrep.2023.07.001
  • [14] Friston, K., Da Costa, L., Sakthivadivel, D.A., Heins, C., Pavliotis, G.A., Ramstead, M., Parr, T.: Path integrals, particular kinds, and strange things. Physics of Life Reviews 47 (2023). https://doi.org/10.1016/j.plrev.2023.08.016, https://doi.org/10.48550/arXiv.2210.12761
  • [15] Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., Pezzulo, G.: Active inference and epistemic value. COGNITIVE NEUROSCIENCE 6(4), 187–224 (2015). https://doi.org/10.1080/17588928.2015.1020053, http://dx.doi.org/10.1080/17588928.2015.1020053
  • [16] Genewein, T., Leibfried, F., Grau-Moya, J., Braun, D.A.: Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI 2,  27 (2015). https://doi.org/10.3389/frobt.2015.00027, https://doi.org/10.3389/frobt.2015.00027, this article is part of the Research Topic Theory and Applications of Guided Self-Organisation in Real and Synthetic Dynamical Systems
  • [17] Henriksen, M.: Variational free energy and economics: Optimizing with biases and bounded rationality. Frontiers in Psychology 11 (November 2020). https://doi.org/10.3389/fpsyg.2020.549187, https://doi.org/10.3389/fpsyg.2020.549187
  • [18] Hyland, D., Gavenciak, T., Da Costa, L., Heins, C., Kovarik, V., Gutierrez, J., Wooldridge, M., Kulveit, J.: Multi-agent active inference. Forthcoming Manuscript in preparation
  • [19] Kahneman, D., Tversky, A.: Prospect theory: An analysis of decision under risk. Econometrica 47(2), 263–291 (March 1979)
  • [20] Littman, M.: A tutorial on partially observable markov decision processes. Journal of Mathematical Psychology 53(2), 119–125 (2009)
  • [21] Millidge, B., Seth, A., Buckley, C.: Understanding the origin of information-seeking exploration in probabilistic objectives for control. arXiv preprint arXiv:2103.06859 (2021), https://doi.org/10.48550/arXiv.2103.06859, submitted on 11 Mar 2021 (v1), last revised 24 Nov 2021 (this version, v7)
  • [22] Millidge, B., Tschantz, A., Buckley, C.L.: Whence the expected free energy? Neural Computation 33(2), 447–482 (February 2021). https://doi.org/10.1162/neco_a_01354, https://doi.org/10.1162/neco_a_01354
  • [23] Mongin, P.: A concept of progress for normative economics. Economics and Philosophy 22, 19–54 (2006). https://doi.org/10.1017/S0266267105000696
  • [24] von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ (1953)
  • [25] Ortega, P.A., Braun, D.A.: What is epistemic value in free energy models of learning and acting? a bounded rationality perspective. Cognitive Neuroscience 6(4), 215–216 (2015). https://doi.org/10.1080/17588928.2015.1051525, https://doi.org/10.1080/17588928.2015.1051525
  • [26] Ortega, P.A., Braun, D.A., Dyer, J., Kim, K.E., Tishby, N.: Information-theoretic bounded rationality. arXiv preprint arXiv:1512.06789 (2015), https://doi.org/10.48550/arXiv.1512.06789, submitted on 21 Dec 2015
  • [27] Parr, T., Markovic, D., Kiebel, S.J., Friston, K.J.: Neuronal message passing using mean-field, bethe, and marginal approximations. Scientific Reports 9(1), 1–18 (2019). https://doi.org/10.1038/s41598-019-50764-9
  • [28] Parr, T., Pezzulo, G., Friston, K.J.: Active Inference: The Free Energy Principle in Mind, Brain, and Behavior. The MIT Press (2022)
  • [29] Pezzulo, G., Parr, T., Friston, K.: Active inference as a theory of sentient behavior. Biological Psychology 186 (February 2024). https://doi.org/10.1016/j.biopsycho.2023.108741
  • [30] Pigou, A.C.: The Economics of Welfare. Macmillan and Co., Limited, London (1920)
  • [31] Ramsey, F.P.: Truth and probability. In: Braithwaite, R.B. (ed.) The Foundations of Mathematics and Other Logical Essays, chap. VII, pp. 156–198. Kegan, Paul, Trench, Trubner & Co. and Harcourt, Brace and Company, London and New York (1931), originally published in 1926
  • [32] Ross, D.: Philosophy of Economics. Palgrave Philosophy Today, Palgrave Macmillan, London, 1st edn. (2014). https://doi.org/10.1057/9781137318756
  • [33] Sajid, N., Da Costa, L., Parr, T., Friston, K.: Active inference, bayesian optimal design, and expected utility. In: Cogliati Dezza, I., Schulz, E., Wu, C.M. (eds.) The Drive for Knowledge: The Science of Human Information Seeking, pp. 124–146. Cambridge University Press, Cambridge (2022)
  • [34] Savage, L.J.: The Foundations of Statistics. Dover Publications, Inc., New York, N.Y., revised and enlarged edn. (1972), originally published by John Wiley & Sons in 1954
  • [35] Schwartenbeck, P., FitzGerald, T.H.B., Dolan, R.J., Friston, K.J.: Evidence for surprise minimization over value maximization in choice behavior. Scientific Reports 5, 16575 (2015). https://doi.org/10.1038/srep16575
  • [36] Simon, H.A.: Models of Man. John Wiley & Sons, New York (1957)
  • [37] Szpiro, G.G.: Risk, Choice, and Uncertainty: Three Centuries of Economic Decision-Making. Columbia University Press, New York (2020)

Appendix

A: Resolving the St. Petersburg Paradox

Consider a lottery on the outcome of a fair coin toss. Starting at two dollars, the stake doubles with every subsequent outcome of heads. The game ends once tails comes up for the first time in the sequence. The expected payout $E[L]$ of the game is thus infinite:

\[ E[L]=\sum_{i=1}^{\infty}\frac{1}{2^{i}}\cdot 2^{i}=\infty \]

How much would someone pay to participate in this game? Under a linear utility function on the payout, the gambler should be willing to pay any amount to enter the game. Daniel Bernoulli instead suggested a logarithmic utility function $U(x)=\ln(x)$. The expected utility of the lottery then has a finite value, which bounds the entry cost the agent would at most be willing to pay:

\[ E[U(L)]=\sum_{i=1}^{\infty}\frac{1}{2^{i}}\cdot\ln(2^{i})=2\ln(2) \]

Therefore the agent expects finite utility from the payout of the lottery due to the concavity of the utility function. As such, only a finite amount will be paid to enter the game.
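As a quick sanity check, the following minimal Python sketch (ours, not part of the original argument) truncates the two series above: the expected payout grows without bound with the number of terms, while the expected log-utility converges to $2\ln(2)\approx 1.386$, which under $U(x)=\ln(x)$ and zero initial wealth corresponds to a sure payment of four dollars.

```python
import math

# Truncate the St. Petersburg sums to n terms: E[payout] diverges linearly,
# while E[ln(payout)] converges to 2*ln(2).
def truncated_sums(n_terms):
    expected_payout = sum((1 / 2**i) * 2**i for i in range(1, n_terms + 1))
    expected_log_utility = sum((1 / 2**i) * math.log(2**i) for i in range(1, n_terms + 1))
    return expected_payout, expected_log_utility

for n in (10, 30, 60):
    payout, utility = truncated_sums(n)
    print(f"n={n:>2}: E[payout]={payout:5.1f}, E[ln(payout)]={utility:.6f}")

print("2*ln(2)              =", 2 * math.log(2))
print("certainty equivalent =", math.exp(2 * math.log(2)))  # 4 dollars
```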

B: Expected Utility and Active Inference for the ’Paraglider’ MDP

The single-step MDP is specified as follows; note that the subscripts index states and actions rather than time periods:

\begin{align*}
\mathbb{S} &= \{s_{1},s_{2},s_{3}\}\\
\mathbb{A} &= \{a_{1},a_{2}\}\\
\{P(s_{1}|a_{1}),P(s_{2}|a_{1}),P(s_{3}|a_{1})\} &= \{0.6,\,0,\,0.4\}\\
\{P(s_{1}|a_{2}),P(s_{2}|a_{2}),P(s_{3}|a_{2})\} &= \{0,\,0.4,\,0.6\}\\
\{R(s_{1}),R(s_{2}),R(s_{3})\} &= \{1,\,1.5,\,0\}
\end{align*}

Consider an expected utility agent with utility function $U(R(s))=R(s)^{c}$ where $c\in\mathbb{R}^{+}$. As such,

E[U(a1)]=0.61c𝐸delimited-[]𝑈subscript𝑎10.6superscript1𝑐\displaystyle E[U(a_{1})]=0.6\cdot 1^{c}italic_E [ italic_U ( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ] = 0.6 ⋅ 1 start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT
E[U(a2)]=0.41.5c𝐸delimited-[]𝑈subscript𝑎20.4superscript1.5𝑐\displaystyle E[U(a_{2})]=0.4\cdot 1.5^{c}italic_E [ italic_U ( italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ] = 0.4 ⋅ 1.5 start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT
For c<1argmaxa𝔸E[U(a)]=a1For c<1𝑎𝔸𝐸delimited-[]𝑈𝑎subscript𝑎1\displaystyle\text{For $c<1$}\rightarrow\underset{a\in\mathbb{A}}{\arg\max}\ E% [U(a)]=a_{1}For italic_c < 1 → start_UNDERACCENT italic_a ∈ blackboard_A end_UNDERACCENT start_ARG roman_arg roman_max end_ARG italic_E [ italic_U ( italic_a ) ] = italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT
For c>1argmaxa𝔸E[U(a)]=a2For c>1𝑎𝔸𝐸delimited-[]𝑈𝑎subscript𝑎2\displaystyle\text{For $c>1$}\rightarrow\underset{a\in\mathbb{A}}{\arg\max}\ E% [U(a)]=a_{2}For italic_c > 1 → start_UNDERACCENT italic_a ∈ blackboard_A end_UNDERACCENT start_ARG roman_arg roman_max end_ARG italic_E [ italic_U ( italic_a ) ] = italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT

So a risk-averse expected utility agent will scale the smaller but safer mountain.

The active inference agent, however, is indifferent between the two actions. If we assume the preference distribution to be a softmax on the rewards, then we can ignore the normalizing denominator, as it is constant w.r.t. the action. Therefore we can write the relevant objective function as:

G(at)=sP(sτ|at)R(sτ)sP(sτ|at)log1P(sτ|at)𝐺subscript𝑎𝑡subscript𝑠𝑃conditionalsubscript𝑠𝜏subscript𝑎𝑡𝑅subscript𝑠𝜏subscript𝑠𝑃conditionalsubscript𝑠𝜏subscript𝑎𝑡𝑙𝑜𝑔1𝑃conditionalsubscript𝑠𝜏subscript𝑎𝑡\displaystyle G(a_{t})=-\sum\limits_{s}P(s_{\tau}|a_{t})\cdot R(s_{\tau})-\sum% \limits_{s}P(s_{\tau}|a_{t})log\frac{1}{P(s_{\tau}|a_{t})}italic_G ( italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = - ∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT | italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) ⋅ italic_R ( italic_s start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT ) - ∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT | italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_l italic_o italic_g divide start_ARG 1 end_ARG start_ARG italic_P ( italic_s start_POSTSUBSCRIPT italic_τ end_POSTSUBSCRIPT | italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG
G(a1)=0.60.30650.366=G(a2)𝐺subscript𝑎10.60.30650.366𝐺subscript𝑎2\displaystyle G(a_{1})=-0.6-0.3065-0.366=G(a_{2})italic_G ( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) = - 0.6 - 0.3065 - 0.366 = italic_G ( italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )
argmina𝔸G(a)={a1,a2}absent𝑎𝔸𝐺𝑎subscript𝑎1subscript𝑎2\displaystyle\rightarrow\ \underset{a\in\mathbb{A}}{\arg\min}G(a)=\{a_{1},a_{2}\}→ start_UNDERACCENT italic_a ∈ blackboard_A end_UNDERACCENT start_ARG roman_arg roman_min end_ARG italic_G ( italic_a ) = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT }

Therefore the optimal action of the risk-averse expected utility agent is contained in the set of actions that are optimal for the active inference agent.
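A minimal Python sketch (ours; variable names are illustrative) reproduces both computations above, taking the objective $G$ as written, i.e. negative expected reward minus the entropy of $P(s|a)$:

```python
import math

# 'Paraglider' MDP from above: transition probabilities P(s|a) and rewards R(s).
P = {"a1": [0.6, 0.0, 0.4],
     "a2": [0.0, 0.4, 0.6]}
R = [1.0, 1.5, 0.0]

def expected_utility(a, c):
    """E[U(a)] with the power utility U(R) = R**c."""
    return sum(p * r**c for p, r in zip(P[a], R))

def G(a):
    """Negative expected reward minus the entropy of P(s|a)."""
    expected_reward = sum(p * r for p, r in zip(P[a], R))
    entropy = -sum(p * math.log(p) for p in P[a] if p > 0)
    return -expected_reward - entropy

for c in (0.5, 2.0):  # risk-averse vs. risk-seeking exponent
    best = max(P, key=lambda a: expected_utility(a, c))
    print(f"c = {c}: argmax E[U(a)] = {best}")

print({a: round(G(a), 4) for a in P})  # equal values -> indifference
```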

C: Preference Distribution Derivation

We maximize the ITBR objective function (7) via the first-order condition.

\[ \frac{\delta F_{ITBR}}{\delta Q(s|a)} = U(s,a)-\frac{1}{\beta}\left(\log\frac{Q(s|a)}{P(s|a)}+1\right)\stackrel{!}{=}0 \]

Solve for $Q(s|a)$ and normalize to obtain the Gibbs distribution:

\begin{align*}
Q(s|a) &= P(s|a)e^{\beta U(s,a)-1}\propto P(s|a)e^{\beta U(s,a)}\\
Q^{*}(s|a) &= \frac{P(s|a)e^{\beta U(s,a)}}{\sum_{s}P(s|a)e^{\beta U(s,a)}}=\frac{P(s|a)e^{\beta U(s,a)}}{Z_{\beta}}
\end{align*}

This gives us (8).
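For intuition, here is a minimal Python sketch (ours) that evaluates the Gibbs distribution (8) on the ’Paraglider’ MDP of Appendix B, taking $U(s,a)=R(s)$ as an illustrative choice of utility:

```python
import math

# Gibbs preference distribution Q*(s|a) proportional to P(s|a) * exp(beta * U(s,a)), eq. (8).
P = {"a1": [0.6, 0.0, 0.4],
     "a2": [0.0, 0.4, 0.6]}
U = [1.0, 1.5, 0.0]  # U(s,a) = R(s), independent of the action here

def gibbs(a, beta):
    weights = [p * math.exp(beta * u) for p, u in zip(P[a], U)]
    Z_beta = sum(weights)  # the normalizer Z_beta
    return [w / Z_beta for w in weights]

for beta in (0.0, 1.0, 10.0):
    print(beta, {a: [round(q, 3) for q in gibbs(a, beta)] for a in P})
# beta -> 0 recovers the prior P(s|a); large beta concentrates on high-utility states.
```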

D: Getting from ITBR to the Divergence Objective via the Gibbs Distribution

Solve (8) for $U(s)$:

\begin{align*}
P^{*}(s|a) &= \frac{P(s|a)e^{\beta U(s)}}{Z_{\beta}}\\
\frac{1}{\beta}\ln\!\left(\frac{P^{*}(s|a)\cdot Z_{\beta}}{P(s|a)}\right) &= U(s)
\end{align*}

Plug this into the ITBR objective function (12) and consider the maximizing argument $a$:

\begin{align*}
&\underset{a\in\mathbb{A}}{\arg\max}\ \frac{1}{\beta}E_{Q(s|a)}\!\left[\ln P^{*}(s|a)+\ln(Z_{\beta})-\ln P(s|a)\right]-\frac{1}{\beta}E_{Q(s|a)}\!\left[\ln\frac{Q(s|a)}{P(s|a)}\right]\\
&=\underset{a\in\mathbb{A}}{\arg\max}\ E_{Q(s|a)}\!\left[\ln P^{*}(s|a)+\ln(Z_{\beta})-\ln Q(s|a)\right]\\
&=\underset{a\in\mathbb{A}}{\arg\min}\ E_{Q(s|a)}\!\left[\ln Q(s|a)-\ln P^{*}(s|a)\right]-\ln(Z_{\beta})\\
&=\underset{a\in\mathbb{A}}{\arg\min}\ D_{KL}\!\left[Q(s|a)\,||\,P^{*}(s|a)\right]
\end{align*}

This is the divergence objective for MDPs (5); the term $\ln(Z_{\beta})$ is a constant and does not affect the minimizing action. In a POMDP setting, the derivation proceeds analogously to obtain the Free Energy of the Expected Future (10).
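A generic Python sketch (ours) of the final divergence objective: given any predictive distribution $Q(s|a)$ and preference distribution $P^{*}(s|a)$ over the same states, pick the action that minimizes $D_{KL}[Q(s|a)||P^{*}(s|a)]$. How $Q$ and $P^{*}$ are specified, e.g. via the Gibbs distribution (8), is left open here; the example distributions below are hypothetical.

```python
import math

def kl_divergence(q, p):
    """D_KL[q || p] for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def divergence_objective(Q, P_star):
    """Return the action minimizing D_KL[Q(s|a) || P*(s|a)]."""
    return min(Q, key=lambda a: kl_divergence(Q[a], P_star[a]))

# Hypothetical example distributions over three states for two actions.
Q      = {"a1": [0.6, 0.0, 0.4], "a2": [0.0, 0.4, 0.6]}
P_star = {"a1": [0.8, 0.1, 0.1], "a2": [0.1, 0.8, 0.1]}
print(divergence_objective(Q, P_star))  # -> "a1"
```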