== A dynamic decision problem ==

Let x_t be the state at time t. For a decision that begins at time 0, we take as given the initial state x_0. At any time, the set of possible actions depends on the current state; we express this as a_{t} \in \Gamma (x_t), where a particular action a_t represents particular values for one or more control variables, and \Gamma (x_t) is the set of actions available to be taken at state x_t. It is also assumed that the state changes from x to a new state T(x,a) when action a is taken, and that the current payoff from taking action a in state x is F(x,a). Finally, we assume impatience, represented by a
discount factor 0 < \beta < 1. Under these assumptions, an infinite-horizon decision problem takes the following form:

: V(x_0) \; = \; \max_{ \left \{ a_{t} \right \}_{t=0}^{\infty} } \sum_{t=0}^{\infty} \beta^t F(x_t,a_{t}),

subject to the constraints

: a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t = 0, 1, 2, \dots

Notice that we have defined the notation V(x_0) to denote the optimal value that can be obtained by maximizing this objective function subject to the assumed constraints. This function is the
value function. It is a function of the initial state variable x_0, since the best value obtainable depends on the initial situation.
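To make the notation concrete, the primitives \Gamma, T, F and \beta can be written down directly for a small finite problem. The following sketch uses a hypothetical "cake-eating" example (the problem, names and numbers are illustrative only, not part of the formulation above): the state x is the amount of cake remaining, the action a is how much to eat this period, and the code evaluates the discounted objective \sum_{t} \beta^t F(x_t,a_t) for one feasible plan.

<syntaxhighlight lang="python">
import math

beta = 0.9  # discount factor, 0 < beta < 1 (illustrative value)

def Gamma(x):
    """Feasible actions in state x: eat a whole number of the remaining x units."""
    return range(x + 1)

def T(x, a):
    """State transition: whatever is not eaten remains next period."""
    return x - a

def F(x, a):
    """Current payoff from taking action a in state x (here it depends only on a)."""
    return math.sqrt(a)

def discounted_payoff(x0, plan):
    """Evaluate sum_t beta^t F(x_t, a_t) for a given feasible action plan."""
    x, total = x0, 0.0
    for t, a in enumerate(plan):
        assert a in Gamma(x), "plan must satisfy a_t in Gamma(x_t)"
        total += beta ** t * F(x, a)
        x = T(x, a)
    return total

# One feasible (but not necessarily optimal) plan starting from x_0 = 10
print(discounted_payoff(10, [4, 3, 2, 1, 0]))
</syntaxhighlight>

The value function V(x_0) is, by definition, the largest number obtainable this way over all feasible plans.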
== Bellman's principle of optimality ==

The dynamic programming method breaks this decision problem into smaller subproblems. Bellman's
principle of optimality describes how to do this:

Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. (See Bellman, 1957, Chap. III.3.)

In computer science, a problem that can be broken apart like this is said to have
optimal substructure. In the context of dynamic
game theory, this principle is analogous to the concept of
subgame perfect equilibrium, although what constitutes an optimal policy in this case is conditioned on the decision-maker's opponents choosing similarly optimal policies from their points of view. As suggested by the
principle of optimality, we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state x_1). Collecting the future decisions in brackets on the right, the above infinite-horizon decision problem is equivalent to:

: \max_{ a_0 } \left \{ F(x_0,a_0) + \beta \left[ \max_{ \left \{ a_{t} \right \}_{t=1}^{\infty} } \sum_{t=1}^{\infty} \beta^{t-1} F(x_t,a_{t}): a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t \geq 1 \right] \right \}

subject to the constraints

: a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0).

Here we are choosing a_0, knowing that our choice will cause the time 1 state to be x_1=T(x_0,a_0). That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the square brackets on the right.
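This equivalence can be checked by brute force on a small truncated version of the problem. In the sketch below (again using the hypothetical cake-eating primitives, with the infinite horizon cut off after a few periods purely so that enumeration is feasible), the optimal value computed by searching over whole plans coincides with the value computed by choosing a_0 first and then solving the remaining problem from x_1 = T(x_0, a_0).

<syntaxhighlight lang="python">
import math

beta = 0.9

def Gamma(x): return range(x + 1)   # feasible actions
def T(x, a):  return x - a          # transition
def F(x, a):  return math.sqrt(a)   # current payoff

def plans(x, horizon):
    """Enumerate every feasible action sequence of the given length from state x."""
    if horizon == 0:
        yield ()
        return
    for a in Gamma(x):
        for rest in plans(T(x, a), horizon - 1):
            yield (a,) + rest

def payoff(x, plan):
    """Discounted payoff sum_t beta^t F(x_t, a_t) of a feasible plan from state x."""
    total = 0.0
    for t, a in enumerate(plan):
        total += beta ** t * F(x, a)
        x = T(x, a)
    return total

x0, horizon = 5, 5   # small truncation so brute force stays cheap

# (i) optimize over whole plans at once
whole = max(payoff(x0, p) for p in plans(x0, horizon))

# (ii) choose a_0 first, then solve the remaining problem from x_1 = T(x_0, a_0)
split = max(
    F(x0, a0) + beta * max(payoff(T(x0, a0), p) for p in plans(T(x0, a0), horizon - 1))
    for a0 in Gamma(x0)
)

print(whole, split)   # the two values coincide
</syntaxhighlight>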
== The Bellman equation ==

So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the square brackets on the right is
the value of the time 1 decision problem, starting from state x_1=T(x_0,a_0). Therefore, the problem can be rewritten as a
recursive definition of the value function:

:V(x_0) = \max_{ a_0 } \{ F(x_0,a_0) + \beta V(x_1) \},

subject to the constraints

: a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0).

This is the Bellman equation. It may be simplified even further if the time subscripts are dropped and the value of the next state is plugged in:

:V(x) = \max_{a \in \Gamma (x) } \{ F(x,a) + \beta V(T(x,a)) \}.

The Bellman equation is classified as a
functional equation, because solving it means finding the unknown function V, which is the
value function. Recall that the value function describes the best possible value of the objective, as a function of the state x. By calculating the value function, we will also find the function a(x) that describes the optimal action as a function of the state; this is called the
policy function.
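These objects can be computed explicitly for a small finite problem by successive approximation: start from a guess for V, apply the right-hand side of the Bellman equation repeatedly until it stops changing, and then read off the maximizing action at each state. The sketch below does this for the same hypothetical cake-eating example used earlier; the problem and its parameters are illustrative only.

<syntaxhighlight lang="python">
import math

beta = 0.9
states = range(11)                  # x in {0, 1, ..., 10}

def Gamma(x): return range(x + 1)   # feasible actions
def T(x, a):  return x - a          # transition
def F(x, a):  return math.sqrt(a)   # current payoff

# Successive approximation of V(x) = max_{a in Gamma(x)} { F(x,a) + beta V(T(x,a)) }
V = {x: 0.0 for x in states}        # initial guess
for _ in range(500):
    V_new = {x: max(F(x, a) + beta * V[T(x, a)] for a in Gamma(x)) for x in states}
    error = max(abs(V_new[x] - V[x]) for x in states)
    V = V_new
    if error < 1e-10:
        break

# Policy function a(x): the maximizing action at each state
policy = {x: max(Gamma(x), key=lambda a: F(x, a) + beta * V[T(x, a)]) for x in states}
print(V[10], policy[10])
</syntaxhighlight>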
== In a stochastic problem ==

In the deterministic setting, other techniques besides dynamic programming can be used to tackle the above
optimal control problem. However, the Bellman equation is often the most convenient method of solving
stochastic optimal control problems. For a specific example from economics, consider an infinitely-lived consumer with initial wealth endowment {\color{Red}a_0} at period 0. They have an instantaneous
utility function u(c), where c denotes consumption, and they discount the next period's utility at a rate of 0 < \beta < 1. Assume that what is not consumed in period t carries over to the next period with interest rate r. Then the consumer's utility maximization problem is to choose a consumption plan \{{\color{OliveGreen}c_t}\} that solves

:\max \sum_{t=0}^{\infty} \beta^t u ({\color{OliveGreen}c_t})

subject to

:{\color{Red}a_{t+1}} = (1 + r) ({\color{Red}a_t} - {\color{OliveGreen}c_t}), \; {\color{OliveGreen}c_t} \geq 0,

and

:\lim_{t \rightarrow \infty} {\color{Red}a_t} \geq 0.

The first constraint is the capital accumulation/law of motion specified by the problem, while the second constraint is a
transversality condition that the consumer does not carry debt at the end of their life. The Bellman equation is

:V(a) = \max_{ 0 \leq c \leq a } \{ u(c) + \beta V((1+r) (a - c)) \}.

Alternatively, one can treat the sequence problem directly using, for example, the Hamiltonian equations.
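As a rough illustration of how this Bellman equation might be solved numerically, the sketch below applies value function iteration on a grid of asset levels. It assumes a particular utility function u(c) = \log c, illustrative values for \beta and r, and the simplification that next-period assets must lie on the grid; none of these choices come from the problem statement above.

<syntaxhighlight lang="python">
import math

beta, r = 0.95, 0.05                          # illustrative parameter values
u = math.log                                  # illustrative utility, u(c) = log c

# Grid of asset levels; next-period assets a' are restricted to grid points.
grid = [0.2 * (i + 1) for i in range(50)]     # a in {0.2, 0.4, ..., 10.0}

V = [0.0] * len(grid)                         # initial guess for V(a)
for _ in range(2000):
    V_new = []
    for a in grid:
        best = -float("inf")
        for j, a_next in enumerate(grid):
            c = a - a_next / (1 + r)          # consumption implied by a' = (1+r)(a-c)
            if c > 0:
                best = max(best, u(c) + beta * V[j])
        V_new.append(best)
    error = max(abs(x - y) for x, y in zip(V, V_new))
    V = V_new
    if error < 1e-8:
        break

# Optimal consumption at the largest asset level on the grid
a = grid[-1]
best_j = max((j for j in range(len(grid)) if a - grid[j] / (1 + r) > 0),
             key=lambda j: u(a - grid[j] / (1 + r)) + beta * V[j])
print("c(%.1f) = %.3f" % (a, a - grid[best_j] / (1 + r)))
</syntaxhighlight>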
Now, if the interest rate varies from period to period, the consumer is faced with a stochastic optimization problem. Let the interest rate r follow a
Markov process with probability transition function Q(r, d\mu_r) where d\mu_r denotes the
probability measure governing the distribution of the interest rate next period if the current interest rate is r. In this model the consumer decides their current period consumption after the current period interest rate is announced. Rather than simply choosing a single sequence \{{\color{OliveGreen}c_t}\}, the consumer now must choose a sequence \{{\color{OliveGreen}c_t}\} for each possible realization of \{r_t\} in such a way that their lifetime expected utility is maximized:

:\max_{ \left \{ c_{t} \right \}_{t=0}^{\infty} } \mathbb{E}\bigg( \sum_{t=0}^{\infty} \beta^t u ({\color{OliveGreen}c_t}) \bigg).

The expectation \mathbb{E} is taken with respect to the appropriate probability measure given by
Q on the sequences of
r's. Because
r is governed by a Markov process, dynamic programming simplifies the problem significantly. Then the Bellman equation is simply

:V(a, r) = \max_{ 0 \leq c \leq a } \{ u(c) + \beta \int V((1+r) (a - c), r') Q(r, d\mu_r) \}.

Under some reasonable assumptions, the resulting optimal policy function g(a, r) is measurable.
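To make this version computable, one common simplification is to replace Q(r, d\mu_r) by a finite-state Markov chain and iterate on V(a, r) directly. The sketch below assumes a hypothetical two-state chain for the interest rate together with the same illustrative utility function and asset grid as in the deterministic sketch; it is an approximation for illustration, not the general measure-theoretic formulation.

<syntaxhighlight lang="python">
import math

beta = 0.95
u = math.log                                   # illustrative utility, u(c) = log c

# Hypothetical two-state Markov chain standing in for Q(r, d mu_r)
r_states = [0.02, 0.06]                        # low and high interest rate
P = [[0.8, 0.2],                               # P[i][j] = Pr(r' = r_states[j] | r = r_states[i])
     [0.3, 0.7]]

grid = [0.2 * (i + 1) for i in range(50)]      # asset grid; a' restricted to grid points

# V[i][k] approximates V(a = grid[k], r = r_states[i])
V = [[0.0] * len(grid) for _ in r_states]
for _ in range(2000):
    V_new = [[0.0] * len(grid) for _ in r_states]
    for i, r in enumerate(r_states):
        for k, a in enumerate(grid):
            best = -float("inf")
            for j, a_next in enumerate(grid):
                c = a - a_next / (1 + r)       # consumption implied by a' = (1+r)(a-c)
                if c > 0:
                    expected = sum(P[i][m] * V[m][j] for m in range(len(r_states)))
                    best = max(best, u(c) + beta * expected)
            V_new[i][k] = best
    error = max(abs(V_new[i][k] - V[i][k])
                for i in range(len(r_states)) for k in range(len(grid)))
    V = V_new
    if error < 1e-8:
        break

# Value at the largest asset level in the low- and high-rate states
print(V[0][-1], V[1][-1])
</syntaxhighlight>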
For a general stochastic sequential optimization problem with Markovian shocks, in which the agent is faced with their decision ex post, the Bellman equation takes a very similar form:

:V(x, z) = \max_{c \in \Gamma(x,z)} \{ F(x, c, z) + \beta \int V( T(x,c), z') d\mu_z(z') \}.

== Solution methods ==