AIXI is a reinforcement learning agent that interacts with some stochastic and unknown but computable environment \mu. The interaction proceeds in time steps, from t=1 to t=m, where m \in \mathbb{N} is the lifespan of the AIXI agent. At time step
t, the agent chooses an action a_t \in \mathcal{A} (e.g., a limb movement) and executes it in the environment, and the environment responds with a "percept" e_t \in \mathcal{E} = \mathcal{O} \times \mathbb{R}, which consists of an "observation" o_t \in \mathcal{O} (e.g., a camera image) and a reward r_t \in \mathbb{R}, distributed according to the
conditional probability \mu(o_t r_t | a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t), where a_1 o_1 r_1 ... a_{t-1} o_{t-1} r_{t-1} a_t is the "history" of actions, observations and rewards. The environment \mu is thus mathematically represented as a
probability distribution over "percepts" (observations and rewards) which depend on the
full history, so there is no
Markov assumption (in contrast to many other RL algorithms). Note again that this probability distribution is unknown to the AIXI agent, and that \mu is computable, that is, the observations and rewards received by the agent from the environment \mu can be computed by some program (which runs on a Turing machine), given the past actions of the AIXI agent.
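To make this computability requirement concrete, here is a minimal sketch in Python (the name toy_environment_program and the particular rule are invented for illustration; this is one arbitrary example of a computable environment, not part of AIXI's definition): a deterministic program that, given the actions executed so far, computes the current percept.

```python
# A toy "computable environment": a deterministic program that maps the
# sequence of actions taken so far to the current percept (o_t, r_t).
# The rule is arbitrary and deliberately non-Markov: the reward depends
# on the whole action history, not only on the latest action.

def toy_environment_program(actions):
    """Given actions a_1 ... a_t, return the percept (o_t, r_t)."""
    ones_so_far = sum(actions)                # depends on the full history
    observation = ones_so_far % 3
    reward = 1.0 if actions[-1] == 1 and ones_so_far % 2 == 0 else 0.0
    return observation, reward

# Computing the percept sequence o_1 r_1 ... o_m r_m from all actions
# a_1 ... a_m, analogous to running U(q, a_1 ... a_m) for a fixed program q:
actions = [1, 0, 1, 1]
percepts = [toy_environment_program(actions[: t + 1]) for t in range(len(actions))]
print(percepts)  # [(1, 0.0), (1, 0.0), (2, 1.0), (0, 0.0)]
```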
The only goal of the AIXI agent is to maximize \sum_{t=1}^m r_t, that is, the sum of rewards from time step 1 to m. The AIXI agent is associated with a stochastic policy \pi : (\mathcal{A} \times \mathcal{E})^* \rightarrow \mathcal{A}, which is the function it uses to choose actions at every time step, where \mathcal{A} is the space of all possible actions that AIXI can take and \mathcal{E} is the space of all possible "percepts" that can be produced by the environment. The environment (or probability distribution) \mu can also be thought of as a stochastic policy (which is a function) \mu : (\mathcal{A} \times \mathcal{E})^* \times \mathcal{A} \rightarrow \mathcal{E}, where * is the Kleene star operation.
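The interaction protocol itself can likewise be sketched in a few lines (again with invented names; the random policy below is only a stand-in, since AIXI's actual policy is defined next): an agent with a stochastic policy \pi : (\mathcal{A} \times \mathcal{E})^* \rightarrow \mathcal{A} acts, the environment responds with a percept, and the rewards are summed from time step 1 to m.

```python
import random

def random_policy(history):
    """A stand-in stochastic policy pi : (A x E)* -> A; it ignores the history."""
    return random.choice([0, 1])

def interact(policy, environment_program, m):
    """Run the agent-environment loop for m time steps; return the sum of rewards."""
    actions, history, total_reward = [], [], 0.0
    for t in range(1, m + 1):
        a_t = policy(history)                      # agent chooses an action
        actions.append(a_t)
        o_t, r_t = environment_program(actions)    # environment produces a percept
        history.append((a_t, (o_t, r_t)))          # the history in (A x E)* grows
        total_reward += r_t                        # the quantity AIXI maximizes
    return total_reward

# Reusing toy_environment_program from the sketch above:
print(interact(random_policy, toy_environment_program, m=10))
```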
In general, at time step t (which ranges from 1 to m), AIXI, having previously executed actions a_1 \dots a_{t-1} (which is often abbreviated in the literature as a_{<t}) and having observed the history of percepts o_1 r_1 \ldots o_{t-1} r_{t-1} (which can be abbreviated as e_{<t}), chooses and executes in the environment the action a_t, defined as follows:

: a_t := \arg \max_{a_t} \sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)}

or, using parentheses to disambiguate the precedences,

: a_t := \arg \max_{a_t} \left( \sum_{o_t r_t} \ldots \left( \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \left( \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} \right) \right) \right)

Intuitively, in the definition above, AIXI considers the sum of the total reward over all possible "futures" up to m - t time steps ahead (that is, from t to m), weighs each of them by the complexity of the programs q (that is, by 2^{-\textrm{length}(q)}) that are consistent with the agent's past (that is, the previously executed actions a_{<t} and received percepts e_{<t}) and that can generate that future, and then picks the action that maximizes expected future rewards.

Let us break this definition down in order to understand it fully.

o_t r_t is the "percept" (which consists of the observation o_t and the reward r_t) received by the AIXI agent at time step t from the environment (which is unknown and stochastic). Similarly, o_m r_m is the percept received by AIXI at time step m (the last time step where AIXI is active).

r_t + \ldots + r_m is the sum of rewards from time step t to time step m, so AIXI needs to look into the future to choose its action at time step t.

U denotes a
monotone universal Turing machine, and q ranges over all (deterministic) programs on the universal machine U, which receives as input the program q and the sequence of actions a_1 \dots a_m (that is, all actions), and produces the sequence of percepts o_1 r_1 \ldots o_m r_m. The universal Turing machine U is thus used to "simulate" or compute the environment responses or percepts, given the program q (which "models" the environment) and all actions of the AIXI agent: in this sense, the environment is "computable" (as stated above). Note that, in general, the program which "models" the actual environment (where AIXI needs to act) is unknown, because the actual environment is itself unknown.

\textrm{length}(q) is the length of the program q (which is encoded as a string of bits). Note that 2^{-\textrm{length}(q)} = \frac{1}{2^{\textrm{length}(q)}}. Hence, in the definition above, \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} should be interpreted as a
mixture (in this case, a sum) over all computable environments (which are consistent with the agent's past), each weighted by its complexity 2^{-\textrm{length}(q)}.

Note that a_1 \ldots a_m can also be written as a_1 \ldots a_{t-1} a_t \ldots a_m, where a_1 \ldots a_{t-1} = a_{<t} is the sequence of actions already executed in the environment by the AIXI agent. Similarly, o_1 r_1 \ldots o_m r_m = o_1 r_1 \ldots o_{t-1} r_{t-1} o_t r_t \ldots o_m r_m, where o_1 r_1 \ldots o_{t-1} r_{t-1} is the sequence of percepts produced by the environment so far.

Let us now put all these components together in order to understand this equation or definition: at time step t, AIXI chooses the action a_t at which the function \sum_{o_t r_t} \ldots \max_{a_m} \sum_{o_m r_m} [r_t + \ldots + r_m] \sum_{q:\; U(q, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\textrm{length}(q)} attains its maximum.
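Because q ranges over all programs, this expression is incomputable, so no faithful implementation exists; the structure of the nested maxima and sums can still be illustrated by restricting the mixture to a small, hand-picked finite class of toy environment programs. Everything below is such an illustration: the two candidate "programs" and their nominal bit lengths are invented, and 2^{-\textrm{length}(q)} is computed from those nominal lengths.

```python
ACTIONS = [0, 1]

# Two toy environment "programs": each deterministically maps an action
# sequence a_1 ... a_k to the percept sequence o_1 r_1 ... o_k r_k.
# The bit lengths (4 and 6) are nominal stand-ins for length(q).

def env_copy(actions):
    """Rewards repeating the previous action."""
    return [(0, 1.0 if t > 0 and a == actions[t - 1] else 0.0)
            for t, a in enumerate(actions)]

def env_ones(actions):
    """Rewards the action 1."""
    return [(0, float(a)) for a in actions]

PROGRAMS = [(env_copy, 4), (env_ones, 6)]   # (program q, nominal length(q))

def value(actions, percepts, reward_sum, t, m):
    """The nested max/sum expression: at the horizon, [r_t + ... + r_m]
    (accumulated in reward_sum) is weighted by the total prior mass
    2^-length(q) of the programs consistent with the full sequence."""
    if t > m:
        weight = sum(2.0 ** -length for program, length in PROGRAMS
                     if program(actions) == percepts)
        return reward_sum * weight
    return max(percept_sum(actions, percepts, reward_sum, a, t, m)
               for a in ACTIONS)

def percept_sum(actions, percepts, reward_sum, a, t, m):
    """Sum over the percepts o_t r_t after appending the candidate action a.
    Only percepts predicted by some program consistent with the history are
    enumerated; any other percept would contribute weight 0 anyway."""
    actions = actions + [a]
    predicted = {program(actions)[t - 1] for program, length in PROGRAMS
                 if program(actions)[: t - 1] == percepts}
    return sum(value(actions, percepts + [(o, r)], reward_sum + r, t + 1, m)
               for (o, r) in predicted)

def aixi_action(actions, percepts, t, m):
    """argmax over a_t of the nested expression: the action AIXI executes."""
    return max(ACTIONS,
               key=lambda a: percept_sum(actions, percepts, 0.0, a, t, m))

# With an empty history at t = 1 and horizon m = 3, the chosen first action:
print(aixi_action(actions=[], percepts=[], t=1, m=3))
```

In the actual definition, the class of programs is infinite and the weights come from real program lengths on the universal machine U, which is precisely what makes AIXI incomputable in general.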
== Parameters ==

The parameters to AIXI are the universal Turing machine
U and the agent's lifetime
m, which need to be chosen. The latter parameter can be removed by the use of discounting.
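For instance, with geometric discounting, a standard choice in the reinforcement learning literature (and one of the discount sequences considered in the AIXI literature), the finite-horizon sum of rewards is replaced by the infinite discounted sum

: \sum_{t=1}^{\infty} \gamma^{t-1} r_t, \qquad \gamma \in (0, 1),

which converges whenever the rewards are bounded, so that no lifespan m needs to be fixed in advance.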
== Optimality ==