:Q^{new}(S_t, A_t) \leftarrow (1 - \alpha) \, Q(S_t, A_t) + \alpha \, [R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1})]

A SARSA agent interacts with the environment and updates the policy based on actions taken, hence this is known as an
on-policy learning algorithm. The Q value for a state-action pair is updated by an error term, scaled by the learning rate α. Q values represent the possible reward received in the next time step for taking action a in state s, plus the discounted future reward received from the next state-action observation.
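The update can be read off directly from the formula above. The following Python sketch applies one SARSA step to a tabular Q function; the array layout, the ε-greedy helper, and all names here are illustrative assumptions rather than part of the original text, and the comment marks the greedy target that Q-learning, discussed below, would use instead.

<syntaxhighlight lang="python">
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: move Q(s, a) toward the on-policy target
    R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})."""
    target = r + gamma * Q[s_next, a_next]  # uses the action actually taken next
    # Q-learning would instead use the greedy target: r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

# Example: a single update on a toy 3-state, 2-action table.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
s, a, r, s_next = 0, 1, 1.0, 2
a_next = epsilon_greedy(Q, s_next, epsilon=0.1, rng=rng)  # A_{t+1} from the same policy
sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9)
</syntaxhighlight>

Because A_{t+1} is drawn from the same policy the agent acts with, the target reflects the policy actually being followed, which is what makes the method on-policy.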
Watkins's Q-learning updates an estimate of the optimal state-action value function Q^* based on the maximum reward of the available actions. While SARSA learns the Q values associated with the policy it follows itself, Watkins's Q-learning learns the Q values associated with the optimal policy while following an exploration/exploitation policy. Some optimizations of Watkins's Q-learning may be applied to SARSA.

== Hyperparameters ==