In the training part, is it assumed that the reward is 0 in all states except the terminal state? If so, how would the code change if every step in an episode yielded a reward for its state transition? Would the rewards in the code be replaced by returns in that case?