How to search the action space
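1. Greedy approach : Always take the action with the highest estimated reward at each time step.
- Cons : If a suboptimal (e.g. second-best) action happens to look best early on, the agent keeps choosing it and never corrects the mistake, because it never explores the alternatives.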
2. Random approach : Take a random action at every time step.
- Cons : Works well only when a random policy happens to be the optimal one, since it never exploits what has been learned.
3. Epsilon-greedy approach : Take the action with the highest estimated reward, but with a small probability take a random action instead. Epsilon is a hyperparameter that gives the probability of taking a random action. It is usually initialized to a high value and gradually decayed to a small constant (e.g. 0.1). This is almost the standard choice.
4. Boltzmann approach : Take an action sampled with weighted probability. Apply softmax to the estimated values of the actions to get the probability of choosing each one.
- Pros : Also takes into account information about the values of the other actions, not only the best one.
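
The four strategies above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation. It assumes the agent keeps a list of estimated action values (called `q_values` here). The function names, the linear decay schedule in `decayed_epsilon`, its default of 1000 decay steps, and the `temperature` parameter of the softmax are all assumptions made for the example and do not come from the notes.

```python
import math
import random


def greedy(q_values):
    """Return the index of the action with the highest estimated value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def random_action(q_values, rng=random):
    """Return a uniformly random action index, ignoring the estimates."""
    return rng.randrange(len(q_values))


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy(q_values)


def decayed_epsilon(step, start=1.0, end=0.1, decay_steps=1000):
    """Assumed schedule: linear decay from start to end, then constant."""
    fraction = min(step / decay_steps, 1.0)
    return start + fraction * (end - start)


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over the estimated values (temperature is an assumed knob)."""
    highest = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample an action index with probability given by the softmax."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

For example, with `q_values = [1.0, 3.0, 2.0]`, `greedy` always picks action 1, while `boltzmann` picks action 1 most often but still chooses action 2 more often than action 0, because the softmax ranks the remaining actions by their estimated values. In a training loop, one might call `epsilon_greedy(q_values, decayed_epsilon(step))` so that exploration shrinks as the step count grows.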