1. Greedy approach : Always take the action with the highest estimated reward at each time step.

- Cons : If a suboptimal action happens to look best early on, the agent keeps exploiting it and never revisits the alternatives.

 

2. Random approach : Take a random action at every time step.

- Cons : Ideal only when a random policy happens to be optimal.

 

3. Epsilon-greedy approach : Take the action with the highest estimated reward, but with a small probability take a random action instead. Epsilon is a hyperparameter giving the probability of acting randomly; it is usually initialized high and gradually decayed to a small constant (e.g. 0.1). -- Almost the standard choice (see the sketch after this list)

 

4. Boltzmann approach : Sample an action with probability proportional to its estimated value, using a softmax over the action-value estimates (sketched after this list).

- Pros : Also uses information about the estimated values of the other actions, not just the single best one.
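
A minimal sketch of both selection rules, assuming a vector of estimated action values `q_values` (the array name and the temperature parameter are illustrative, not from the original post):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    """Sample an action from a softmax over the action-value estimates."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                       # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))
```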


Multi-armed bandit : policy $\pi$ = the network weights
Contextual bandit : policy $\pi$ = the action selected from the output layer, given a state
Complete reinforcement learning : actions also influence the future states and rewards (the MDP setting below)

Markov Decision Process (MDP)
With states $s \in S$ and actions $a \in A$, given $(s, a)$ the probability of transitioning to a new state $s'$ is written $T(s, a, s')$ and the reward $R(s, a)$. In an MDP, at each step the agent takes action $a$ in state $s$, moves to state $s'$, and receives reward $r$.
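
A toy example, assuming a hypothetical two-state, two-action MDP (the state/action names and all numbers are made up for illustration):

```python
# Hypothetical MDP: T[s][a][s'] = transition probability, R[s][a] = expected reward.
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}},
}
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
```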

Bellman Equation
$Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a')$
The expected long-term reward for an action equals the immediate reward from the current action plus the discounted expected reward from the best action taken in the next state.
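
For example (with made-up numbers): if $r = 1$, $\gamma = 0.9$, and $\max_{a'} Q^*(s', a') = 5$, then $Q^*(s, a) = 1 + 0.9 \cdot 5 = 5.5$.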

Q-learning
Instead of directly mapping an observation to an action, Q-learning learns a value for each state-action pair and chooses actions based on those values, updating the estimates from the reward observed at each time step $t$.
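
A minimal tabular Q-learning sketch; the table sizes, hyperparameters, and function name are illustrative assumptions, not from the original post:

```python
import numpy as np

n_states, n_actions = 16, 4             # illustrative sizes
Q = np.zeros((n_states, n_actions))     # tabular action-value estimates
alpha, gamma = 0.1, 0.99                # learning rate and discount factor

def q_update(s, a, r, s_next, done):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```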

Deep Q-learning

  1. Expand the single-layer network into a multi-layer convolutional network.
  2. Add an experience (replay) buffer, so the network learns from experiences stored in the buffer rather than only from the latest transition.
  3. Compute the target Q value during updates using a separate target network.

Experience buffer
Store the agent's experiences (state, action, reward, next state) and train the network later by randomly sampling experiences from the buffer. When the buffer is full, old experiences are replaced by new ones.
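
A minimal replay-buffer sketch, assuming transitions are stored as (state, action, reward, next_state, done) tuples (the class name, capacity, and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when capacity is reached."""
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        """Return a random batch of stored transitions."""
        return random.sample(self.buffer, batch_size)
```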

Target Network
Use the second (#2) network to generate the target Q values used to compute the loss for every action. Because the target Q values and the predicted Q values could otherwise chase each other in a feedback loop, the target network is only updated periodically, by copying the weights of the main Q network.
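
A sketch of the periodic copy, assuming PyTorch-style networks `q_net` and `target_net` (the network architecture, variable names, and sync interval are illustrative):

```python
import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # illustrative
target_net = copy.deepcopy(q_net)       # target network starts as a copy of the main one

UPDATE_EVERY = 1000                     # illustrative sync interval

def maybe_sync(step):
    """Every UPDATE_EVERY steps, copy the main network's weights into the target network."""
    if step % UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())
```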

Double DQN (DDQN)
A method that evaluates the target Q value with the target network, but for the action selected by the main (#1) network, instead of simply taking the maximum of the target network's Q values.

$Q_{target} = r + \gamma \, Q(s', \operatorname{argmax}_{a} Q(s', a; \theta); \theta')$
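
A minimal sketch of the DDQN target, assuming `q_main` and `q_target` are callables returning a vector of action values for a next state (the function names are illustrative):

```python
import numpy as np

def ddqn_target(r, s_next, done, q_main, q_target, gamma=0.99):
    """Select the next action with the main network, evaluate it with the target network."""
    if done:
        return r
    a_star = int(np.argmax(q_main(s_next)))       # argmax_a Q(s', a; theta)
    return r + gamma * q_target(s_next)[a_star]   # Q(s', a*; theta')
```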

Dueling DQN
Q : how good it is to take a given action in a given state.

$Q(s, a) = V(s) + A(s, a)$

Value function $V(s)$ : how good it is to be in a given state.

Advantage function $A(s, a)$ : how much better it is to take this action than the other actions in that state.
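
A sketch of how the two streams can be combined; in the Dueling DQN paper the mean advantage is subtracted so that $V$ and $A$ stay identifiable (the array shapes here are illustrative):

```python
import numpy as np

def dueling_q(v, advantages):
    """Combine a scalar state value and a vector of per-action advantages into Q values.

    Subtracting the mean advantage keeps V and A identifiable, as in the Dueling DQN paper.
    """
    advantages = np.asarray(advantages, dtype=float)
    return v + (advantages - advantages.mean())
```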

Asynchronous Advantage Actor-Critic (A3C)


A global network plus multiple worker agents, each with its own copy of the network parameters.
Actor-Critic : In A3C, the network estimates both the value $V(s)$ and the policy $\pi(s)$ (output probabilities for each action) with fully-connected heads.
--> Each agent updates its policy using the critic's value estimate.

Advantage : Up to now, we used the discounted return to judge whether an action was good or bad: $R = \sum_{k} \gamma^{k} r_{t+k}$.
In A3C, we use advantage estimates instead, because we want to measure how much better an action turned out than expected.

$A = Q(s, a) - V(s) = R - V(s)$
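
A minimal sketch of computing discounted returns and advantages for one rollout, assuming `rewards` and `values` are per-step lists collected by a worker (the names and discount factor are illustrative):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return np.array(returns[::-1])

def advantages(rewards, values, gamma=0.99):
    """A_t = R_t - V(s_t), the advantage estimate used to update the policy."""
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)
```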
