This post will discuss reinforcement learning through policy-based agents on an OpenAI environment.

Policy-based reinforcement learning is simply training a neural network to remember the actions that worked best in the past. This framework provides incredible flexibility and works across many environments.

This post will discuss reinforcement learning through policy-based agents. We’ll be using OpenAI’s gym environment, which I discussed in my last post. The next few posts are heavily influenced by Arthur Juliani’s great series on reinforcement learning. You can find his take on policy-based agents here.

Everything available here can be found on my github page under Part 2 — Policy-based Agents with Keras.

Also, you can check out part 1 and part 2 of my series on reinforcement learning.

Policy-based methods train a model to look at a state and determine the best long-term action from that state. The basic setup is as follows:

  1. Play a game more or less randomly, recording the performance
  2. Rank your performance relative to other times you’ve played the game
  3. Train a model to mimic the actions from the games in which you had a high score
  4. Test your model

That’s all there is to it. Let’s go through each step in detail.

Step 1 is to play more or less randomly. I say more or less because you do want to use your model. The idea is that you want to act randomly enough to explore, but you also want to spend time in the situations your bot will eventually experience when it follows its own policy. There is a trade-off between acting completely randomly (exploration) and using the bot (exploitation), also known as variance (exploration) and bias (exploitation). Arthur Juliani has a great post on the trade-off which I recommend for more information.

There are a number of ways you can handle the issue. One is simply to sometimes choose a random action and other times choose the model’s best predicted output. When you start training, you’d likely want to favor random actions, but as your bot gets better, you should increasingly favor what the model suggests and only occasionally act randomly.

Another way is to look at the strength of the model’s convictions. If the model predicts a score of .5001 and .4999 for action 1 and 2, respectively, you may as well guess. But if the score is 0.999 and 0.001, then you should probably go with action 1, since action 2 is likely a dead end and doesn’t need to be pursued.

There are other ways, and you can get creative, but for my bot I chose the second approach.
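As a rough sketch of that second approach, assuming a Keras model whose final softmax layer outputs one probability per action (the helper name choose_action is my own), we can sample actions in proportion to the model’s confidence:

import numpy as np

def choose_action(model, state, n_actions=2):
    # Ask the model for one probability per action (the softmax output)
    probs = model.predict(state.reshape(1, -1))[0]
    probs = probs / probs.sum()  # guard against float rounding before sampling
    # A near 50/50 output behaves like a coin flip; a 0.999/0.001 output
    # almost always follows the model's preferred action
    return np.random.choice(n_actions, p=probs)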

Step 2 involves ranking relative performance. Here you encounter the problem of attribution. Suppose you die or receive a very large negative reward. Some of the actions preceding that moment are likely to blame, but you cannot be sure which action caused your downfall. The final move right before that moment likely contributed, the action before it also likely contributed but perhaps less so, and so on. The same is true if you make a great move: the eventual reward may be due to a move you made long before it arrived.

Here we want to take a score and discount it. So if we have a game that rewards us one point for being alive, and we were alive for 4 turns, our reward would look like [1,1,1,1]. And suppose we took 4 actions [A,B,C,D]. With a discount factor of 0.99, our discounted reward for action A should be the sum of 100% of the first reward, 99% of the second reward, roughly 98% of the third reward, and roughly 97% of the fourth reward. We want to put greater weight on reward 2 than reward 4, since action A more likely had a greater influence on reward 2. So here we can compute a present value of the rewards:

import numpy as np

def discount_rewards(r, gamma=0.99):
    """Takes 1d float array of rewards and computes discounted reward
    e.g. f([1, 1, 1, 1], 0.99) -> [3.94, 2.97, 1.99, 1.0]
    """
    prior = 0
    out = []
    # Walk backwards through the rewards so each step accumulates the
    # discounted value of everything that came after it
    for val in r[::-1]:
        new_val = val + prior * gamma
        out.append(new_val)
        prior = new_val
    return np.array(out[::-1])

Now we have a discounted reward for a given episode. But we want relative performance. So in the example above, just because our last move gave us a reward of 1 doesn’t mean it was a good move: the average discounted return is roughly 2.5, so a reward of 1 isn’t that good. Furthermore, we can look at other episodes, and perhaps in other episodes we had a much higher average score. So we want to normalize the returns. This can be done by subtracting the mean of the batch from the discounted rewards and dividing by the standard deviation.
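Here is a minimal sketch of that normalization step, building on the discount_rewards function above. The function name normalize_rewards and the small epsilon guard against a zero standard deviation are my additions:

def normalize_rewards(discounted):
    discounted = np.array(discounted, dtype=float)
    # Subtract the batch mean and divide by the standard deviation
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)

# e.g. normalize_rewards(discount_rewards([1, 1, 1, 1]))
# -> roughly [1.34, 0.45, -0.44, -1.35]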

Step 3 involves nudging our model to mimic the moves that gave us relatively high scores and to discourage the moves that gave us relatively low scores. Let’s talk a little about the model we’ll use.

The model is similar to the model I described in part 2. It’s a neural network that takes in some input representing the state, multiplies it by some weights, runs it through some activation function (e.g. relu, which just floors the output at 0), and comes up with a few numbers that represent how good each action is. In the case of CartPole, our state is represented by 4 numbers, I chose to use 8 hidden neurons, and there are 2 possible actions.

Our neural network will look like this:

Layer (type)                 Output Shape              Param #     
=================================================================  
input_x (InputLayer)         (None, 4)                 0           
_________________________________________________________________  
dense_1 (Dense)              (None, 8)                 32          
_________________________________________________________________  
out (Dense)                  (None, 2)                 16          
=================================================================  
Total params: 48  
Trainable params: 48  
Non-trainable params: 0  
_______________________________________________________________

We take in a state (4 numbers) and output 2 numbers, one for each action. To the final layer we apply a softmax, which exponentiates each output and divides by the sum, so the two numbers add up to one based on their relative size. So if the output is [1, 2], the softmax will be roughly [0.27, 0.73]. This is useful because when we use the model to decide what action to take, we can treat the final layer as a probability distribution. The final output just represents how strong each move is relative to the other moves.
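Here is one way such a model might be defined in Keras. The use_bias=False setting is my reading of the parameter counts in the summary above (4 x 8 = 32 and 8 x 2 = 16 implies no bias terms), so treat this as a sketch rather than the exact code from the repo:

from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(4,), name="input_x")  # CartPole state: 4 numbers
hidden = Dense(8, activation="relu", use_bias=False, name="dense_1")(inp)
out = Dense(2, activation="softmax", use_bias=False, name="out")(hidden)  # one probability per action
model = Model(inputs=inp, outputs=out)
model.summary()  # should match the table above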

Now that we have a model to decide which action to take, we have to train that model. In step 2 we tracked the states and the relative performance of the action taken in each state. Now we have to teach our model to predict the good actions and try something else on the bad actions. Our Keras model can learn things on its own, but we have to define what its goal should be. This can be done by writing a custom loss function, and Keras will look to minimize that loss. Below is our custom loss function:

from keras import backend as K

def custom_loss(y_true, y_pred):
    # adv is the advantage per step, assumed to be defined outside and captured via closure
    log_lik = K.log(y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred))
    return K.mean(log_lik * adv, keepdims=True)

The log_lik term is just a fancy way of asking: did I guess correctly? y_true is given, and y_pred is what I guessed.

Consider the case where the actual value is 0 and I predicted 0.01. Plug the values into the formula:

y_true = 0, y_pred = 0.01  
log(0 * (0 - 0.01) + (1 - 0) * (0 + 0.01))  
log(0.01)  
-4.61

So if we’re close, our log_lik value comes out very negative. As we get closer to an exact match, the log_lik value approaches negative infinity.

How about when we’re wrong?

y_true = 0, y_pred = 1  
log(0 * (0 - 1) + (1 - 0) * (0 + 1))  
log(1)  
0

From the above, you can see that our log_lik will range from negative infinity (exact match) to zero (completely wrong).
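You can sanity-check those two cases with plain numpy, which just re-evaluates the formula outside of Keras:

import numpy as np

def log_lik(y_true, y_pred):
    return np.log(y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred))

print(log_lik(0, 0.01))  # about -4.61: nearly an exact match, very negative
print(log_lik(0, 1.0))   # 0.0: completely wrong, log(1) = 0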

We set our loss equal to the log likelihood multiplied by the advantage. So if our model predicted well (log_lik very negative) and our rewards were relatively high (advantage positive), then the loss will be very negative. When we train, the optimizer minimizes the loss, so the model is adjusted to make that loss even more negative, reinforcing those actions. On the other hand, if our relative rewards are bad (advantage negative), then our loss will be positive and the model will be adjusted to make that outcome less likely.
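One way to wire the advantage into that loss is to feed it to the model as a second input so the loss function can see it through a closure, which is what the custom_loss above assumes. The names state_in, advantage and train_model are mine, the exact wiring in the repo may differ, and this closure trick targets the older Keras 2 / TensorFlow 1 style backend:

from keras.layers import Input, Dense
from keras.models import Model
from keras import backend as K

state_in = Input(shape=(4,))
advantage = Input(shape=(1,))  # normalized discounted reward for each step
hidden = Dense(8, activation="relu", use_bias=False)(state_in)
action_probs = Dense(2, activation="softmax", use_bias=False)(hidden)

def custom_loss(y_true, y_pred):
    # advantage is captured from the enclosing scope
    log_lik = K.log(y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred))
    return K.mean(log_lik * advantage, keepdims=True)

train_model = Model(inputs=[state_in, advantage], outputs=action_probs)
train_model.compile(optimizer="adam", loss=custom_loss)

# states: (n, 4) array, actions: (n, 2) one-hot of the actions actually taken,
# advantages: (n, 1) normalized discounted rewards
# train_model.train_on_batch([states, advantages], actions)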

Step 4 is finally testing our model. As I described in step 1, we’re injecting some randomness into our learning by choosing the action based on the output of the final layer. So if our final layer is [0.75, 0.25], we’ll take action 1 over action 2 75% of the time. But when it comes time to test, we should probably just take the higher value (the greedy approach). To see how our model is actually doing, we need to test periodically by acting greedily. When we hit a respectable score, we can stop.
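As a rough sketch of that periodic greedy test, assuming a gym CartPole-v0 environment and a trained Keras model like the one above (the function name test_greedy and the episode counts are my own):

import gym
import numpy as np

def test_greedy(model, episodes=10, max_steps=500):
    # Play a few episodes, always taking the model's highest-scoring action
    env = gym.make("CartPole-v0")
    scores = []
    for _ in range(episodes):
        state = env.reset()
        total = 0
        for _ in range(max_steps):
            probs = model.predict(state.reshape(1, -1))[0]
            action = np.argmax(probs)  # greedy: no sampling at test time
            state, reward, done, _ = env.step(action)
            total += reward
            if done:
                break
        scores.append(total)
    return np.mean(scores)  # stop training once this is respectable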

And that’s it.

I’m not going to lie and tell you this is some 100% generalized method that will work out of the box on any environment. There are a lot of decisions we have to make during the process, and unfortunately, the only way to decide is to take a guess, see how it does, and adjust accordingly. The two main parameters that can throw off our learning are:

  1. Model type (number of neurons, layers, activations, learning rate, etc.)
  2. Action-to-reward attribution via the discount factor (a lower factor gives an action less credit for rewards that come later)

And of course we can run into endless loops, local minima, exploding gradients, or any number of other issues that plague neural networks.

That being said, policy based agents can be useful. They boil down to trying something out, seeing how well it does, and reinforcing the model to do more or less of that. And that’s the core of many other reinforcement learning algorithms.

Well, here’s the final code.

breeko/Simple-Reinforcement-Learning-with-Tensorflow on github.com

By Branko Blagojevic on February 12, 2018