I am writing this article while trying to build an understanding for myself. I will try to explain the intuition in simple words, and I am confident readers will correct me if something is amiss. I would highly appreciate it!
Recently we have seen many explorations and applications based on Reinforcement Learning. For real-life applications we would have to define our own environment and agent: what actions the agent can take, what the states look like, and so on. To keep it simple and lay out a basic understanding, we will use an existing environment from OpenAI Gym. The environment we will be using here is MountainCar-v0, a classic control problem. OpenAI Gym also has environments for more complex games such as Atari.
Mountain Car Problem:
In this problem, there is a car between two mountains. The car’s engine is not strong enough to drive straight up. The challenge is to take the car to the top of the right mountain (the flag is the destination). The car has to drive back and forth between the two slopes to gain enough momentum to climb up.
The environment gives the agent access to the states and the action space it interacts with. To better understand the Mountain Car environment, please refer to this GitHub code.
Action Space: We have a discrete action space here
0: Accelerate to the Left
1: Don’t accelerate
2: Accelerate to the Right
State: (position, velocity) of the car
We are going to define our state a bit differently from what the environment provides. We will give the DQN the change in the car’s position in the form of an image. At every step we will extract the image from the rendered game.
Let’s dive into the code:
Replay Buffer: <State,Action,Reward,Next_State>
The agent performs an action that changes the state, and in turn the agent receives a reward or a penalty. In our case the agent (car) gets a reward of 0 if it reaches the target (position 0.5) or a penalty of -1 if its position is less than 0.5. Living penalties such as this make sure the agent tries to reach the final goal in as few steps as possible. When we design our own environments, the reward scheme plays a vital role.
The above image is a close representation of how the environment and agent interact with each other. The replay buffer stores each of these interactions as a tuple. These observations then act as fodder for our DQ network. In the initial interactions (or episodes) the actions are going to be random. After some episodes, our agent can rely on the predictions of the trained DQ network.
Push: The newly generated <SARS’> tuple is stored in the replay buffer. Once the buffer has reached its capacity, the oldest observations are overwritten.
Sample: This function returns a random set of observations to train the network. The intuition behind taking random samples is to make sure the network is not trained on correlated or successive <SARS’> tuples. Imagine an agent taking the same action for 100-odd consecutive states, like driving right (in our use case): the network would heavily reinforce this action, its parameters would be adjusted to fit along these lines, and its predictions may become biased towards that particular action. (for a better explanation of why)
Moreover, by the Markov property, the network should be able to learn from the current state alone, without relying on past states.
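To make Push and Sample concrete, here is a minimal sketch of a replay buffer in plain Python. The names ReplayBuffer and Transition, and the circular-index approach to overwriting old entries, are my own assumptions for illustration and may differ from the full code linked at the end.

```python
import random
from collections import namedtuple

# One <S, A, R, S'> interaction between the agent and the environment.
Transition = namedtuple("Transition", ("state", "action", "reward", "next_state"))


class ReplayBuffer:
    """Fixed-capacity store of transitions (illustrative sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0  # index of the slot that will be written next

    def push(self, state, action, reward, next_state):
        """Store a transition, overwriting the oldest one once full."""
        transition = Transition(state, action, reward, next_state)
        if len(self.memory) < self.capacity:
            self.memory.append(transition)
        else:
            self.memory[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        """Return a random batch so training data is not consecutive steps."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```

Training would typically start only once len(buffer) is at least the batch size, since random.sample raises an error when asked for more items than the buffer holds.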
Select Action: Epsilon Greedy Way
The action selection could be based on exploration or exploitation.
Exploration: The agent picks a random action whenever a random draw falls below the epsilon threshold. This threshold is decayed at every step taken by the agent, which makes sure the model eventually stops relying on random actions and starts taking the actions predicted by the network.
Exploitation: The agent takes the action with the best expected outcome (as suggested by the network in this case). Further exploration could be introduced by adding some noise to the DQN’s softmax values or to whatever its final layer outputs.
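The two modes can be sketched in a few lines of plain Python. The constants EPS_START, EPS_END and EPS_DECAY, the exponential decay schedule, and the function names are assumptions chosen for illustration, not values taken from this article's code. Here q_values stands in for the network's output, one number per action.

```python
import math
import random

# Hypothetical schedule: start almost fully random, settle at 5% random.
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200  # larger value means slower decay


def epsilon_threshold(steps_done):
    """Probability of taking a random action after steps_done steps."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)


def select_action(q_values, steps_done, rng=random):
    """Epsilon-greedy choice over a list of per-action values."""
    if rng.random() < epsilon_threshold(steps_done):
        # Exploration: ignore the network and pick any action.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest predicted value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

For Mountain Car, q_values would have three entries (left, no acceleration, right), so the function returns 0, 1 or 2.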
Instead of the state the environment provides, we want to pass a visual representation (images) to the network as the state. Passing the whole rendered window may not be optimal, so we locate the car and crop a fixed-size image around it.
I believe the trick here is to correctly locate the car on the screen. The screen width in pixels may vary, but the width of the environment, expressed in position units, is fixed. In our case the track is 1.8 units wide (positions run from -1.2 to 0.6), which gives a scale of screen_width / 1.8 pixels per unit.
We add (2/3) of the screen width because position 0 on the environment's scale sits at 1.2/1.8 of the screen width. It took some trial and error to arrive at this formula. I would suggest using the equation to draw lines at the car’s computed locations to confirm it.
int(env.state[0] * scale + (2 / 3) * screen_width)
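As a rough check of the formula, the sketch below computes the car's pixel column from its position and then picks a fixed-width crop around it. The helper names car_pixel_x and crop_bounds, the default world width of 1.8 units, and the clamping of the crop to the screen edges are my own assumptions for illustration.

```python
def car_pixel_x(position, screen_width, world_width=1.8):
    """Map the car's position (in environment units) to a pixel column."""
    scale = screen_width / world_width  # pixels per environment unit
    # Position 0 sits 1.2 / 1.8 = 2/3 of the way across the screen.
    return int(position * scale + (2 / 3) * screen_width)


def crop_bounds(car_x, screen_width, view_width):
    """Left and right pixel columns of a fixed-width window around the car.

    The window is shifted back inside the screen when the car is close
    to either edge, so the crop always has exactly view_width columns.
    """
    left = car_x - view_width // 2
    left = max(0, min(left, screen_width - view_width))
    return left, left + view_width
```

With a 600-pixel-wide screen, position 0 lands near column 400, and a 200-pixel view around it covers columns 300 to 500.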
This is where all the ‘deep action’ lies. Though the network is not that deep, it’s good enough to help the agent navigate. We will be using a CNN-based architecture followed by a linear layer whose output size equals the size of the action space (3).
conv2d_size_out is just a utility to determine the number of neurons after the 3rd convolution, before they go into the linear layer. The output dimension of the linear layer is the number of discrete actions the agent can take.
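Here is a sketch of what such a utility computes, in plain Python. The kernel size of 5, stride of 2, absence of padding, 32 output channels and 40x90 input are assumptions borrowed from the PyTorch tutorial credited below; the helper flattened_size is hypothetical and only added to show how the result feeds the linear layer.

```python
def conv2d_size_out(size, kernel_size=5, stride=2):
    """Output length along one dimension of a convolution.

    Assumes no padding and a dilation of 1.
    """
    return (size - (kernel_size - 1) - 1) // stride + 1


def flattened_size(height, width, out_channels=32, num_layers=3):
    """Number of inputs to the linear layer after num_layers convolutions."""
    for _ in range(num_layers):
        height = conv2d_size_out(height)
        width = conv2d_size_out(width)
    return height * width * out_channels


# A 40x90 crop shrinks to 2x8 after three convolutions.
linear_input_size = flattened_size(40, 90)
```

Computing this number from the image size, rather than hard-coding it, means the crop size can change without having to edit the linear layer by hand.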
Finally, the DQN is able to drive the car. This is a video showing a few seconds from an episode.
Full code available here
Let’s try to build our own environment in the next article.
The above article is inspired by the PyTorch Reinforcement Learning tutorial.
Link to this article on my blog: https://impatienttechie.com/reinforcement-learning-train-your-own-agent-using-deep-q-networks/