Reinforcement learning is an area of machine learning that involves taking right action to maximize reward in a particular situation. In this full tutorial course, you will get a solid foundation in reinforcement learning core topics.
The course covers Q learning, SARSA, double Q learning, deep Q learning, and policy gradient methods. These algorithms are employed in a number of environments from the open AI gym, including space invaders, breakout, and others. The deep learning portion uses Tensorflow and PyTorch.
The course begins with more modern algorithms, such as deep q learning and policy gradient methods, and demonstrates the power of reinforcement learning.
Then the course teaches some of the fundamental concepts that power all reinforcement learning algorithms. These are illustrated by coding up some algorithms that predate deep learning, but are still foundational to the cutting edge. These are studied in some of the more traditional environments from the OpenAI gym, like the cart pole problem.
⭐️ Course Contents ⭐️
⌨️ (00:00:00) Intro
⌨️ (00:01:30) Intro to Deep Q Learning
⌨️ (00:08:56) How to Code Deep Q Learning in Tensorflow
⌨️ (00:52:03) Deep Q Learning with Pytorch Part 1: The Q Network
⌨️ (01:06:21) Deep Q Learning with Pytorch part 2: Coding the Agent
⌨️ (01:28:54) Deep Q Learning with Pytorch part
⌨️ (01:46:39) Intro to Policy Gradients 3: Coding the main loop
⌨️ (01:55:01) How to Beat Lunar Lander with Policy Gradients
⌨️ (02:21:32) How to Beat Space Invaders with Policy Gradients
⌨️ (02:34:41) How to Create Your Own Reinforcement Learning Environment Part 1
⌨️ (02:55:39) How to Create Your Own Reinforcement Learning Environment Part 2
⌨️ (03:08:20) Fundamentals of Reinforcement Learning
⌨️ (03:17:09) Markov Decision Processes
⌨️ (03:23:02) The Explore Exploit Dilemma
⌨️ (03:29:19) Reinforcement Learning in the Open AI Gym: SARSA
⌨️ (03:39:56) Reinforcement Learning in the Open AI Gym: Double Q Learning
⌨️ (03:54:07) Conclusion
Course from Machine Learning with Phil. Check out his YouTube channel:
Learn to code for free and get a developer job:
Read hundreds of articles on programming:
And subscribe for new videos on technology every day:
welcome to the reinforcement learning jumpstart series I'm your host Phil Taber if you don't know me I'm a physicist and former semiconductor engineer turned machine learning practitioner in this series of tutorials you're gonna learn everything you need to know to get started with reinforcement learning you don't need any prior exposure all you really need is some basic familiarity with Python as far as requirements for this course they are pretty light you will need the opening eye gym because we're gonna be taking advantage of that rather extensively you'll also need the atari extension for that so we can play games like breakout space invaders you'll also need the box 2d extension so we can do the lunar lander environment and beyond that you will need the tensor flow library as well as pi torch and I'm gonna have tutorials in both tensorflow and PI torch with a bit of a stronger emphasis on tensor flow I'm gonna teach a course in somewhat of a top-down fashion meaning we're gonna get to the really important and exciting new stuff like you learning and policy gradient methods first after that we'll kind of back up and take a look at things like sarsa double Q learning and we even get into how to make your own reinforcement learning environments when we code up our own grid world and then solve it with regular Q learning if you missed something in the code don't worry I keep all this code on my github which I will link in the pinned comment down below I'll also link the relevant time stamps for all the material in case you want to jump around because maybe some topics interest you more or you want to get some additional background information from the explainer videos questions comments leave them down below I will address all of them let's get to it in this video you're gonna learn everything you need to know to implement Q learning from scratch you don't need any prior exposure to Q learning you don't even really need much familiarity with reinforcement learning you get everything you need in this video if you're new here I'm Phil and I'm here to help you get started with machine learning I upload three videos a week so make sure to subscribe so you don't miss out imagine you've just gotten the recognition you deserve in the form of offers for a machine learning engineering position from Google Facebook and Amazon all three are offering you a boatload of money and your dreams a big bowling are interrupted by the realization that's starting salary is just well the starting salary you've got friends in each of the three companies so you reach out to find out about the promotion schedules with each Facebook offer is $250,000 to start with a 10% raise after three years but with a 40% probability that you'll quit Google Offers $200,000 to start with the 20% raise after three years but with only a 25% probability that you'll quit amazon offers 350,000 to start with a 5% raise after five years with a 60% chance that you'll end up washing out so what should you take all three of our big money but feature raises are far from certain this is the sort of problem reinforcement learning is designed to solve how can an agent maximize long-term rewards and environments with uncertainties q-learning is a powerful solution because it lets agents learn from the environment in real time and quickly learn novel strategies from mastering the task at hand cue learning works by mapping pairs of states and actions to the future rewards the agent expects to receive it decides which actions to take based on a strategy called epsilon greedy action selection basically the agent spends some time taking random actions to explore the environment and the remainder of the time selecting actions with the highest known expected feature rewards epsilon refers to the fraction of the time the agent spends exploring and it's a model hyper parameter between zero and one you can gradually decrease epsilon over time to some finite value so that your agent eventually converges on a mostly greedy strategy you probably don't want to set epsilon at zero exactly since it's important to always be testing the agents model of the environment after selecting and taking some action the agent gets its reward from the environment what sets Q learning apart from many reinforcement learning algorithms is it performs its learning operation after each time step instead of at the end of each episode as is the case with policy gradient methods at this point it's important to make a distinction traditional Q learning works by literally keeping a table of state and action pairs if you're implementing this in Python you could use a dictionary with state in action tuples as keys the only feasible and environments the limited number of discrete states and actions here the agent doesn't need to keep track of its history since it can just update the table in place as it plays to the game the way the agent updates its memories by taking the difference in expected returns between the actions it took with the action that had the highest possible future returns that sends up biasing the agent's estimates over time towards the actions that end up producing the best possible outcomes when we're dealing with environments that have a huge number of states or state space that is continuous then we really can't use a table in that case we have to use deep neural networks to take these observations of the environment and turn them into discrete outputs that correspond to the value of each action this is called deep cue learning the reason we have to use neural networks is that there are universal function approximate errs it turns out the deep neural Nets can approximate any continuous function which is precisely what we have the relationship between States actions and feature returns is a function that the agent wants to learn so can maximize its future rewards deep Q learning agents have a memory of the states they saw the actions they took and the rewards they received during each learning step the agent samples a subset of this memory to feed these States through its neural network and compute the values of the actions that took just like with a regular Q learning the agent also computes the values for the maximal actions and uses the difference between the two as its lost function to update the weights of the neural network so let's talk implementation in practice we end up with two deep neural networks one network called the evaluation network is to evaluate the current state and see which action to take and another network called the target network that is used to calculate the value of maximal actions during the learning step the reasoning for why you need two networks is a little complicated but basically it boils down to eliminating bias and the estimates of the values of the actions the weight of the target network are periodically updated with the weights of the evaluation network so that the estimates of the maximal actions could get more accurate over time if you're dealing with an environment that gives pixel images just like in the Atari library from the open a I tune then you will need to use a condom neural network to perform feature extraction on the images the output from the convolutional network is flattened and then fed into a dense neural network to approximately the values of each action for your agent if the environment has movement as most do then you have an additional problem to solve if you take a look at this image can you tell which way the ball or panel is moving it's pretty much impossible for you to get a sense of motion from just a single image and this limitation applies to the deep Q learning agent as well this means you'll need a way of stacking frames to give the agent a sense of motion so to be clear this means that the convolutional neural network takes in the batch of stacked images as input rather than a single image choosing an action is reasonably straightforward generate a random number and if it's less than the epsilon parameter pick an action at random if it's greater than the agent's epsilon then feed a set of stacked frames through the evaluation network to get the values for all the actions in the current state find the maximal action and take it once you get the new state back from the environment add it to the end of your stack frames and store the stack frames actions and rewards in the agent's memory then perform the learning operation by sampling the agent's memory it's really important to get a non sequential random sampling of the memory so that you avoid getting trapped in one little corner of parameter space as long as you keep track of the state transitions actions and rewards in the same way this should be pretty safe feed their random batch of data through the evaluation and target networks and the compute the loss function to perform your loss minimization step for the neural network that's really all there is to deep Q learning a couple neural networks a memory to keep track of states and lots of GPU horsepower to handle the training speaking of which of course you need to pick a framework preferably one lets you use a GPU for the learning PI torch and tensorflow are both great choices and both support model checkpointing this would be critical if you have other stuff to do and can't dedicate a day or so for model training that's it for now make sure to share the video if you found it helpful and subscribe so you don't miss any future reinforcement learning content I'll see you in the next video in this tutorial you're gonna learn how to use deep cue learning to teach an agent to play a breakout in the tensorflow framework you don't need to know anything about deep cue learning you don't even need to know anything about tensorflow you just have to follow along let's get started if you're new to the channel I'm Phil a physicist and former semiconductor engineer turned machine learning practitioner here at machine learning with Phil we do deep reinforcement learning and artificial intelligence tutorials three times a week so if you know that kind of thing hit that subscribe button let's get to the video so if you're not familiar with deep cue learning the basic idea is that the agent uses a convolutional neural network to turn these set of images from the game into a set of feature vectors those are fed into a fully connected layer to determine the value of each of the actions given some set of states in this case the set of states is just going to be a stack of frames because we want the agent to have a sense of motion so as we go along we'll be stacking up frames passing them to our network and asking the network hey what is the value of either of the actions move left move right or fire a ball we're gonna slow this into two classes one of which will house the deep Q networks and the other will house the agent and in the asian class we're gonna have other stuff that we'll get to later let's go ahead and start with our imports will need OS to handle model saving will need numpy to handle some basic random functions and of course tensorflow to build our agent so we'll start with our deep Q Network the initializer is pretty straightforward we're gonna take the learning rate number of actions the name of the network that is important because we're gonna have two networks one to select an action one to tell us the value of an action on that later the number of dimensions in the first fully connected layer the input dimensions of our environment so for the Atari Jim sorry the Atari library of the opening I Jim all of the images have 210 by 160 resolution and we're gonna pass in a set of frames to give the agent a sense of motion we're gonna pass in four frames in particular so it's gonna be 210 by default 210 160 by 4 we're gonna do some cropping later on we'll get to that in a minute we also need a directory to save our model so the next thing we need is the tensorflow session this is what instantiates everything into the graph and each network wants to have its own then we'll call the build Network function to add everything to the graph once you've added everything to the graph you have to initialize it very important tensorflow will complain if you don't do that so best to do that now and the way you do that is by calling the TF global variables initializer function other thing we need is a way of saving our models as we go along and this is critical because this deep queue Network takes forever to Train I let it train for about 10 hours and it averages a score of 2 to 3 points per set of whatever number of lies it gets so it's gonna have to train for quite some time so we're gonna want to be able to save it as we go along because we have other stuff to do right and of course you want a way of saving your checkpoint files next thing we need is a way of keeping track of the parameters for each particular network and you do that like this so what this will do is tell tensorflow that we want to keep track of all of the trainable variables for the network that corresponds to whatever the name of this particular network is we use this later when we copy one network to another next let's build our network so we're gonna encase everything in a scope that is based on the age of the network's name we're gonna have placeholder variables that tell us the inputs to our model we're gonna want to input the stack of images from the Atari game we want to input the actions that the agent took as well as the target value for the cue network I'll get to that in a minute and this convention of naming in naming placeholders and layers you're gonna see repeated throughout the tensorflow library the reason is that it it makes debugging easier if you get an error it will tell you the variable or layer that caused the error very handy so you can probably tell by the shape here that we are going to do a one hot encoding of the actions and the same thing for the cue target so the convention of using none as the first parameter the shape allows you to train a batch of stuff and that's important because in virtually every deep learning application you want to pass in a batch of information right in this case we're gonna be passing in batches of stacked frames so we'll get to that in a moment the next thing we have to do is start to build our scroll down a little bit and start building our convolutional neural network so let's start building our layers the first one will have 32 filters kernel size of 8×8 strides a 4 and a name of conf 1 the other thing we need is an initializer now we are going to use the initializer that the deepmind team used in their paper reason is that we want to learn from the experts and may as well do what they do if it's going to make our life significantly easier and that's going to be variance scaling initializer the scale of two and then we want to activate that with a rel u function sorry it's calmed one activated so our next layer is pretty similar it'll take the activated output of the first layer as input that'll take 64 filters if you're not familiar with what a filter is you can check out my video on convolutional neural networks a kernel size of in this case 4×4 strides of – name of com2 and the we can go ahead and copy that initializer why not so that is our second convolutional layer we're gonna do something similar for the third of course and that will take komp2 activated 128 filters to by sorry i 3×3 kernel good grief astride of 1 and a name of conf 3 and the same initializer and of course we want to activate it as well so the next step is once we have the outputs of our convolutional neural network we want to flatten all them and pass them through a dense network to get our Q values or the values of each state action pair let's do that now and that's where our fc-1 dinners come in and we need a value activation and oops we will do the same initializer for the dense layer so next up we need to determine the Q values Q and Q learning just refers to the value of a state action pair it's just the nomenclature and this will be the output of our neural network and of course we want to have one output for each action and this gets the same initializer now we're not activating that yet we want to just get the linear values these are the linear activation of the output of our network so the next thing we need is the actual value of Q for each action and remember actions is a placeholder and next thing we need for every neural network is a loss function so we just want to have the squared difference between the Q value that the network outputs and something called acute target the Q target let's get to that now so the the way Q learning works is that at each time step it's a form of temporal difference learning so every time step it learns and it says hey I took some action what was the maximal action I could have taken and then it takes the delta between whatever action it took and the maximal action and uses that to update the the the neural network as its lost function so our training operation it's just a form of gradient descent atom optimizer in this case this alerting rate and you want to minimize that loss function let's give this more room so that is almost it so that is our network so the next thing we need is a way of saving files right and sape and loading them as well the reason we want this is as we said these models take a notoriously long time to Train and so we may want to start and stop as we go along and what this will do is it will look in the checkpoint file and load up the graph from that file and save it and load it into the graph of the current session and we're gonna say frequently as we train something like every 10 games and all this function does is it takes the current session and opposite to a file pretty handy so that is our deep Q Network what this does again is it takes a batch of images from the environment in this case breakout passes it through convolutional neural network to do the feature selection and it passes it through a fully connected layer to determine the value of each given action and then uses the the maximum value of the next action to determine its loss function and perform training on that network network via backpropagation next up we need an agent that includes everything else all of the learning all the memories all that good stuff so it's gonna take something called alpha that is the learning rate gamma that's a discount factor a hyper parameter of our model the memory size number of actions an epsilon that determines how often it takes a random action a batch size a parameter that tells us how often we want to replace our target Network set of input dims we use the same as before 210 160 by for one moment my cat is whining we need the directory to save the queue next network and we will need a directory to save the cute evaluation and what this as I said will have two networks one that tells us the action to take the other one that tells us the value of that action so let's go ahead and start our initializer so when we take random actions we will need to know the action space which is just the set of all possible actions and we need to know the number of actions we need our discount factor gamma this tells the agent how much it wants to discount future rewards the memory size which tells us how many transitions to store a memory of course our epsilon and Epsilon greedy and then we need our network to tell the agent the value of the next action so we'll pass in the alpha learning rate number of actions input dims the name and the checkpoint directory okay so now we have our two networks the next thing we need is a memory so cue learning works by saving the state action reward and new state transitions in its memory we're also going to save the terminal flags that tell the agent whether or not the game is done that'll go into the calculation of our reward when we do the learning function so we need a state memory just a numpy array of zeros we shape mem size and input dims and so this will save a set of four transitions of four frames stacked four frames by number of memories and we also need an action memory and this will handle the one this will store the one hot encoding of our actions that is just one-dimensional this will just store the agent's memory other rewards and then we need the terminal memory so this just saves the memory of the done flex and to save RAM will say that one is into eight and you know what we can do the same thing with the actions and this is important because we're gonna be saving 25,000 or so transitions on my PC this consumes about 47 gigabytes 48 gigabytes of RAM I have 96 so it fits if you have less you're gonna need a significantly smaller memory size just something to keep in mind so next thing we need to do is to store those transitions in memory and this is of course pretty stateful straightforward so we need the old state the action the reward the new state let's just call that state underscore and a terminal flag so that I'll just be an integer 0 or 1 so the agent has some fixed memory signs we want to fill up to that memory and then when we exceed it we just want to go back to the beginning and start overriding it so the index is just gonna be mem Kontor which is the something I forgot let's put that up here so that will be the counter that keeps track of the number of memories that it has stored so for our actions we need to do the one hot encoding and when we pass in the action it'll just be an integer so making an array of zeros and said in the index of the action you took to one is a one hot encoding and of course you want to increment the memory counter so the next thing we need is a way of choosing actions so dpq learning relies on what is called Epsilon greedy so epsilon is a parameter that tells it how often to choose a random action we're gonna DK epsilon over time the agent will start out acting and purely randomly for many many hundreds of games and eventually the random factor will start decreasing over time and the agent will take more and more greedy actions the greedy action is choosing the action that has the highest value of the next state so this takes the state as input you need a random number from the numpy random number generator and then will select an action at random from the agent's action space if we are going to take a greedy action then we need to actually find out what our next highest leap highest valued action is so we need to use our evaluation network to run the cute eval dot Q values tensor using a feed dict of the sorry the I can't talk and type at the same time of the current state has the Q evaluation network input and then of course if you want the maximum action you just need numpy to arc Max of actions and when you're done just return the action so now we come to the meat of the problem we have to handle the learning so learning has many parts to it the basic idea is first thing we're gonna do is check to see if you want to update the value of our target network and if it's time to do that we're gonna go ahead and do that the next thing we're gonna do is select a batch of random memories the most important thing here is that these memories are non sequential if you choose sequential memories and the agent will get trapped in some little nook and cranny a parameter space and what you'll get is oscillations and performance over time to actually have robust learning you want to select different transitions over the entirety of the memory so that's how you handle memory batching and sampling and then you have to calculate the value of the current action as well as the next maximum action and then you plug that into the bellman equation for the Q learning algorithm and run your update function on your loss so let's go ahead and do that so first thing we want to do is see if it's time to replace our network target network if it is go ahead and do that and we'll write the update graph function momentarily next thing we want to do is find out where our memory ends less than mm size else this will allow us to randomly sample a subset of the memory and this will give us a random choice in the range maximum of size batch size so next we need our state batches these are just sorry these are just the MM state transitions we will need the actions we took and remember we store these as a one hot encoding so we need to go back to a we need to go back to a integer encoding so we need to handle that and the simplest way to do that is just to do a numpy dot operation to just multiply two vectors together so next we need to calculate the values of the current set of states as well as the set of next States sorry sorry the Q eval and the next so this will take these set of next states these transit the transitions the next thing we want to do is copy the Q eval Network because we want the loss for all of the non optimal actions to be zero next thing we need to do is calculate the value of Q target so for all of these states in the batch for the actions we actually took Plus this quantity here so the maximum value of the next state multiplied by this quantity terminal batch so the reason is that if we if the next state is the end of the episode you just want to have the reward whereas if it is not a terminal state the next state then you want to actually take into account the discounted future rewards so next we need to feed all of this through our neural network through our training operation we need the actions which is the action we actually took and we also need the target values which we just calculated so the next thing we need to do is to handle the prospect of decreasing epsilon over time remember epsilon is the random factor that tells the agent how often to take a random action the goal of learning is to eventually take the best possible actions so you want to decrease that epsilon over time so we handle that by allowing the agent to play some number of moves randomly so we'll say a hundred thousand moves randomly and we want to dictate some minimum value because you never wanted to do purely greedy actions because you never know if your estimates are off so you always want to be exploring to make sure your estimates are not off and you can decrease epsilon over time any number of ways you can do it linearly that's what deepmind did you can use square roots you can use any number of things I'm just gonna do this we're gonna multiply it by some fraction of one a bunch of nines that'll give you a really slow decrease of epsilon over time so the agent takes a lot of random actions and does a lot of exploration sorry I flipped my sign there thought that didn't look right so yeah if it tries to drop below 0.01 we're gonna go ahead and set it there okay so that is the learning function next up we have to handle the functions to save the models and this will just call the save check point function for the respective networks next function we need is a way of updating our graphs so what we want to do is we want to copy the evaluation Network to the target Network and this is actually a little bit tricky so what you want are the target parameters this is why we saved them earlier and you want to perform a copy operation on them now the reason this is non-trivial is because you have to pass in a session and the decision of which session to use is non-trivial so you have to use the session for the values that you're trying to copy from not copy to so if you had Q next you would get an error which took me a little bit to figure out so that is our agent class next up we have to code a a main function so of course we have to start again with our imports I'm going to import gin and we want to import a network we will also need numpy to handle the reshaping of the observation we're gonna truncate it to make the workload on the knurl the GPU a little bit lower so important um pi as NP if you want to save the the games you can actually use that with you can do that with the wrappers so I won't put that in here but I will put that in my github version so you can just do a git pull on this and you will have the way of saving the games so first thing we have to do is pre process our observations and the reason you want to do this is because you don't need all of the image we don't need the score and we also don't need color we just need one channel so I've already figured it out if you take row 30 onward and all the columns then you will get a good image and you want to reshape it like so the next thing you have to handle is a way of stacking the frames this is because the agent can't get a sense of motion by looking at only one picture right nobody can get in a sense of motion from looking at one picture worse yet the opening I Atari library returns a sequence of frames where it could be a random number between two three or four so to get a sense of motion we have to actually stack a set of frames and we're gonna handle that with this function so we'll just keep a running stack of frames the current frame to save and the buffer size which just tells you how many frames to save so at the top of the episode you're not gonna have anything to save all right there will be no stacked frames so you want to initialize it so to be the buffer size by the shape of each individual frame next you want to iterate over that and say each row which corresponds to each image in your stack gets assigned to a frame so otherwise it's not the beginning of the episode and you want to pop off the bottom observation shift the set of frames down and append the new observation to the end so instead of one two three four it'll be two three four and then frame five so let's do that equals sorry zero two buffer size minus one so this will shift everything down and this will upend the current frame to the end of the stack next we have to do a reshape and this is basically to make everything play nicely with the neural network all right now we're ready for our main function let's go ahead and save scroll down good grief breakout b0 is the name of the environment this is just a flag to determine if you want to load a checkpoint sorry so yeah epsilon starts out at 1.0 so the agent takes purely random actions our learning rate alpha will be something small by zero zero zero two five and we've reshaped our input so it needs to be 180 instead of 210 180 by 160 by 4 because we're gonna stack four frames the breakout library that's right the breakout game has three actions and a memory size of 25,000 which as I said takes about 48 gigabytes of RAM so go ahead and scale that based on however much you have if you have 16 gigs go ahead and reduce it by down to something like 6 or 7,000 so the batch size tells us how many batches of memories to include for our training we use 32 if flow checkpoint is true then we want to load the models next thing we need is a way of keeping track of the scores we will need a parameter for the number of games stick with 200 to start with a stack size of 4 and an initial score of 0 so the memory is originally initialized with a bunch of zeros that is perfectly acceptable but another option something else we can do is we can overwrite those zeros with actual gameplay sampled from the environment so why don't we do that so and the actions are just chosen randomly right it's just to give the agent some idea of what is going on in the environment so you want to reset at the top of every episode you want to pre-process that observation you want to go ahead and stack the frames and then play your episode so there's a bit of a quirk here the the agent saves its actions as zero one or two but the actions in the environment are actually one two and three so we have to sample from that interval and then add one take the action subtract one and then save the the transition in the memory just a bit of a kludge not a big problem so then we want to add that to our stack go ahead and subtract off one from the action and store that transition in memory and then finally set the old observation to be the new one let's go ahead and use a string print statement okay now that we've loaded up our agent with random memories we can actually go ahead and play the game one thing I like to do is print out the running average of the last ten games so that we get an idea of if the agent is learning over not over time or not you do it like this and then just print that out and I also like to know if epsilon is still won or if it has actually started to decrease over time and we also want to save the models every 10 games and if on any other game we just want to print out the episode number and the score so we can actually scroll up here copy this not done whoops there we go so instead of a random choice it is agent thought choose action and that takes in the observation oh but we did forget to do the same thing here grab those and put them there so the top of every episode we reset the environment pre-process the observation and stack the frame is quite critical so when you still have to do the same thing with adding and subtracting one we want to save the transitions and the only other difference is that we want to learn at every step and at the end of the episode we want to go ahead and append our score so that way we can keep track of our average and if you want to get fancy you can go ahead and add in a plotting function so that when this is done learning you can plop learning over time and you should see an upward slope over time and if you want to plot the epsilon so you should see a gradual downward slope as well so I'm gonna leave it here because the model is still training on my computer it's run about 3,000 iterations 3,000 games that is and it at best it gets two to three points per set of five lives so it's really gonna need a couple days of training to get up to anything resembling competitive gameplay but once it's done I'll upload another video that explains how Q learning works in depth and in detail and I'll include the performance from this particular model in that video so you can check it out then feel free to run this yourself one I finish the model I will go ahead and upload the train version to my github so you are free to download it if you don't have any hard any decent GPUs to train this with go ahead leave a comment down below subscribe I will see you all in the next video you welcome back everybody in this series of videos we're going to code up a deep Q network invite aurash to play space invaders from the Atari open a gym library in this first video we're gonna code the convolutional neural network class and the second video will code up the aging class itself and in the third we'll get to coding the main loop and seeing how it all performs let's get to it so if you're not familiar with the DBQ Network I would advise you to check out my earlier video bite size machine learning what is a DBQ network is a quick little explainer video that gives you the gist of it within about 4 and 1/2 minutes if you already know then great we're ready to rock and roll but if you don't I'll give you the brief too long didn't read so basic idea is that we want to reduce some rather large state space into something more manageable so we're going to use a convolutional neural network to reduce the representation of our state space into something more manageable and that'll be fed into a fully connected neural network to compute the Q value so in other words the action values for any particular given State we're gonna leverage two different networks a network that is going to tell us the value of the current state as well as a network to tell us the value of these successor states this is going to be used to calculate the values of the target and which is the purely greedy action as well as the behavioral network which is the current predicted States the difference between those two is used in the loss function for our stochastic gradient descent or root mean squared propagator so let's go ahead an get started we're gonna start as usual by doing our imports and there are quite a few in pi torch great library one thing I really like about it is that it's highly expressive you don't have a whole lot of boilerplate code as you do in something like tensorflow tens flows enormous lis powerful and if you want to do cross app type of software then that's going to be really your only choice for our purposes this is going to be precise we want to use and of course numpy so we're going to define a class for our deep Q Network and that will derive from an end module this is kind of how my torch handles everything you want to derive your class from the module the base module so that way you get access to all of the goodies and we'll pass in our learning rate alpha for our stochastic gradient descent algorithm all right so the network is comprised of three convolutional layers and two fully connected layers so the first one is just going to be a two-dimensional convolution that's going to take one input channel the reason is that the agent doesn't really care about color so we're going to go ahead and make things easier to reduce our computational requirements by going to a grayscale representation we'll take 32 filters with an 8×8 kernel a stride of four and a padding of one second layer is going to be pretty similar this one of course will take 32 channels in give 64 channels out and do a four by four kernel with a stride of two and our third convolutional layer is going to be of course two-dimensional as well it takes in 64 outputs 128 with a three by three kernel that's it for the convolutions a if you don't know how accomplished a neural network works I'm gonna go ahead and link a video here that will give you the the the basic idea of how those work and if you would rather see how how this stuff works in text form I don't know if I mentioned this earlier but there is an Associated blog article my website neural net thought a I yeah I bought that domain I'll link it in the description go ahead and check it out please so our fully connected layer is the following and we're gonna have to do a little bit of magic here so after running this stuff a bunch of times I know the dimensionality of the images we're gonna of the convolutions we're going to get out and what we want to know is we're going to be taking a set of filters you know convolved images that have had filters applied to them 128 to be exact and we want to unroll them into a single and image right to feed into our neural network so that's gonna be a hundred twenty-eight channels that are nineteen by eight in size and that's a magic number I got from just by running the code and trial and error and then our output layer is going to take in those 512 units and output 6y6 that's because we have a total of six actions in the game of space invaders you have a total of six actions which are the subset of the full twenty available in the atari library basically the agent can do nothing it can move left it can move right it can shoot while standing still and it can move left and right and shoot when we get to the main video go ahead and show all those actions we also need an optimizer equals opt-in Armus prop and we want to tell it to take the parameters of our object for the weights and our learning rate is going to be alpha that we've passed in our loss function is going to be a mean square error loss and the components of that will be the target and the current predicted q functions and of course the target is just the agree D value right because Q learning is an off policy method that uses a epsilon greedy behavior policy to update the purely greedy target policy another thing we have to do in pi torch is tell it which device we're using so t dot device and that'll be cuda 0 if T cuda is available else cpu and what this is telling it is that if the GPUs available go ahead and use it which is of course always preferable otherwise just use the CPU and we also have to tell it to actually send the network to the device maybe an amusing 0.4 I don't know if n 1.0 that's actually required but as of the time I'm coding this up it's required so something we had to live with okay so this is the sum and the whole of our network only thing or that remains to do is to feed forward our observation and that will take the ops no.not observation observation not only can I not type I cannot speak as well I suck at life I don't know why you guys listen to me so what we want to do is the observation vector is the we're going to use a sequence of frames and those frames are the reduced images from the screen so when we get to the third video on on actually playing the game you'll see that you don't need the full frame to actually figure out what's going on you can actually truncate it to get a really good idea in particular the agent can only move left and right a certain amount so we go ahead and truncate the sides you don't need the score at the top because we're keeping track of the score so you don't need to some of the bottom so you can actually truncate the image a little bit to reduce your memory requirements and we're going to convert it into a grayscale so we get rid of to it we're just going to average the three channels to make them into one and we also have to pass in a sequence of frames right because if I only show you one frame of the video game you can't tell what's going on you don't get a sense of motion from a single frame right you need at least two to get a sense of motion we'll be using three I believe in the original implementation of this algorithm for for deep mind they used to form we're gonna go ahead and use three just to be different I haven't I suspect that's a hyper parameter it's not something I played with I encourage you to do that Oh another thing is that we must train this on a GPU if you try to do it on a GPU it's gonna take 7,000 years so if you do not have access to a GPU go ahead and leave a comment below and I will find a way to get my pre trained model to you so you can go ahead and play around with this I have a 1080 Ti not not top of line anymore but it's still quite good for this particular apps application so back to the topic we have to convert our sequence of frames or observation to a tensor and we have also want to send that to the device this is a big heal the era T of Pi torch as far I'm aware I'm not the world's leading expert but as far as I understand it you got to tell it to send the network as well as the variables to the device as well not a big deal just something to be aware of next thing we have to do is resize it so when you are given a sequence of frames in the opening ahjuma and really any other format of image is going to be height and width by channels whereas if you look up here this actually expects the channels to come first and so we have to reshape the array to accommodate that so PI torch is built in phone you for that is view you want a minus 1 to handle any number of stacked frames one channel in the beginning and we're gonna pass in 185 by 195 image that doesn't seem right actually sorry it's not 195 I'm like the original image is 210 by 160 it can't be hundred eighty-five hundred ninety-five total brain fart there it's 100 85 by 95 okay so we have taken in our sequence of frames we've converted it into a tensor and flat and converted it into the proper proper shape for our network the next thing we want to do is activate it and feed it forward so we call the first convolutional layer on our observation and we use the r lu function to activate it and we store that in the new variable observation and then we learn how to type and then we do the same thing with that output pass it through the second convolutional layer and then we do the same thing again we pass through the third convolutional layer and we're almost home free so the next thing we have to do is we are outputting a 128 by 19 by 8 ooh said of arrays next thing we have to do is oh by the way that noise was my Pomodoro timer if you're not using the Pomodoro Technique I highly recommend it it's an enormous productivity hack but anyway next thing we have to do is take our sequence of convolved images and flatten them to feed into our fully connected neural network so begin use the View function to do that not -1 for whatever number of frames we pass in 19 by 8 is what we get out and I found that out through experimentation and we know that's 128 channels because we dictate that here right here 128 output channels so boom we've flattened it and the only thing that remains is to feed it forward and we use a value activation function there FC 1 fully connected layer 1 and then it's not observation but its actions self dot FC two observations so we'll feed it through the final fully convolve layer and store that as a variable called actions and go ahead can return it and the reason we're doing that is because what we're getting out is aq value for each of our actions and we want to pass that back now keep in mind that we're passing in a sequence of frames and so we're gonna get back is a matrix it's not going to be a single array of six values it's gonna be six values times whatever number of rows of images you pass in so if we pass in three images it's gonna have three rows and six columns and that'll be important later when we actually get to choosing the actions speaking of which I'm going to cut it short here we've already I've already rambled for about 12 13 minutes in part two we're going to take a look at coding of the aging class I have structured it this way because the agent actually has to network so it made sense to kind of stick the network in its own class we'll get to coding up the agents init function how to handle its memory how to store transitions how to choose actions and how to actually implement the learning that's a much longer project so we'll stick that in its own video and then in part three we're gonna get to actually coding of the main loop and seeing how it performs hey if you liked the video make sure to like the video hey if you don't like it go ahead the thumbs down I don't really care let me know what you think questions comments leave them down below if you made it this far please consider subscribing I look forward to seeing all of you in the next video welcome back everybody in the previous video we got decoding the convolutional neural network class for our deep Q learning agent that's going to play space invaders if you haven't seen that yet go and click the link to go ahead and watch at first otherwise you probably won't know what's going on if you're the kind if you're the type of person that would prefer to have a written set of instructions go ahead and click the link down below I'll link to the Associated blog article for this particular tutorial series when we last left off we just finished returning the set of actions which is the set of Q values for our sequence of frames so of course in this video we're gonna go ahead and get started coding up the agent class which is where all the magic is gonna happen oh and of course as always I have left the code for this in my github under the YouTube directory it gets its own directory because there's a few different files here I'll link that below as well and if you aren't following me on get up you probably should because that's where I post all my stuff okay next up we have the agent classroom and this is just going to derive from the base object class nothing fancy here we need a basic init function and this is going to take gamma which is our discount factor so the agent has a choice of how to value future rewards in general gets discounted by some value because a reward now is worth more than a reward in the future just like with us you need epsilon for our epsilon greedy action selection the alpha for the learning rate the max memory size a variable to keep track of how low we want epsilon to go something to keep track of how long we replace how often we're going to replace our target Network I'll get to that in a moment or in a few minutes and the action space and that's just a list of variables from 0 through 5 those correspond to all the possible actions for our agents and you just set these to the appropriate variables being careful not to turn on your caps lock key so what these all are are just hyper parameters for our agents we don't need to store alpha because we're just going to pass it in to the network and then never touch it again storing the action space allows us to accommodate the epsilon greedy action selection later the men's size is used for efficiency purposes so we're going to keep track of state action reward transitions you don't want to store an infinite number of them you only want to store a subset you don't need to store all of them anyway there's no more practical benefits as we're just sampling a subset anyway so we just use some rather large memory to keep track of all of the state transitions that we care about keep track of how many steps and the learn step counter that is to keep track of how many times the agent has called the learn function that is used for target network replacement if you're not familiar with it target network replacement is when you swap the parameters from the evaluation network to the target Network my experience has been that it doesn't actually help things so I'm not going to do it I'm going to code it in because it's an important part of the topic but in this particular case I haven't found it to be helpful but I am played with it too much I saw that it does quite well without it so why break it so I'm going to store the memory as a list and the reason I'll use a list instead of a numpy array is because the associated cost of stacking numpy rays is really really high so it's much much faster to store a list of Lists and then convert to a numpy array when you learn and that's much faster than keeping track of a set of an umpire arrays and just stacking them the stack operation is incredibly computationally prohibitive so for something like this that's already computationally expensive doesn't make any sense and keep track of the total number of memories stored so that way you don't overwrite the array we want to keep track of how often we want to replace the target Network and then we need our two networks cue eval and that is just pass in the Alpha that is just the agents estimate of the current set of states and cue next is the agents estimates of the successor set of states so recall in deep Q learning we calculate the value the max value of the successor state as our greedy action that's our our actual target policy and our behavior policy that we use to generate data is epsilon greedy that's it for our constructor the next thing we have to worry about is storing memory transitions so transmission so we are interested in the current state the action taken the reward received and the resulting state because those are the things that that allow the agent to learn so we have to make sure that we aren't over our memory so if the mem counter is less than self dot mem size then just go ahead and append that as a list and again we're doing this because it is much cheaper computationally to append alistun it is to actually stack a numpy array if we have filled up our memory then we want to overwrite the position and memory that is determined by the modulus but men's size so this will guarantee we are bounded by his zero all the way up to our menses and of course this is a list of lists it's just state action or award state underscore and we want to increment the memory counter pretty simple huh pretty straightforward next up we have the function to choose an action and this will take the observation and again just to remind you as we discussed in the first video we were passing in a sequence of observations because we want to capture some information about the temporal dependence of what's going on again with one frame you can't tell if the aliens are moving left or right and so you don't know if you should move left or right of course you know which way your Bo let's go you know which way there is go but as far as movements concerned you need at least one frame so as always we're going to be using numpy to calculate a random number for us to our epsilon greedy action selection and you want to get the value of all of the actions for the current set of states self cute eval dot forward so what we're doing now is for propagating that stack of frames through the neural network the convolutional neural network and the fully connected layer to get the value of each of each of the actions given the you're in some set of states denoted by the observation so if rain is less than 1 minus epsilon then you want to take the Arg max write the maximum action and recall that we have stored the actions as a matrix the return is a matrix because you have the number of rows that correspond to the number of frames you pass in and each of the columns correspond to each of the six actions so you want to take the first first axis and if you're taking a random action then you just choose something at random from the action space list and go ahead and increment your steps and return the action that you've chosen the way I the reason I use one minus Epsilon is you can use the probability epsilon we can use the probability 1 minus Epsilon this is really a more of an soft soft Epsilon kind of strategy rather than purely greedy but it doesn't really matter this is going to give you the probability of choosing the maximum action of epsilon plus epsilon over 6 because there of course the greedy action is the subset of all actions so there's a 1 over 6 probability that when you take the quote-unquote non greedy action you'll actually end up getting the greedy action right next thing we have to worry about is how the agent is going to learn and this is really the meat of everything so we are doing batch learning so you want to pass in a batch size and the reason we're doing batch learning is a number of different reasons so you want to break correlations between state transitions you it's even ok if the the batch overlaps different episodes but you want to get is a good subsampling of the overall parameter space so that way you don't get trapped in the local minima so you randomly sample these state transitions through your memory otherwise you could end up if you were play the whole memory you could end up getting trapped in some local minimum it's a way to improve the efficiency of the algorithm to converge to a purely optimal strategy excuse me Thursday all right first thing we have to do since we're using batch learning is we have to zero out our gradients and what this means is that the grades can accumulate from step to step the network the PI torch library can keep track of it we don't want that if you do that if you don't do the zero grad then you'll end up with basically accumulating for every single batch and that's really full learning rather than batch learning next thing we have to check for is if we're going to replace a target Network so if it's not none and if it's time to do it release now replace target count equals zero then we want to actually replace our target Network what we do there is we take advantage of the PI torts representation of our network in that case it's just we can convert our entire network into a dictionary which is really cool so it's self queue next load States dipped so it's going to load a state to the network from a dictionary which network the evaluation network which we're going to convert to a state dictionary we're not actually going to use this in our employment ation but I included here for completeness next up we want to know we want to select a random sub sample of the memory so want to make sure we don't go all the way past the end of our array so if our current memory counter plus batch size is less than our total memory then we're free to go ahead and select any point of our memory because we know we're not going to go beyond it range otherwise if there is an overlap then we want to mem start to just be an int in the range good grief I hate that minus one just to be safe okay so that is how we choose where to start just andum number somewhere between zero and the max memory if we have enough left over in the memory for the bat to accommodate the batch otherwise subtract that off and select something from that subset then we're gonna go ahead and get that mini batch and convert that to a number right so I suspect this is not the most efficient way of doing this and the reason is that you run into some difficulties in the way in which you pass things into the torch library and as a certain set of expectations that are kind of finicky not to say it's bad it's just something that I wasn't expecting but it works nonetheless so what we want to do next is feed-forward both of our networks we want to know what is the value of our current state and what is the value of the successor state after we take the action whatever action we're looking at in our batch so that's just QE Val forward what are we feeding forward here's where we got some happiness we've got to convert it into a list and the reason is that our memory is a numpy array of numpy objects because the observation vectors are numpy objects so if you don't convert it into a list then you have a numpy array of nd array on Dex and tensor PI torts will complain we don't want tensor you don't want pi torch to complain so we want to access all all rows right our entire batch and the state is just the zeroth element and you want all of the variable you want all of the pixels so you need the second set of : there and we want to send this to the device so if that Q eval dot device and this ensures that the this network this set of variables gets sent to our GPU as well and the next thing we need is the value of the successor States and that is just all the rows third column and all of the all of the members of that array and here I've just no I got us quit using the visual studio code this is annoying but scroll up here so all I have done here is I'm putting it on cue eval dot device because the device for both the evaluation and the next network are the same so I only one GPU by two GPUs then this would matter it doesn't matter so you just call it that next thing we want to know is what is the max action for our current for our next state right because the update rule actually calculates takes into account the purely greedy action for the successor state maybe I can find an image of what I'll flash it here on the screen to make life easy but next thing we have to know is the max action and we use the Arg max function for that and remember we want to take the first dimension because we're the actions queue next to predicted and queue next are actually our actions that's what we get from the feed forward and that is batch size times number of actions and we want the number of actions so that's first dimension and I am a little bit in awe retentive here by sending you to the device just to make sure nothing gets off the device because this stuff slows to a crawl if you run out of the CPU we need to get our rewards and that is obtained from our memory all the rows and the second element 20 parentheses and one thing we want is our loss function to be zero for every action except for that max action so we have Q target equal Q predicted but we want Q target for all of our entire match but the max action to be rewards plus self dot gamma times the actual value of that action so T dot max just gives you the value of the maximum element and Arg max gives you the index and you want to find the maximum action the value of it so those are our target and predicted values that's what we used to update our loss function the next thing we have to handle is the epsilon decrement so we want the agent to gradually converge on a purely a greedy strategy for its behavior policy and the way you do that is by gradually decreasing epsilon over time I don't like to let it simply go to start decreasing right away so I have some set number of steps I let it run first and I use a linear decrease over time you can use exponential quadratic you can use what a reform you want it's not to my knowledge it's not super critical but I could be completely wrong this seems to work as we'll see in the third video so if we don't have enough left then just set it to epsilon end scroll up here and we're almost done so we have everything we need we have everything we need to compute our loss which is this Q target and Q predicted those are the values of Q predicted is the value of the current set of states and Q target is related to Q next right it's the Q target is the max action for the next successor State and so we're almost there so the loss is just the mean squared error loss if you recall from the first video where we define the loss function and as is the torch style you have to send it to the device and then we want to back propagate with lost backward and we just want to step perform one iteration and go ahead and increment our learn step counter and that's basically it so there was a lot of code there let's go back and take a look really quick so first thing you want to do is 0 your gradients so that way we're doing actual batch optimization instead of full optimization then we want to check to see if we're gonna if it's if we are going to replace the target networking if it is time and if it is load the state dictionary from the Q eval on to the Q next network next up calculate the start of the bet of the memory subsampling I'm making sure to get some subset of the array go ahead and sample that batch of memory and converted to an umpire array Oh Pomodoro timer so if you guys aren't using the Pomodoro method to work I highly recommend it go look that up if you don't know what it is but anyway convert that to an umpire array and then go ahead and feed forward the the current state and the successor state you using the memory subsample making sure it is sent to your device next thing we have to know is the maximum action for the successor state and calculate the rewards that the agent was given set the cue target to cue predicted because we want a loss for every state except them the loss for every action except the max action to be zero and then update the value of Q target for the max action to be equal to rewards plus gamma times the actual value of that max action next up make sure that you're using some way of decrementing epsilon over time so it should it converges to some small value that makes the agent settle on an mostly greedy strategy in this clay in this case I'm using 5% of the time for a random action finally go ahead and calculate the loss function back propagated step your optimizer and increment your step counter and that is all she wrote for the learn function and the aging class slightly more code than in the network class but still fairly we have a type of they're still fairly straightforward I hope this has been informative and in part three we're gonna go ahead and code of the main loop and see how all this performs I look forward to seeing you in the next video any comments questions suggestions go ahead and leave them below if you made it this far please consider subscribing I look forward to seeing you all in the next video and welcome back every to part three of coding a deep Q learning agent in the open AI gym atari library and parts 1 & 2 we took a look at the deep neural network class as well as the agent class for our agent and in part three we're going to finally put it all together into the main loop to play the game and see how our agent does let's get started you so we begin as usual with our typical imports we're gonna need the gym of course and we're gonna need to import our or model class so from model on import deep sorry deep queue model and agents and I also have a utility function I'm not going to go over the code issues a trivial function to post the to print the plot the decaying epsilon and the running average of previous five scores from utils import learning and oh and by the way so if you haven't seen parts one and two go ahead check those out and if you want the code for this please check out my github if you would like to see a blog article that details all this in text if you missed something on the speech then go ahead and click the link down below so it's giving me some issue oh it's not deep Q model this is the the downside of talking and typing at the same time I'm not that talented so we want to go ahead and make our environment and the space invaders V 0 another thing to know is that there are implementations of the environment where instead of being passed back an image of the screen you're passed back like a ram image something like that never use it sounds kind of interesting it may be something for you to check out and leave a comment down below if you've played with having experience with it or if you think it sounds cool so we want to make our agent and I'm gonna call it brain big brain pinky in the brain baby gamma 0.95 and epsilon of 1.0 and I'm using epsilon at one point zero because it starts out taking purely random actions and converges on a mostly greedy strategy learning rate of 0.03 max memory size 5,000 transitions and we're not going to do any replacement so you may have noticed when we were building our agent that the memory was instantiated as an empty list if you're going to use an umpire race one thing you would do is just create an array of zeros and the shape of you know your images as well as the total number of memories I'm gonna do something slightly different so one way to help agents learn is to actually have them watch videos of humans playing and in fact the deep mind team taught alpha alpha zero go to play by showing abort configurations and saying which which player one so you it's perfectly legitimate to leverage the experience of humans I'm not gonna play the game for the agent but what I'm gonna do is allow the agent to play the game at totally random it's just gonna play a set number of games to fill up its memory using totally random actions so it's a bit of a hack but I kind of I don't know to me it seems legitimate but some people may frown upon it I don't really care it's how I chose to solve the problem and it seems to work fairly well so brain dam Emma sighs we have to reset your environment reset you're done flag and play an episode so here i'ma let you know the action so zero is no action one is fire two is move right threes move left four is move right fire five is move left fire so that's zero through five total of six actions choose one at random in V that action space sample if you want to verify that these do that go ahead and code up a simple loop you know while loop while not done take action zero and render it and see what it does so next you want to go ahead and take that action and the other thing I do is the and I'm on the fence about doing this I haven't tested it but in my experience with other environments in more basic algorithms that my experience is that it makes sense to penalize the agent for losing so you don't want the agent will naturally try to maximize its score but you want it to know that losing is really bad so if you're done and the and this may be something I change on the github so if you go to the github and see this isn't there it means that I tested it and decided it was stupid but as of right now I think it's okay I'm always open to change my mind though so I want to let it know that losing really really sucks and I want to store that transition and here's a bit of a magical part so as I said in the in the first video in the convolutional neural network we want to reshape it down from three channels into one because the the agent doesn't really care about color it only cares about is an enemy there or not right and I think of the information from black and white so I'm gonna take the mean over the three channels to get down to a single Channel and I'm also going to go ahead and truncate it and the reason is that there isn't a whole lot of information around the borders of the screen that the agent needs so we can get away with reducing our memory requirements without losing anything meaningful and I'm going to go ahead and flash in some images here of what that looks like but what I do is take the ops observation vector and I'm gonna take 15 to 230 to 125 and the mean is performed over access to and we also want to store our action and reward you guys can't see that there so we store the action and reward as well as let's go ahead and copy this we also want to copy we also want to store the successor state oh good grief life is so hard there we go and that's observation underscore which is which is the successor state and then set your observation to observation underscore and then when you're done just let yourself know okay okay and we are almost there so next thing you want to do is keep track of the score so you don't know how well the agent is doing I had this variable Epsilon history Oh keep track of the history of epsilon says the decreases over time because we want to know the relationship between the score and the epsilon and we'll run it for 50 games and we'll take a batch size of 32 memories the batch size is an important hyper parameter but what you find is that using a larger batch size may get you a little bit better performance it slows down training tremendously so on my system within I seven so we need 20 X 32 gigs of ram 10 atti batch size of 32 means that 50 games is gonna run in about a half an hour and it takes quite a bit of time using a larger batch size doesn't seem to produce a whole lot better performance but it certainly slows it down by more than a factor of two so then there's nonlinear scaling there and we want to know that we're starting game I plus 1 with an epsilon of something dot say four significant figures right now it Upsilon and we want to go ahead and append the agents epsilon at the beginning of the episode reset our done flag and reset our environment and okay so the next thing we want to do is as I said why is this unhappy invalid syntax oh now I've offended it what have I done Oh forgot a comma there we go so next thing we want to do is construct our sequence of frames as I said we're going to pass in some sequence of frames to allow it to get some conception of movements in the system so I have a rather ugly way of doing this as usual but the first thing I want to pass into it is the first the first observation vector from the beginning of the game and I have broken something again what have I broken frames none by some oh of course of course okay so the score for this episode is 0 I want to keep track of the last action so this is something I'm not sure about I must confess to you guys the documentation on the open a IgM is rather lackluster what it says is that each action will be repeated for K frames where K's the set two three or four so I guess it gets repeated some random number times so since I don't know how many times and I want to keep passing in a consistent set of observation vectors of frames I will do something hacky so I'll keep track of the last action and I will only update my action every third action so I want as pass in to see two three frames and repeat the action three times I'm kind of forcing the issue again it seems to work I don't think it's the best implementation but this is you know just my quick and dirty implementation so if we have three frames go ahead and choose an action based on those and reset your frame variable to an empty list otherwise go ahead and do what you just did scroll down so then we want to go ahead and take that action keep track of our score and append our new observation I'm just gonna copy that yep copy that and then I am going to go ahead and tell it that losing is very bad we don't like losers and this l dot lives thing is the number of lives the agent has kale is just the emulator in which the opening at our opening items Atari libraries built and next we have to store a transition I'm going to copy that code precisely because I have fat fingers and apparently screw things up so copy that and then underscore brain learn batch size keep track of our action and here you can put in a render if you want to see what it's doing if not then at the end of the episode end of the episode you know to append a score which we're going to plot later and I like to print the score so that I can kind of get an idea of how the agent is doing and we need to make a list of the X variable for plotting function again for the plotting function just go ahead and check out my github for that that's the easiest way and I'm gonna set a file name I'm just gonna call it test for now plus strain num game something like that plus dot PNG so we have a file name and then we're gonna plot that we want to plot the score is the epsilon history and the file and pass in the file name so that it saves it and that's all she wrote for the for the main loop I've gone ahead and run that so I'm gonna go ahead and flash in the results here so as you can see the epsilon decreases somewhat linearly over time not somewhat completely linearly and as it does so the agents performance gradually increases over time and keep in mind here I am plotting the previous the average of the previous five games the reason I do that is to account for significant variations in game to game play right so the agent is always gonna choose some proportion of random actions and that means it can randomly choose to move left or right into the enemy's bullets so there's always some games that it's gonna get cut short so the general trend in the performance is up up until the very end when it actually takes a bit of a dive and I've seen this over many many different iterations I suspect this has to do with the way that it is navigating through parameter space so to find pretty good pockets and then kind of shift into a related other local minima what isn't quite as good if you let it run long enough this will eventually go back up but it does have some a Silla Tory behavior to it but you can see that it increases its score quite dramatically and in this particular set of runs I saw scores in excess of 700 points which is actually pretty good for an agent so let's go ahead and take a look at what it looks like with the target network replacement so here you can see a dramatically different behavior so in this case epsilon decreases and the score actually decreases over time and you know actually I don't quite know why it does this so the the oscillations down there are almost certainly from the target Network replacements it could be a fluke but I have run this several times where I see this type of behavior where with the target never replacement it totally takes a nosedive I don't think I screwed up my implementation please leave a comment below if he saw something it looks off but as far as I can tell it looks it's implemented correctly just copy one state dick to another nor a big mystery there but that's why I choose to leave it off you get a significant variation and performance more of the stories to go ahead and leave the target network replacement off and that's it for this series so we have made an agent to play the Atari Atari game of space invaders by gradually decreasing epsilon over time we get really good performance several hundred points in fact actually learns how to play the game quite well probably going to go ahead and spin in a video of it playing here so you can see how it looks if this has been helpful to you please consider subscribing go ahead and leave a comment below if you have one a question suggestion anything go ahead and I answer and read all my comments and go ahead and smash that like button guys so I hope to see you all in the next video and take care in this video I'm going to tell you everything you need to know to start solving reinforcement learning problems with policy gradient methods I'm gonna give you the algorithm and the limitation details upfront and then we'll go into how it all works and why you would want to do it let's get to it so here's a basic idea behind policy gradient methods a policy is just a probability distribution the agent uses to pick actions so we use a deep neural network to approximate the agents policy the network takes observations of the environment as input and outputs actions selected according to a softmax activation function next generate an episode and keep track of the state's actions and rewards in the agent's memory at the end of each episode go back through these States actions and rewards and compete and compute the discounted future returns at each time step use those returns as weights and the actions the agent took as labels to perform back propagation and update the weights of your deep neural network and just repeat until you have a kick-ass agent simple yeah so now we know the what let's unpack how all this works and why it's something worth doing remember with the reinforcement learning we're trying to maximize the ages performance over time let's say the agents performance is characterized by some function J and it's a function of the weights theta of the deep neural network so our update rule for theta is that the new theta equals the old theta plus some learning rate times the gradient of that performance metric note that we want to increase performance over time so this is technically gradient ascent instead of gradient descent the gradient of this performance metric is going to be proportional to some overstates for the amount of time we spend in any given state and a sum over actions for the value of the state action pairs and the gradient of the policy where of course the policy is just the probability of taking each action given we're in some state this is really an expectation value and after a little manipulation we arrive at the following expression when you plug that into the update rule for theta you get this other expression there are two important features here this G sub T term is the discounted feature returns we referenced in the opening and this grading of the policy divided by the policy is a vector that tells us the direction in policy space that maximize the chance that we repeat the action a sub T when you multiply the two you get a vector that increases the probability of taking actions with high expected future returns this is precisely how the agent learns over time and what makes policy gradient methods so powerful this is called the reinforce however by the way if we think about this long enough some problems start to appear for one it doesn't seem very sample efficient at the top of each episode we reset the aeneas memory so it effectively discards all its previous experience aside from the new weights that parameterize its policy it's kind of starting to a scratch after every time it learns worse yet if the agent has some big probability of selecting any action in any given State how can we control the variation between the episodes for our state spaces lots are way too many combinations to consider well that's actually a non-trivial problem of policy gradient methods then part of the reason our agent wasn't so great at Space Invaders obviously no reinforcement learning method is going to be perfect and we'll get to the solution to both of these problems here in a minute but first let's talk about why we would want to use policy grades at all given these shortcomings the policy gradient method is a pretty different approach to reinforcement learning many reinforcement learning algorithms like deep q-learning for instance when I am estimating the value of a state or state action pair in other words the agent wants to know how valuable each state is so that it's Epsilon greedy policy can let it select the action that leads to the most valuable states DG repeats this process over and over occasionally choosing random actions to see if it's missing something the intuition behind Epsilon greedy action selection is really straightforward figure out what the best action is and take it sometimes do other stuff to make sure you're not entirely wrong okay that makes sense but this assumes that you can accurately learn the action value function to begin with in many cases the value or action value function is incredibly complex and really difficult to learn on realistic time scales in some cases the optimal policy itself may be much simpler and therefore easier to approximate this means the policy grading agent can learn to beat certain environments much more quickly than if it relied on an algorithm like deep Q learning another thing that makes policy gradient methods attractive is what if the optimal policy is actually deterministic and really simple environments with an obvious deterministic policy like our grid world example keeping a finite epsilon means that you keep on exploring even after you found the best possible solution obviously this is optimal for more complex environments the optimal policy may very well be deterministic but perhaps it's not so obvious and you can't guess at it beforehand in that case one could argue that deep Q learning would be great because you can always decrease the exploration factor epsilon over time and while the agent to settle on a purely greedy strategy this is certainly true but how can we know how quickly to decrease Epsilon the beauty of policy gradients is that even though they are stochastic they can approach a deterministic policy over time actions that are optimal will be selected more frequently and this will create a sort of momentum that drives the agent towards that optimal deterministic policy this really isn't feasible in action value algorithms that rely on Epsilon greedy it's variations so what about it shortcomings as we said earlier there are really big variations between episodes since each time the agent visits the state it can choose a different action which leads to radically different future returns the agent also doesn't make very good use of its prior experience since it discards them after each time it learns while this seem like showstoppers they have some pretty straightforward solutions to deal with the variance between episodes we want to scale our rewards by some baseline the simplest baseline to use is the average reward from the episode and we can further normalize the g-factor by dividing by the standard deviation of those rewards this helps control the variance in their returns so that we don't end up with wildly different step sizes only when we perform our update to the weights of the deep neural network dealing with the sample and efficiency is even easier while it's possible to update the weights of the neural net after each episode nothing says this has to be the case we can let the agent play a batch of games so that it has a chance to visit a state more than once before we update the weights for our network this introduces an additional hyper parameter which is the batch size for our updates but the trade-off is that we end up with a much faster convergence to a good policy now it may seem obvious but increasing the batch size is what allowed me to go from no learning at all in Space Invaders with policy gradients to something that actually learns how to improve its gameplay so that's policy gradient learning in a nutshell we're using a deep neural network to approximate the ages policy and then using gradient ascent to choose actions and results in larger returns it may be sample inefficient and have issues with scaling the returns but we can deal with these problems to make policy gradings competitive with other reinforcement learning algorithms like deep view learning if you've made it this far check out the video where I implement policy grades in tensorflow if you liked the video make sure to like the video subscribe comment down below and I'll see you in the next video what's up everybody in this tutorial you're gonna learn how to land a spaceship on the moon using policy gradient methods you don't need to know anything about reinforcement learning you don't need to know anything about policy gradient methods you just have to follow along let's get started so before we begin let's take a look at the basic idea of what we want to accomplish policy gradient methods work by approximating the policy of the agent the policy is just the probability distribution that the agent uses to select actions so we're going to use a deep neural network to approximate that probability distribution and we're going to be feeding in the input observations from the environment and getting out a probability distribution as an output the agent learns by replaying its memory of the rewards it received during the episode and calculating the discounted future rewards that followed each particular time step those discounted feature rewards act as weights in the update of our deep neural network so that the agent assigns a higher probability to actions whose feature rewards are higher so we'll start with our imports and we're gonna want oh s to handle file operations numpy to handle numpy type our operations I mean of course tensorflow to develop our agent we're gonna stick everything in one class whose initialize function takes the learning rate the discount factor gamma the number of actions for the Lunar Lander environment that's just for the size of the first hidden layer of the neural network will default that to 64 the size of the second hidden layer of the deep neural network will also default that to 64 we also need the input dims and in this case that is 8 so the observation isn't a pixel image it is just a vector that represents the state of the environment we also want a checkpoint directory and this will be useful for saving our model later so this action space will be what we use to select actions later on and we're gonna need a number of other administrative type stuff so for instance the state memory is just what the agent will use to keep track of the states it visited likewise for the action memory we want to keep track of the actions the agent took and the rewards it received along the way we also need the layer one size and the layer 2 size what else we okay so now we can move on to the administrative stuff with tensor flow so tensor flow handles everything in what are called sessions so we have to define one of those we're gonna need a function to build the network and when we return from the network we're gonna want to initialize all of the variables so this build Network function is only called once it will load up all of the variables and operations under the tensor flow graph and then we have to initialize those variables with some initial values which will be done at random we also need a way of saving the model because your PC may not be fast enough to run this in a you know short amount of time so you can do this in chunks by saving it and then reloading it as you have time we also need a file to save the check points in so all right next order of business is to actually construct this network so we don't need any inputs for that but we do need a set of placeholders and the placeholders serve as placeholders for our inputs it just tells the tensorflow graph that hey we're gonna be passing in some variables we don't know what they are yet we may not necessarily even know their shape or size but we know what type they are and we want to give them names so if something goes wrong we can debug later so this input will just be the 8 element vector that represents the agents observation of the environment and if you're not really familiar with tensorflow this idiom of saying shape equals a list whose first element is none tells tensorflow that we do not know the batch size of the data we're going to be loading so the inputs could be 10 states it could be a hundred it could be a thousand it could be any number of states and so that none just tells tensorflow hey we don't know what shape it's gonna be so you know just take whatever and this label is going to be the actions the agent took during the course of the episode this will be used in calculating our loss function similarly we have something called G and G is just the generic name for the agents discounted future rewards following each time step this is what we will use to bias the agents loss function towards increasing the probability of actions who generate the most returns over time so now we're going to construct our network so the first layer will just be the input and the number of it'll take a course number of input and input dims on input and then output the layer one size and the activation is just going to be a row u function and then we have to worry about initializing this so when we call the TF global variables initializer it's going to initialize all of the layers and variables with some values we can dictate how it does that and we can think very carefully about it so if you've dealt with deep neural networks in some cases you can get vanishing or exploding gradients that cause the values of the network predicts to you know go to you know either really large or really small values it produces junk essentially there is a function that will initialize the values of all the layers such that they are relatively comparable and we have a minimal risk of that happening and that's called the Xavier initializer and I'm gonna want to copy that because we're going to use that more than once so of course the second layer just takes the first layers input as l2 size as the number of units with a value activation and of course the same initializer and l3 will be the output of our network however we don't want to activate it just yet so this quantity is related to the probability sorry the policy of the agent but the policy must have the property that the sum of the probabilities of taking an action you know the sum of the probabilities for all of the actions must equal 1 and it's the softmax function that has that property and we're going to separate it out so that we can just so that it is a little bit more clean in terms of the the code a little bit more readable for us but the self dot actions variable is what will actually calculate the probabilities of selecting some action and that is just the softmax of the l3 and the name of that is probabilities finally we need to well not finally but next we need to calculate the loss function so you'll stick that in a separate scope and what we need is the negative log probability so excuse me the calculation involves the natural log of the policy and you want to take the gradient of the natural log of something so we need a function that matches that property so call it neg log probability and that's the sparse softmax cross-entropy with log it's try saying that five times fast so log that's is just l3 and this is important because this is part of the reason we separated out l3 inactions because when you're passing in the log gets you don't want it to already be activated because the negative the spar softmax cross-entropy will handle the softmax activation function and the labels will just be the label that we pass in from the placeholder and then of course be the actions the agent took and so the loss is then that quantity negligibly multiplied by the returns gene next we need the training operation and that of course is just our a great descent type algorithm in this case we'll use the atom optimizer the learning rate of whatever we dictate in the constructor and we want to minimize that loss so that is the sum and whole of the deep neural network where we need to we need to code the next question we have is how can we select actions for the agent so the agent is attempting to model its own probability distribution for selecting actions so that means what we want to do is take a state as input pass it through the network and get out that probability distribution at the end given by the variable self dot actions and then we can use numpy to select a random action according to that probability distribution and we're just going to go ahead and reshape this bill at ease so when you run an operation through the tensorflow graph you need to specify a feed dictionary which gives the graph all of the input placeholder variables it's expecting so in this case it wants self dot input and that takes state as input and it's going to return a tuple so we're going to take the zeroth element as so we can get the correct value that we want and then we can select an action according to an umpire random choice from the action space using the probabilities as our distribution and we just returned that action the next problem we have to deal with is storing in action sorry the transitions and of course will store the states and action and reward and this will be useful when we learn and since we're using lists here we're just going to append the elements to the end of the list next we come to the meet of the problem the learning function and this doesn't require any input so the basic idea here is that we are going to convert these lists into numpy array so we can more easily manipulate them and then we're going to iterate through the agents memory of the rewards that received and calculate the discounted sum of rewards that followed each time step so we're going to need two for loops and a variable to keep track of the sum as well as something to keep track of our discounting so let's deal with the memories first that won't work will it there we go so now will instantiate our G factor and get in shape of reward memory we will iterate over the length of that memory which is just length of our episode and so for each time step we want to calculate the sum of the rewards that follow that time step is that correct yes next we take the sum and discount it then of course the discount is just the gamma to the K where K is our time step so then of course the rewards following the teeth the teeth tongue teeth the tea time step is just G some so that's the some that's the way to dis kind of rewards the next thing we have to think about is these rewards can vary a lot between episodes so to promote stability in our algorithm we want to scale these results by some number it turns out that a reasonable number to use is the mean so we're gonna subtract off the mean of the rewards the agent received during the episode and then divided by the standard deviation and this will give us some nice scaled and normalized numbers such that the algorithm doesn't produce wonky results and since we're dividing by the standard deviation we have to account for the possibility that the standard deviation could be zero hence the conditional statement next we have to run our training operation so the underscore tells us we don't really care about what it returns well we have to run the training operation with the feed dictionary that'll take an input which is just our state memory it will take the labels and that is just the action memory and it will take the G factor which is just the G we just calculated now at the end of the episode once we finish learning we want to reset the agent's memory so that rewards from one episode don't spill over into another so there are two more functions we need these are administrative we need a way to load the checkpoint and we'll go ahead and print out loading checkpoint just so we know it's doing something and then we have the saver object and we want to restore a session from our checkpoint file next we want to save a checkpoint we're not gonna print here I take it back reason is that we're gonna be saving a lot and I don't want it to print out a bunch of junk statements so and what these are doing is just either taking the current graph as it is right now and sticking it into a file or conversely taking the graph out of the file loading it into the graph for the current session so that is that next we can move on to the main program to actually test our Lander so we'll need Jim and I didn't I don't know if I made it clear at the beginning but you'll also need the box2d – PI environment so go ahead and do a pip install box2d – pi if you don't already have that so we need to import our policy gradient agent I have this plot learning function I always use you can find it on my github along with this code of course I don't elaborate on it but basically what it does is it takes a sequence of rewards keeps track of it performs a running average of say the last 20 25 whatever amount you want and spits that out to a file we also have a way of saving the renderings of the environment so it runs much faster if you don't actually render to the screen while it's training but you can save those renderings to files in mp4 files so you can go back and watch them later it's how I produce the episodes you saw at the beginning of this video so first thing you want to do is instantiate our agent and we use a learning rate of zero three zeros and a five and I believe all all will need a gamma be something like 0.99 and all and then all the other parameters will just leave at the defaults next we need to make our environment and that's Lunar Lander v2 a way to keep track of the scores the agent received our initial score and number of episodes so I achieved pretty good results after 2500 episodes so we can start with that so the first thing we want to do is iterate over our episodes oh let me do this for you so before we do that I'll comment this out but if you want to save the output you do this env Eagles rappers dot monitor passing the environment wherever you want to save it lunar lander and then you want to use a lambda function to tell it to render on every episode and this force equals true I believe that just tells it to overwrite if there's already data in the directory so at the top of every episode just print out what episode number you're on and the current score you set you're done flag and for subsequent episodes you'll want to reset the score reset your environment into play and episode first thing you want to do is select an action and that takes the observation as input so we need to get the new observation reward done flag and info by stepping through the environment once that's done you want to store the transition set the old observation to be the new one so that you select an action based on the newest state and keep track of the reward you received at the end of every episode you want to append the score to the score history and perform our learning operation you also want to save a checkpoint after every operation ever after every learning operation and then when you're done dictate some file name dot PNG and call my super secret plot learning function that just takes the score history file name and a window that tells it over how many games you want to take the running average so now let's go to our terminal and see how many mistakes I made one second all right so here we are let's go ahead and give it a try so I made some kind of error in the policy gradient agent let's swing back to that file and see where it is one second so here's the model it does not have input dims that's because I forgot to keep track of it save go back to our terminal and see how it does and I apparently called something I should not have this is line 27 says self thought label placeholder got an unexpected keyword argument named label ah that's because it is called name that's what happens when I try to type and talk at the same time so let me make sure I didn't call L one size L to size something different no I did not all right I will run that again so now it's unhappy about the choice of oh of course so it's looser there's no dash in there get rid of that typos galore tonight oh really oh really ah it's store transitions yes so it's actually store transitions so let's call it that and there we go perfect so now it's actually learning I'm not gonna sit here and wait for this to do 2500 games because I've already run this once before that's how I got the code II saw at the beginning so I'm gonna go ahead and show you the plot that it produces now so you can see how it actually learns so by the end it gets up to about an average reward of around 200 and above points that's considered salt if you check the documentation on the open a item so congratulations we have solved the lunar lander environment with the policy gradient algorithm we relatively simple just a say deep neural network that calculates the probability of the agent picking an action so I hope this has been helpful go ahead and leave a comment down below feel free to take this code from my github for kit make it better do whatever you want with it I look forward to seeing you all in the next video welcome back everybody to a new reinforcement learning tutorial in today's episode we're going to to play space invaders using the policy grading method let's get to it we're imports we start with the usual suspects numpy in tensorflow we start by initializing our class for the policy gradient agent we take the learning rate discount factor number of actions number of layers for the fully connected layer the input shape channels a directory for our checkpoints very important as well as a parameter to dictate which GPU we want tensorflow to use if you only have one GPU or using the CPU you don't need that parameter save the relevant parameters and compute the action space which is just a set of integers we want to subtract out the input Heights and width from the input shapes and keep track of the number of channels for use later our agents memory will be comprised of three lifts that keep track of the state action and rewards we want a configuration which just tells tensorflow which GPU we want to use as well as our accession graph that uses that config we call the build Network function and then use the TF global variables initializer we need to keep track of a saver and a checkpoint file for saving and loading the model later now if you don't know anything about policy grading methods don't worry we're gonna cover everything you need to know as we go along the first thing we're gonna need is a convolutional neural network to handle image pre-processing that'll be connected to a fully connected layer that allows the agent to estimate what action it wants to take in contrast to things like deep Q learning policy gradient methods don't actually try to learn the action value or value functions rather policy gratings try to approximate the actual policy of the agent let's take a look at that for our build Network function we're going to construct a convolutional neural network with a few parameters an input placeholder that has shaped a batch size by input height and width as well as the number of channels we will also need an input for the labels which just correspond to the actions the agent takes that has shaped batch size and a factor G which is just the discounted future rewards following a given time step that has shaped back size our convolutional neural network is going to be pretty straightforward if you've seen any of my videos you can kind of know what to expect the first layer has 32 filters a kernel size of 8 by 8 and astride of 4 and we're going to want to use an initializer for this I found that the Xavier initializer works quite well the purpose of it is to keep the to initialize all the parameters in such a way that the they the network doesn't have one layer with parameters is significantly larger than any other we want to do batch normalization I mistyped the epsilon there should be one by ten to the minus five and you want to of course use a value activation on the first convolutional layer in that CommVault that activated output serves as the input to the next layer with 64 filters kernel size of four by four and astride of two and again we'll use the Xavier initializer of course we also want to do batch normalization on this layer and at this point I think I actually get the correct epsilon for the batch normal and of course we want to use a rel u activation on that batch normed output as well our third convolutional layer is more the same 2d convolution with a hundred twenty eight filters it will have a kernel size of two by two a stride of one and we'll also use the same initialization again batch normalization with an epsilon of 1 by 10 to the minus 5 activate with our Lu next we have to take into account our fully connected layers so the first thing we need to do is flatten the output of the convolutions because they come out as matrices we need a list go ahead and make the first fully connected layer using FC one as the number of units ryu activation and the second dense layer is going to have units equal to the number of actions for the agent which in this case is just 6 notice that we don't activate that but we have a separate variable for the actions which is the activated output of the network and we're gonna use a soft Mac so that way we get two probabilities that add up to one need to calculate the negative log probability with a sparse softmax cross-entropy function with log it's using dense to is our log 'its and labels as the actions as our labels the loss is just that quantity multiplied by the expected feature rewards and of course we want to reduce that quantity now for the training operation we're going to use the rmsprop optimizer with a set of parameters I found these to work quite well this algorithm is pretty finicky so you may have to play around the next thing we have to do is code up the action selection algorithm for the agent and plessy gradient methods we're trying to actually approximate the policy which means we're trying to approximate the distribution by which the agent chooses actions given it's in some state s so what we need is a way of computing those probabilities and then sampling them sampling the actions according to those probabilities we choose an action by taking in an observation reshaping it of course this will be a sequence of frames for hi and you want to calculate the probabilities associated for each action given that observation and then you want to sample that probability distribution using the numpy random choice function next up we have to take care of the agency so we're gonna store the observation action and reward and the agents lists using a simple append function so one big problem we're gonna have to solve is the fact that policy gradient methods are incredibly sample inefficient these Monte Carlo methods meaning that at the end of every episode the agent is learning so it throws away all of the experience it required in Prior episodes so how do we deal with that well one way to deal with that is to actually queue up a batch of episodes and learn based on that batch of experiences the trouble here is that when we take the rewards that follow any given time step we don't want to account for rewards following the current episode so we don't want rewards from one episode spilling over into another so we have to take care of that here next up we handle the learning for the agent we want to we want to convert the state action and reward memories into arrays so that we can feed them into the numbers use me tensorflow learning function into these sorry the tensor flow graph we have to start by reshaping the state memory into something feasible and then we can calculate the expected feature rewards starting from any given state so what we're gonna do is iterate over the entire memory and take into account the rewards the agent receives for all subsequent time steps we also need to make sure that we're not going to take into account rewards from the next episode next up we have to scale the expected feature rewards this is to reduce the variance in the problem so this makes sure we don't have really really large rewards so everything's just kind of scaled next up we call the training operation with an appropriate feed dict of the state memory action memory as labels and the G for our G variable finally since we're done we're going to clear out the agents memories finally we just have some bookkeeping functions to load and save checkpoints you just call the saver restore function that loads the checkpoint file into the session and the save checkpoint just dumps the current session into the checkpoint file another problem we have to solve is that the agent doesn't get a sense of motion from only a single image right if I show you a single image you don't know if the aliens are moving left or right or really you don't know which direction you're moving so we have to pass in a sequence of frames to get a sense of motion for our agent this is complicated by the fact that the open anti Jim Atari library in particular returns a set of frames that are repeated so if you actually cycle through the observations over time you'll see that the frames change I sorry don't change based on an interval of one two or three so we had to capture enough frames to account for that fact as well as to get a overall sense of movement so that means four is going to be our magic number four so for stacking frames next we move into the main program we import Jim numpy our model as well as the plot learning function which is a simple utility you can find on my github as well as the wrappers to record the agents gameplay if you so choose we need to pre-process the observation by truncating it and taking the average next up we stack the frames so at the beginning of the episode the stack frames will be none so you want to initialize an empty array of zeros and iterate over that array and set each of those rows to be the current observation otherwise what you want to do is you want to pop off the bottom observation shift everything down and put the fourth spot or the last spot to be the current observation down in the main function we have a checkpoint flag if you want to load a checkpoint we want to initialize our agent with this set of hyper parameters I found these to work reasonably well you can play around with them but it is incredibly finicky so your mileage may vary we need a file name to save our plots we'll also want to see if we want to load a checkpoint next we initialize our environment space invaders of course keep track of a score history score number of episodes and our stack size of 4:1 iterate over the number of episodes resetting the done flag at the top of each episode we also want to keep track of the running average score from the previous 20 games just so we got an idea if it's actually learning every 20 games we're gonna print out the episode score and average score otherwise every other every other episode we're just going to print out the episode number and the score reset the environment and of course you have to pre-process that observation and then go ahead and set your stack frames to none because we're at the top of the episode and then call the stack frames function so that we get four of the initial observation so the scored is zero and start iterating over the episode so this first episode choose an action based on that set of stack frames go ahead and take that action and get your new state action and reward go ahead and pre process that observation so that way you can stack it on the stack of frames next up you have to take care of the agent's memory by storing that transition and finally you can't increment your score or save the score at the end of the episode and every 10 games are gonna handle learning and saving a checkpoint and when you're all done go ahead and plot the learning now the agent is done we can take a look at the actual plot it produces over time this is how you know an agent is actually learning what you'll see is that there is some increase in the average reward over time you'll see oscillations up and down and that's perfectly normal but what you want is an overall upward trend now for this particular set of algorithms it is notoriously finicky with respect to learning rates I didn't spend a huge amount of time tuning them or playing with them I just wanted to get something good enough to show you guys how it works and turn over to your capable hands for fine tuning but what you do see is a definite increase over time as his average reward improves by about a hundred points or so that isn't going to win any awards but it is definitely at a clear and unequivocal sign of learning so there you have it that was policy gratings in the Space Invaders environment from the opening I Jim I hope you learn something make sure to check out this code on my github you can fork it you can copy it you can do whatever you want with it make sure to subscribe leave a comment down below if you found this helpful I look forward to seeing you all in the next video welcome back everybody to neural net dot AI I am your host Phil Taber previously a subscriber asked me hey Phil how do I create my own reinforcement learning environment I said well that's a great question I don't have time to answer it in the comments but I can make a video so here we are what we're going to do in the next two videos is create our own open AI Jim compliant reinforcement learning environment the grid world it's going to be text-based and if you're not familiar with it the grid world is aptly named a grid of size M by n where the agent starts out in say the top left and that's to navigate its way all the way to the bottom right the twist on this is going to be that there will be two magic squares that cause the agent to teleport across the board the purpose of doing this is to create a shortcut to see if the agent can actually learn the shortcut kind of interesting the agent receives a reward a minus one with each step except for the terminal step or receives a reward of 0 therefore the agent will attempt to maximize its reward by minimizing the number of steps it takes to get off the grid world what other things to other concepts we need are the concept of the state space which is the set of all states – the terminal state and the state space plus which is e set of all states including the terminal State this gives us a bit of a handy way to find out the terminal state as well as to find out if we're attempting to make illegal moves it also follows the nomenclature and terminology from the Sutton Bardot book reinforcement learning which is an awesome resource you should definitely check out if you have not already so in part 1 we're gonna handle the environment and in part 2 we're going to get to the main loop and the agent for which we will use cue learning not deep cue learning because this is a very straightforward environment we don't need a functional approximation we just need the tabular representation of the agents estimates of the action value function so if you're not familiar with q-learning I do have a couple videos on the topic one where the q-learning agent solved the card poll game as well as a an explainer type video that talks about what exactly q-learning is so let's go ahead and get started we only have a couple dependencies we're not doing anything on the GPU so just numpy and Matt plot line we want to close everything up into a class called grid world and our initializer will take the M&N which is the shape of the grid as well as the magic squares so we represent our grid as an array of zeros and shape and by n we want we want to learn to type we want to learn to sorry we can want to keep trying we want to keep track of the M and the N for handy use later so let's go ahead and define our state space and that's just going to be a list comprehension for all the states in the range self dot M times self dot n now the as I said the state space does not include the terminal state and the terminal state is the bottom right so we have to go ahead and remove or pop off that particular state from the list next up so now let's go ahead and again learn to type go ahead and define our state space plus also we need to know the the way that the actions map up to their change on the environment so we'll call that the action space a little bit of a misnomer but we can live with it for now so moving up will translate the agent up one row which is distance M and moving down will advance the agents position downward by also M moving the left will translate the agent one step will decrement the agents position by one and moving right will increase it by one we also want to keep track of the set of possible actions you could use the keys in the action space dictionary but let's go ahead and use a separate structure and we use a list up down left and right the reason is that the q-learning agent ik you learning algorithm sorry can it can choose actions at random so it's handy to have a list from which you can choose randomly next we need to add the magic squares because that's a little bit more complicated than it may seem and finally when we initialize the grid we want to set the agent to the top-left posy Shaun let's go ahead and add those magic squares so of course you want to store that in our object now there's a little bit of HO keenest that I must explain so the agent is represented by a zero when we print out the grid to the terminal and MD squares are represented by a one and so excuse me and so that means we need something other than zero and one to represent these magic squares I want to when I render the environment I want to know where the entrance and where the exit is so we use different values for the entrance and exit so that way we can render it correctly so just by royal decree we set I which will be the representation of the of the magic square in the grid world 2 2 to start and I'm gonna go ahead and iterate over the magic squares so now what we need to know is what position we are in so you sorry its if the color is off indicating something is wrong I have screwed something up royally which I do not see because I am blind anyway so the the exposition is just going to be the floor of the current square and the number of rows and Y will be the modulus of the number of columns so then the grid we want to set that x and y position to I and since we want to have a different representation for the entrance and exits go ahead and set the increment I by 1 recall that the magic squares are is represented as a dictionary so we're iterating over the keys and the values are the destiny Asians so the keys are the source values our destinations so next we want to find out precisely that what the destinations are so that in and then set the grid that Square to I and then increment hi again and I'm only gonna do I'm only going to do two magic squares you can do any number but in this case we're just gonna do two so okay so the next thing we to know is if we are in the terminal state and as I said earlier the state space and state space plus concepts give us a very easy way of doing that so let's go ahead and take care of that so since the state space plus is all the states and the states space is all the states – the terminal state we know that the difference between these two sets is the terminal state so state in state space plus and not in this state space how does that look you scroll down a bit okay so next up let us go ahead and get the agent row and column and we're going to use the same logic as above so next we want to set the state so that we'll take the new state as input and we're going to go ahead and assume that the new state is allowed so the agent if it is along the left edge and attempts to move left it just receives a reward of minus one and doesn't actually do anything likewise if it's on the top row and attempts to move up it doesn't actually do anything it just gets a reward of minus one for wasting its time so we want to get sorry the row and column and set that space to 0 because 0 denotes an MB square and the agent position then is the new state and again we want to get the new X and new Y there is a typo there let's fix that and then set that position to 1 because that is how we represent the agent by royal decree so the next thing we have to know is if we're attempting to move off the grid that's not allowed the agent can only stay on the grid so let's take care of that so we want to take the new and old States as input and the first thing we wanna know is if we're attempting to move off the grid world entirely that's so I hate these editors so if we are the new state is not in the new state space plus we are attempting to move off the grid so you return true otherwise if the old state modulus M equals 0 and new state modulus this self dot M equals self dot M minus 1 then we return true and for brevity and I could explain this but the video is running long already so for brevity the reason this is true is left as an exercise to the reader bet you didn't know this was gonna be like a college course so now basically we're trying to do here I'll just give you a hint what we're trying to do here is determine if we're trying to move off the grid either to the left or to the right we don't want to wrap around so so if you're for instance if we have a 9 by 9 grid it goes from 0 to 8 so then if you add 1 right you would get 9 which would teleport you to the other row and the 0th column you don't want that what you want to do is waste waste to move and receive a reward of minus 1 so that's what we're doing here old state modulus good grief so if neither of those are true then you can go ahead and return false meaning you're not trying to move off the grid so let's see can you see that you can so next function we need is a way to actually step so let's go ahead and do that let's say the only thing we need is to take in the action so the first thing you want to do is get the X&Y again and here's where we're going to check to make sure it's a legal move so the resulting state is then agent position plus the mapping so the agent position is whatever it is and recall that the action space is this dictionary here that map's the actions to the translations in the grid so we're doing down here then is saying the new state is equal to the current state plus whatever the resulting translation is for whatever action we're attempting to make so next thing we need to know is are we on a magic square and if we are and if we are then the agent teleports to its new position okay so next up we need to handle the reward so it's minus one if not is terminal state so it's minus one if we haven't transitioned into the terminal state otherwise it is 0 if we're not trying to move off the grid then we can go ahead and set that state so if not set state salting state and then we're ready to go ahead and return so in the open the I Jim whenever you take a step it returns the new state the reward whether or not the game is over and some debug information so we're gonna do the same thing resulting state reward and the whether or not it is the terminal state and our debug info is just going to be none so if we are attempting to move off the grid what do we want to do nothing so we want to return agent position and the reward and rather not it's terminal and the null debug info we're almost there so next thing we need to know is how do we reset the grid because at the end of every episode we have to reset right first thing to do is set the agent position to zero if you set the grid to zeros and go ahead and add the magic squares back in and return the agent position which is of course zero oh wow that's real close all right so next up one last function I swear just one more I promise all right I wouldn't lie to you all right next up we want to provide a way of rendering because hey that's helpful for debug I like to print a whole big string of you know dashes because it's party we want to iterate over the grid for column in row column equals zero in other words if it's an empty square we're just going to print a dash and we're going to end it with a tab if the column is one meaning we have an agent there we're going to print an X to denote the agent now if the column is two then that is one of the entrances to our magic squares so print the a in with a tab delimited tab end if the column equals three you can print a alt and equals tab and if the column equals 4 then we know we're at the other magic square entrance and finally if it's 5 then we know we're at the other magic squares exit after each row we want to print a new line and at the end we'll go ahead and print another chunk of pretty dashes oh yeah that's it that is it okay so that is it for our agent class that only took how long 20-some minutes Wow okay I hope you're still with me basic idea here is to make your own environment you need and initialize a reset a state space state space plus a way to denote possible actions a way to make sure the move is legal and a way to actually effect that environment the step function needs to return the new position the reward whether or not the state is the new state is terminal as well as some debug information you also need a way of resetting and printing out your environment to the terminal so in part two we're actually gonna fire this baby up with aq learning algorithm and see how she do that's actually quite exciting it moderately exciting anyway it actually learns it does quite well and it does find the magic square spoiler alert if you made it this far it finds a magic square and gets out of the minimum number of moves required it's pretty cool to see so that will come in the next video on redness day I hope to see you all then if you liked the video make sure to leave a thumbs up subscribe if you have not already for more reinforcement learning content and I will see you all in the next video welcome back everybody to a new tutorial from neural net a eye I'm your host Phil Taber if you're new to the channel I'm a physicist former semiconductor process engineer turned machine learning practitioner if you haven't subscribed yet go ahead and hit the subscribe button so you don't miss any future of reinforcement learning tutorials when we left off in our previous video we just finished up the bulk of our opening IgM compliant reinforcement learning environment today we're going to go ahead and code up aq learning agent and the main loop of the program to see how it all performs so let's go ahead and get started so the first thing we are going to need is the is the magic squares right you recall the magic squares are the teleporters in the grid world that either advance the agent forward or backward so the first one is going to be at position 18 and dump out at position 54 so it'll move it forward and the next one will be at let's say 63 and dump out a position 14 so teleporter a will advance the agent through the grid world and teleporter B will send it back to an earlier square so we need to create our grid world we use a nine by nine grid and pass in the magic squares we just created next up we have to worry about the model hyper parameters so if you are not familiar with that let me give you a quick rundown these are the parameters that control how fast the agent learns and how much it chooses to value the potential future rewards so the first parameter is alpha that is our learning rate zero point one a gamma of one point zero tells us that the agent is going to be totally farsighted it will count all future rewards equally an epsilon of one point zero this is of course the epsilon for Epsilon greedy action selection so it will start out behaving pretty much randomly and eventually converge on a purely greedy strategy so a Q learning is a tabular method where you have a table of state in action pairs and you want to find the value of those state action pairs so to construct that we have to iterate over the set of states and actions state space Plus and you have a doubt possible actions and you have to pick something for an initial value it's really arbitrary but the cool thing about picking 0 is that we're using something called optimistic initial values well this means is that since the agent takes or receives a reward of minus 1 for every step it can never have a reward of 0 right because there's some distance between the agent and the exit so by setting the initial estimate at 0 you actually encourage exploration of unexplored states because if the agent takes a move it realizes oh I get a reward of minus 1 that's significantly worse than 0 let me try this other unexplored option so over time and will gradually explore all the available actions for any given state because it's been disappointed by all the stuff that is previously tried just a fun little fact we want to play 50,000 games we need a way of keeping track of our rewards numpy array will do just fine yes so now let's iterate over the total number of games and I like theif I like to print out a marker to the terminal so that way I know it's actually working so every 5,000 games just print they were starting the eye thing at the top of every episode you want to reset your done flag you want to reset your episode rewards so you don't accumulate rewards from episode to episode and of course you want to reset your environment oh let me scroll down here here we go you want to reset your environment just as you would with any open agent type problem next up we begin each episode so while not done we want to take a random number for our epsilon greedy action selection so we're going to just make use of this max action before we define it Q observation and in V dot possible actions and what that will do is we're gonna write it here momentarily but what it's gonna do is it is going to find the maximum action for a given state so the random number is less than 1 minus Epsilon we want to do that can you guys see that yep otherwise we want to take a random sample of the action space so we have to write these two functions let's do that quite quickly so the action space sample is pretty straightforward we can just return a random choice from the list of possible actions and that's the reason we chose a list as that data structure just to make it easy next up we have the max action max yeah max action function but that doesn't need to belong to the class that takes the cue the state and the set of possible actions we want to take a numpy array of the estimates agent sorry the agents estimate of the present value of the expected future rewards for the stated stand in all possible actions a in actions and then we want to find the maximum of that and that's just an index so we want sorry that's just an index so we want to return the action that that actually corresponds to all right that's all well and good do we need more space let's do a little bit more there we go so next we want to actually take the action so we get our new state observation underscore reward done and info yeah maybe that's step action and next up we have to calculate the maximal action for this new state so we can insert that into our update equation for the Q function so let's do that and we're not worried about epsilon greedy here because we're not actually taking that action next up we have to update our Q function for the current action and state and that's where our alpha comes in word plus make sure that is visible to you reward plus some quantity which is gamma our discount factor times Q observation underscore action underscore so the new state and action – hue observation action let me tab this over there we go nice and complying to the pet style guides right mostly okay so this is the update equation for the Q function that will update the agents estimates of the value of the current state and action pair next up we just need to let the agent know that the environment has changed states so you set observation to observation underscore and that is it for Q learning in a nutshell folks that's really that straightforward so the end of each episode we want to decrease epsilon so that way the agent eventually settles on a purely greedy strategy you can do this a number of ways you can do it you know with a square root function a log function I'm just gonna do it linearly it's not that critical for something like this so it's going to decrease it by 2 divided by number every game so about halfway through it'll be purely greedy and at the end of every episode you want to make sure you're keeping track of the total rewards which is something I forgot up here yeah so one thing I did forget is to keep track of the total reward for the episode don't forget that very important and at the end of all the episodes you want to plot the total rewards and that is it for the coding portion oh one of the things I take it back so let's scroll up here I do want to show you the environment so let's just do that and V dot render and the purpose of doing that is so that you can see how many moves it takes the agent to escape from the grid world that will tell us if it's doing good or not right because there is animal number of moves it takes to escape so I'm gonna fire up the terminal and go ahead and get that started one second and here we are in the terminal let's go ahead and run that and you can see here that we start at the top left so it takes one to moving to a out as free so three four five six seven eight nine ten eleven sorry eleven and the twelfth move is free because it's the exit to the maze sorry through to the grid world so total reward of minus 11 is the best the agent can't possibly do it's gone ahead and plotted so let's check that out and here is the plot and you can see that indeed the agent starts out rather poorly exploring finding suboptimal routes through the grid world but eventually and about half way through here at 2500 or so sorry 25,000 you can see that it settles on alisa constant let's prove that it is the maximum value of minus eleven and you can see that it's ten point nine seven that is close enough for government work so you see that the agent is able to actually solve the maze sorry the grid world I keep calling it a maze it's able to solve the grid world using the cube learning algorithm now this isn't surprising you know we would expect this what's novel here is that we have created our own reinforcement learning environment that uses a very similar format to the opening I gym so anytime you want to create a new environment you can use you can fire this video up and use the you know set of code here just as a template for your own projects I'm gonna put this up on my github I'll link that down below and I'm also gonna write up a tutorial in text form and upload it to neural net dot a I I don't know if all that done tonight I'll update the description with it and that's if you are a you know if you consume text more easily than that video then you can go ahead and check that out I hope this has been helpful make sure to leave a comment subscribe if you haven't already and I hope to see you all in the next video welcome back data manglers thanks for tuning in for another episode from neural net a I if you're new to the channel I'm Phil Taber a physicist and former semiconductor engineer turned machine learning practitioner I'm on a mission to teach the next generation of data engineers so we can stay one step ahead of our robot overlords you're not subscribed be sure to do that now so you don't miss any future reinforcement learning content we've touched on reinforcement learning many times here on the channel as it represents our best chance at developing something approximating artificial general intelligence we've covered everything from Monte Carlo methods to deep Q learning to policy gradient methods using both the pine torch and tensor flow frameworks what we haven't discussed on this channel is the what and the how of reinforcement learning that oversight ends today right now okay maybe a few seconds from now but either way we're gonna cover the essentials of reinforcement learning but first let's take a quick step back you're probably familiar with supervised learning which has been successfully applied to fields like computer vision and linear regression here we need mountains of data all classified by hand just to train a neural network while this has proven quite effective it has some pretty significant limitations how do you get the data how do you label it these barriers put many of the most interesting problems in the realm of mega-corporations and this does us the individual practitioners no good to top it off it's not really intelligence you and I don't have to see thousands of examples of a thing to understand what that thing is most of us aren't actively by doing sure we can shortcut the process by reading books or watching youtube videos but ultimately we have to get our hands dirty to learn if we abstract out the important concepts here we see that the essential stuff is the environment that facilitates our learning the actions that affected environment and the thing that does the learning the agent no jacket our labels required enter reinforcement learning this is our attempt to take those ingredients and incorporate them into artificial intelligence the environment can be anything from text-based environments like card games to classic Atari games to real to the real world at least if you're not afraid of Skynet starting an all-out nuclear war that is our AI interacts with this environment through some set of actions which is usually discrete move in some direction or a fire at the enemy for instance these actions in turn caused some observable change in the environment meaning the environment transitions from one state to another so for example in the space invaders environment in the opening I Jim attempting to move left cause the agent to move left with 100% probability that need not be the case though in the frozen lake environment attempting to move left can result in the agent moving right or up or down even so just keep it in mind that these state transitions are probabilistic and their probabilities don't have to be 100% merely their sum the most important part of the environment is the reward or penalty the agent receives if you take only one thing away from this video it should be that the design of the reward is the most critical component of creating effective reinforcement learning systems this is because all reinforcement learning algorithms seek to maximize the reward of the agent nothing more nothing less in fact this is where the real danger of AI is it's not that it would be malicious but that it would be ruthlessly rational the classic example is the case of an artificial general intelligence his reward is centered around how many paper clips it turns out sounds innocent right well if you're a paper clip making bot and you figure out that humans consume a bunch of resources that you need to make paper clips then those pesky humans are in the way of an orderly planetary scale office that's problematic for all involved what this means is we must think long and hard about what we want to reward the agent for and even introduce penalties for undertaking actions that endanger human safety Ellison systems that will see action in the real world perhaps less dramatic although no less important are the implications for introducing inefficiencies in your agent consider the game of chess you might be tempted to give the agent a penalty for losing pieces but this would potentially prevent the agent from discovering gambits what sacrifices a piece for a longer term positional advantage the alpha 0 engine a chess-playing artificial intelligence is notorious for this and will sacrifice multiple ponds and yet still dominate the best traditional chess engines we have to offer so we have the reward the actions and the environment what are the agent itself the agent is the part of the software that keeps track of these state transitions actions and rewards and looks for patterns to maximize its total reward over time the algorithm that it's how the agent will act in any given situation or state of the environment is called its policy it is expressed as a probability of choosing some action a given the environment is in some state s please note these probabilities are not the same as the state transition probabilities the mathematical relationship between the state transitions rewards and the policy is known as a bellman equation and it tells us the value meaning the expected future reward of a policy for some state of the environment reinforcement learning often though not always means maximizing or solving that bellman equation more on that in future videos this desire to maximize reward leads to a dilemma should the agent maximizes short-term reward by exploiting the best known action or should it be adventurous and choose actions whose reward appears smaller or maybe even unknown this is known as the Explorer exploit a lemma and one popular solution is to choose the best known action most of the time and occasionally choose a suboptimal action to see if there's something better out there this is called an epsilon greedy policy when we think of reinforcement learning we're often thinking about the algorithm the agent uses to solve the bellman equation these generally fall into two categories algorithms that require a full model of their environment and not rhythms that don't what does this mean exactly to have a model of the environment as I said earlier actions caused the environment to transition from one state to another with some probability having a full model of the environment means knowing all the state transition probabilities with certainty of course it's quite rare to know this beforehand and so the algorithms that require a full model are somewhat limited utility this class of algorithms is known as dynamic programming if we don't have a model or a model of the environment is incomplete we can't use dynamic programming instead we have to rely on the family of model free algorithms one popular such algorithm is Q learning or deep Q learning which you studied on this channel these rely on keeping track of the state transitions actions rewards to learn the model on the environment over time in the case of q-learning these parameters are saved in a table and in the case of deep queue learning the relationships between them are expressed as an approximate functional relationship we just learned by a deep neural network that's really all there is at least at a high level so to recap reinforcement learning is a class of machine learning algorithms that help an autonomous agent navigate a complex environment the agent must be given a sequence of rewards or penalties to learn what is required of it the agent attempts to maximize this reward over time or in mathematical terms to solve the bellman equation the algorithms that help the agent estimate future rewards fall into two classes those that require we know the state transition probabilities for the environment beforehand and those that don't since knowing these probabilities is a rare luxury we often rely on model free algorithms like deep Q learning you'd like to know more please check out some of the other videos on this channel I hope this has been helpful please leave a comment I like and subscribe if you haven't already I look forward to seeing you all in the next video welcome back did the free reinforcement learning course from neural net dot AI I'm your host Phil Taber if you're not subscribed be sure to do that now and hit the bell icon so you get notified for each new module in the course in module one we covered some essential concepts in reinforcement learning so if you haven't seen it go ahead and check it out now so this module makes more sense if you have seen it you may remember that reinforcement learning basically boils down to an agent interacting with some environment and receiving some rewards in the process these rewards tell the agent what's good and bad and the agent uses some algorithm to try to maximize rewards over time in practice what we get is a sequence of decisions by the agent and each decision doesn't just influence its immediate reward rather each decision influences all future rewards in mathematical terms we have a sequence of states actions and rewards that one could call a decision process if each state in this process is purely a function of the previous state and action of the agent then this process is called a Markov decision process our MDP for short these are an idealized mathematical abstraction that we use to construct the theory of reinforcement learning from any problems this assumption can be broken to various degrees how much that really matters is often a complicated question and one we're just going to dodge for now regardless in most cases the assumption that a process obeys the Markov property is good enough and we can use all the resulting mathematics for reinforcement learning problems by now I've said that a reinforcement learning agent seeks to maximize rewards over time so how does this fit into a Markov decision process from the agent's perspective it receives some sequence of rewards over time and as sequence of rewards can be used to construct the expected return for the agent then the return at some time step T is just the sum of the rewards that follow all the way up to some final time capital T this final time step naturally introduces the concept of episodes which are discrete periods of gameplay that are characterized by state transitions actions and rewards upon taking this final time step the agent enters some terminal state which is unique this means that no matter how we end the episode the terminal state is always the same no future rewards follow after we reach the terminal state so the agents expected reward for that terminal state is precisely 0 with a bit of creativity we call tasks that can be broken into episodes episodic tasks of course not all tasks are episodic many are in fact continuous this is a bit of a problem since if the final time step is at infinity the total reward could also be infinite this makes the concept of maximizing rewards meaningless so we have to introduce an additional concept the fix and for both episodic and continuing tasks is the idea of discounting this basically means the agent values future rewards less and less this discounting follows a power law where each time step results and more and more discounting this hyper parameter gamma is called the discount rate and even no doubt seen this before in our videos on reinforcement learning if you use this form for the expected return and do some simple factoring you derive a really useful fact there is a recursive relationship between rewards at subsequent time steps this is something we'll exploit constantly in reinforcement learning so we have an agent that is engaged in some discrete processes receiving rewards and trying to maximize its expected feature returns if you remember from the first lecture the algorithm that determines how the agent is going to act is called its policy since the agent has a set of defined rules for how its going to act in any given state it can use a sequence of states actions and rewards to figure out the value of any given state the value of a state is the expected return when starting in that state and following the policy it's given formally by the following equation in some problems like say Q learning we're more concerned with maximizing the action value function which tells the agent the value of taking some action while in some given State and following the policy they're after remember how I said we can exploit the recursive relationship between subsequent returns well if you plug that into the expression for the value function we actually discover that the value function itself is defined recursively this is called the bellman equation from the first module and this is the quantity many algorithms seek to maximize the bellman equation is really an expectation value as it's a weighted average of how likely each particular sequence of states actions and rewards is given the state transition probabilities and the probability of the agents selecting that action much of the following material will involve coming up with various schemes to solve the bellman equation and evolve the policy and such a way that the value function increases over time in the next module we'll take a look at the explore exploit dilemma which is the expression of the trade-off between long and short term rewards I hope this has been helpful questions comments suggestions leave them below I read and answer all my comments if you made it this far consider subscribing so you get notified in the rest of the course drops I look forward to seeing you in the next video welcome to module three of the free reinforcement learning course from neural net dot AI I'm your host Phil Taber if you're not subscribed make sure to do that now so you don't miss the rest of the course in the previous video we learned about a special type of process called the Markov decision process there each state depends only on the previous state and the action taken by the agent this leads to the recursive relationship between the agents estimate of returns at successive time steps this relationship extends to the agents estimate of the value function which is given by the bellman equation as we covered in module 1 reinforcement learning for the most part boils down to maximizing this value function however it's not always so simple surprise surprise just like you and I have trade offs in real life reinforcement learning agents are faced with similar considerations to the agent take the action that it knows will immediately provide the most reward or should it explore other actions to see if it can do better this conundrum is known as the Explorer exploit dilemma and every reinforcement learning algorithm has to deal with this fortunately there are many solutions and we'll cover some of them here one such solution is the idea of optimistic initial values when the agent starts playing the game it has to use some initial estimate for the value or action value function this estimate is totally arbitrary but if we know something about the reward structure beforehand we can actually initialize it in such a way as to encourage exploration suppose we have an environment like our grid world in the video on creating our own reinforcement learning environment in that environment the agent receives a reward of minus one for each step and so the expected returns are always negative or zero no matter of the state of the environment or the action the agent takes so what would happen if we tell the agent that the value of all the state action pairs are positive or even zero on the first move the agent picks some action randomly because all the actions all look identical it receives a reward of minus one and updates his estimates accordingly so it's a bit disappointed it was expecting chocolate cake and got a mud pie the next time it encounters that state it will take a different action because the other actions have an estimate of zero reward for that state which is better than the negative reward it actually received this means that the agent ends up exploring all the state action pairs many times as each update makes the agents estimate more and more accurate we never had to explicitly tell the agent to take exploratory actions because it's greed drove it to take exploratory actions after it became disappointed with whatever action it just took again this is called optimistic initial values another feasible solution is to spend some portion of the time choosing random actions and the majority of the time choosing greedy actions this is called an epsilon greedy strategy and it's the one we employ the most it's quite robust as we can change the random parameter over time so that the agent converges on to a nearly pure greedy strategy the proportion of the time the agent spends exploring is a hyper parameter of the problem and we typically call it epsilon one potential strategy is to start out completely randomly and then use some decay function to gradually increase the proportion of ridi actions the agent takes the form of this function isn't critically important it can be linear a power law or really any other function whether or not the agent converges to a purely greedy strategy is going to depend on the problem for simple environments like the grid world where we know the optimal solution beforehand it makes quite a bit of sense converged to a purely greedy strategy however with a game like space invaders a popular environment from the open AI Jim there are so many variables that it's hard to be sure the agent has settled on the truly optimal strategy the solution there is to leave Epsilon it's so small but finite value so the Asian is occasionally taking exploratory actions to test its understanding of the environment all of this discussion has made a very important assumption we've assumed the agent only uses a single policy the agent uses both the same policy to update this estimate of the value function as well as to generate actions there's no rule this has to be the case in fact an agent can leverage two policies it can use one policy to generate actions and then use the data that generates to update the value function for some other policy this is called off policy learning and this is precisely what we use in Q learning the agent uses some epsilon greedy strategy to generate steps in the Markov chain which is the sequence of state action rewards and resulting states and then uses that data to update the estimate of the action value function for the purely greedy action in effect we're using an epsilon greedy strategy to update our estimate of the purely greedy strategy let's just say this works quite well and it's something we'll come back to in later modules when we get to Monte Carlo methods and temporal difference learning that's it for now reinforcement learning agents seek to maximize their total reward but face a dilemma and whether to maximize current reward or take exploratory steps with suboptimal actions in the hope of optimizing long term rewards one solution is to bias the agents at initial estimates in such a way that it encourages exploration before settling on a purely greedy strategy another is suspense on proportion of the time exploring and the majority of the time exploiting the best known action and finally the agent can leverage two policies want to generate data and the other to update the estimate of the action value or value function in the next module we're going to get to dynamic programming class of model-based reinforcement learning algorithms make sure to subscribe so you don't miss the room of this course and I look forward to seeing you in the next video welcome back everybody to machine learning with Phil I am your host dr. Phil when we last touched on the opening idea we did Q learning to teach the cart pole robot how to dance basically how to balance the pole in this video we're gonna take a look at a related algorithm called salsa so they're related in the sense that they're both types of temporal difference learning algorithms the difference being that salsa is an on policy method and q-learning is an off policy method hey parents by the cat if you if you don't know what that means I highly encourage you to check out my course reinforcement learning in motion on Manning publications I go in-depth on all this stuff in that course enough plugging let's get back to it so the other cool thing is that it that sarsa as well as key learning are model free meaning that you do not need a complete model of your environment to actually get some learning done and that's important because there's many cases in which you don't know the full model of the environment what does that mean it means you don't know the state transition probabilities so if you're in some state s and take some action a what is the probability will end up in state s prime and get reward are those probabilities are not completely known for all problems and so algorithms that handle that uncertainty are critical for real-world applications another neat thing is that this is a bootstrap method meaning that it uses estimates to generate other estimates right so you don't need to know too much about the system to get started you just make some wild-ass guesses and you get moving let's take a look at the algorithm so your first step is to initialize your learning rate alpha and of course that's going to control the rate of learning how do you make adjustments to the Q function then you initialize the Q function the hue function is just the agents estimate of its discounted future rewards starting from a given state s and taking an action a and it may have some assumptions built in onto whether or not you follow some particular policy or not but that's the general gist so you need to initialize your state and choose some initial action based on that state using an epsilon greedy strategy from that function queue any loop over the episode taking the action getting your reward and your new state s prime choose an action a prime as a function of that state s prime using epsilon greedy from your Q function and then go ahead and update the Q function according to the update rule you see on the screen and then go ahead and store your state prime into s and your a prime into a and loop until the episode is done again in the course I go into many more details this is just quick and dirty a bit of a teaser video to get you guys interested in the course and to give you some useful information at the same time so with that being said let's go ahead and jump into the code I'm not gonna be doing typing on-screen but I will be showing you the relevant code as we go along and boom we are back in the code editor so here I am using Visual Studio code even on Linux this is a great editor if you're not using it I highly recommend it Adams a little bit buggy for me and of course sublime is now is nag where so go ahead and give it a look if you haven't already so we needed to find a function to take the max action and that takes as inputs the Q function as well as the state and you're just converting the the Q function into an array and to a numpy array free to action and that in that list and finding the Arg max of that now recall that in numpy the Arg max takes the returns the first element of a max so if you have two actions that are tied it'll give you the first one so of course in the carpol example our action space is just moving left and right right if you don't remember it's just a cart and slides along the x axis trying to keep a pole vertical of course this is a continuous space and the Q function is a discrete a discrete mathematical construct right so the states are discrete numbers and so you have to do a little trick here to discretize your space and so if you look in the documentation for the cart poll example you'll find the limits on these variables and you can use that to create a linear space based out of it based on those limits and divide it up into ten different buckets right so that way you get you go from a continuous representation to a discrete representation of your state space and then I define a small helper function to get the States based on the observation it just digitizes these it digitizes those linear spaces using the observation that you pass in from the open antigen and it returns a four vector that is a the buckets that correspond to the value of the element of the observation in the main program we want to use a small learning rate alpha 0.1 for a gamma something like 0.9 of course the gamma is the discount factor it's debatable whether or not you need it here so discounting in general is used when you don't know the we you don't know for a certain you're gonna get some reward in the future so it doesn't make sense to give it a hundred percent weight you could just as easily here use a one point zero because the state-transition functions in the car pool example are deterministic as far as I'm aware some if I'm wrong please someone correct me and of course the epsilon for the epsilon greedy we're going to start out at one point zero you'll see why here in a second and so you need to construct the set of states which of course just corresponds to the integer representations of our continuous space so you just have ranges from zero to zero to nine and you construct a four vector out of out of that right so you have 0 0 0 1 1 1 etc etc etc and initialize your Q function here I'm going to initialize everything as a zero right recall that we had to we could initialize it arbitrarily but for the terminal States you want that to be zero because again the value of the terminal state is 0 and a is 2 in the range of 2 because we only have two actions move left move right oops also I'm gonna run 50,000 games if you have a slower computer you might want to run fewer it takes quite a bit of time to run and I'm gonna track the total rewards as we go along so just a little helper line here to print out the the number of games you're playing that's always good to know where you are right you if it stops chugging along you want to know if it's broken or back they doing something useful so you get your initial observation by resetting the environment get your state and calculate a random number and so you take a maximum action if the random number is less than 1 minus epsilon so epsilon is starting out at 1 so if random is less than 0 otherwise randomly sample your action space done flag to false and your rewards for the episode to 0 then you loop over the episode until you're done and you go ahead and take the action a getting your reward and the new observation the state prime then is going to be the get State of the observation right these observation is a four-vector of continuous numbers that we had to transform into a set of discrete integers for vector of discrete integers then we go ahead and calculate another random number and choose another action based upon that and calculate sum up the total rewards and update the q function based on the update rule I gave you in the slides and of course set the state in action to the new the S Prime and a prime and after each episode you're going to decrease epsilon because you want this you don't want the epsilon to be permanently 1 right you want to encourage some amount of exploration and some amount of exploitation so epsilon has to be a function of time and just save your total rewards when it's all done its gonna go ahead and plot it out and you should see something similar to the following I'm going to go ahead and run that now and that is going to take a minute to run and so here you have the output of the source of algorithm after running 50,000 iterations so what you see is first of all the messy plot that's to be expected with 50,000 games and you're plotting every single point but what you've noticed immediately is that there's a general trend upward and when epsilon reaches its minimum epsilon goes to 0 and it does a fully exploitative strategy the algorithm actually does a really good job of hitting 200 moves most of the time recall that 200 moves is the 200 moves is the maximum number of steps for the carpool problem because good algorithms can get it to balance pretty much indefinitely so I would never terminate so the open a gym just terminates at 200 steps so anything close to that is pretty good now one thing that's interesting is that it does have a fair amount of variability it doesn't actually balance it 200 moves the entire time and there are a number of reasons for this perhaps you can speculate below I invite you to speculate my thought process is that the the way we have discretized in this space isn't sufficient to characterize the problem in such a way that the algorithm can learn something completely and totally useful so it just doesn't have enough information based on the ten thousand ten to the four yeah ten thousand states we've we've discretized it into and there could be other things that matter you know you could have other features for instance combinations of velocities and positions that matter so we could have under engineered the problem slightly but for just a quick little chunk of 170 lines of code or so it's actually quite good so any questions you should be sure to leave them below and hey if you've made it this far and you haven't subscribed please consider i'm gonna be releasing more and more content like this and do this full time now and i look forward to seeing you all in the next video oh and by the way in the next video we're gonna be taking a look at double q– learning which is yet another variation of these model free bootstrap methods see you then oh and one other thing if you want the code for this I'll leave the code I'll leave the link to my github now this is code for my course reinforcement learning in motion I'm just showcasing it here to show what you're gonna learn in the course so go ahead and click the link in the description and it'll take you to my github where you can find that code as well as all the code from the course hope you like it see you guys in the next video welcome back everybody to machine learning with Phil I am your host dr. Phil in yesterday's video we took a look at salsa in the opening I June getting the cart pole to balance itself as promised today we are looking at the algorithm of double Q learning also in the cart pole opening IgM environment so we touched on Q learning many many months ago and the basic idea is that Q learning is a model-free bootstrapped of policy learning algorithm what that means is model free it does not need to know it does not need the complete state transition dynamics of the environment to function it learns the game by playing it bootstrapped in that it doesn't need very many very much help getting started it generates estimates using its initial estimates which are totally arbitrary except for the terminal states off policy meaning that it is using a separate policy other than it is using a behavior policy and the target policy to both learn about the environment and generate behavior respectively now when you deal with problems that when you deal with algorithms that take a maximizing approach to choosing actions you always get something called maximization bias so say you have some set of states with many different actions such that the action value function for that states and all actions is zero then the agents estimate the Q capital Q of SN a can actually be well actually have some uncertainty to it and that uncertainty is actually a spread in the values right and that spread causes it to have some amount of positive bias and the max of the true values is 0 but the max of the capital Q the agents estimate is positive hence you have a positive bias and that can often be a problem in in reinforcement learning algorithms so this happens because you're using the same set of samples to max to determine the maximizing action as well as the value of that action and one way to solve this problem is to use two separate q functions to determine the max action and the value and you set up a relationship between them and then you alternate between them as you play the game so you're using one of them to determine the max action one of them to determine its value and you alternate between them so that you eliminate the bias over time that's double Q learning in a nutshell the algorithm is the following so you initialize your alpha and your epsilon where alpha is your learning rate epsilon is what you use for epsilon greedy you would initialize the Q 1 and Q 2 functions for all states and actions in your state and action space of course that's arbitrary except for the terminal States we must have a value of 0 and you loop over your set of episodes and you initialize your state and for each episode right each step within the episode choose an action from using your state using Epsilon greedy strategy in the sum of q1 and q2 so you have the two separate queue functions so if you're using single queue learning you would take the max action over just one queue but since you're dealing with two you have to account for that somehow right you could do a max you could do a sum you could do an average in this case we're gonna take the sum of the two queue functions take that option get your reward observe the news state and then with the 0.5 probability either update q1 or q2 according to this update rule here and of course at the end of the step go ahead and set your estate to the new state and keep looping until the game is done so clear as mud I hope so by the way if you want more reinforcement learning content make sure to subscribe hit the bell icon so you get notified let's get to it so next up we have our code and here we are inside of our code editor again am using Visual Studio code to take a look at our double Q learning script I'm not gonna be typing into the terminal I think that's probably a little bit annoying I'm just gonna review the code as we go along if you have seen my video on the Saros an algorithm there's gonna be a fair amount of overlap because we're solving the same set of problems over again the only real difference is in that video we just saw so to calculate the action value function and in this case we're using double Q learning again we have a max action function well this does is tells us the max action for a given state to construct that you make a numpy array out of a list that is for a given state in both actions and as we said in the video we're gonna take the sum of the q1 and q2 for a given state for both actions you want to take the arc max of that and recall in numpy the Arg max function if there is a tie returns the first element so if the left and right actions both have identical action value functions then it will return the left action consistently that may or may not be a problem it's just something to be aware of and once again we have to discretize the spaces recall that the carpool problem which is just the cart sliding along a track with a pole that is that must be maintained vertically right and the cart pole example we have a continuous space the X and theta can be any number within a given range and likewise for the velocities to deal with that we have a couple options we could simply use neural networks to approximate those functions but in this case we're going to use a little trick to discretize the space so we're going to divide it up into ten equal chunks and any number that falls within a particular chunk will be assigned an integer so you'll go from a continuous to a discrete representation of your for vector the observation along with that comes a get state observing state along that comes a get state function that you pass in the observation and it just uses those digitized spaces excuse me just uses those linear spaces to use the numpy digitize function to get the integer representation of the respective elements of your observation I've also added a function to plot the running average here I do this because in this R sub video we ended up with a little bit of a mess with fifty thousand data points this will plot a running average over the prior 100 games next up we have to initialize our hyper parameters our learning rate of 0.1 this just controls the step size in the update equation the gamma is of course the discount factor the agent uses in its estimates of the future rewards so I don't believe this should actually be 0.9 IU I left it here because it's not super critical as far as I'm concerned it should really be 1.0 and the reason is that the purpose of discounting is to account for uncertainties in future rewards if you have some sequence of rewards with a probability of receiving them then it makes no sense to give each of those rewards equal weight because you don't know if you're going to get them in the cart poll example the rewards are certain as far as I'm aware of the state transition probabilities are one you know that if you move right you're gonna actually end up moving right you know deterministically where the poll and the cart are going to move so it shouldn't be discounted as far as I'm concerned epsilon is just the epsilon factor for our for our epsilon greedy algorithm and that's pretty much it for hyper parameters of the model next up we had to construct our state space so what this means Oh baby's unhappy the state space is of course the the representation of the digitized space so we're gonna have for the cart position you're gonna have ten buckets the velocities ten buckets and likewise for the thetas thetas position and theta velocity so you're gonna have ten to the four possible states so ten thousand states and those are gonna be numbered all the way from 0 0 0 to 9999 that's all we're doing here is we're constructing the set of states next up we have to initialize our Q functions recall that the initialization is arbitrary except for the terminal state which must have a value of 0 the reason for this is that the terminal state by definition has a future value of 0 because you stop playing the game right makes sense you could initialize this randomly you could initialize it with minus 1 plus 1 doesn't really matter so long as the terminal state is 0 for simplicity I'm initializing everything at 0 I'm gonna play a hundred thousand games the reason is that this algorithm eliminates bias but at the expense of convergent speed so you have to let it run a little bit longer an array for keeping track of the total rewards and we're gonna loop over a hundred thousand games printing out every 5,000 games to let us know it's still running I always want to reset you're done flying your rewards and reset the episode at the top and you're gonna loop over the episode getting your state calculating a random number for your epsilon greedy strategy you're gonna set the action to be the max action of Q 1 and Q 2 if the random number is less than 1 minus epsilon otherwise you're gonna randomly sample your action space in any event you take that action get your new state reward and done flag and go ahead and tally up your reward and convert that observation to a state s prime then go ahead and calculate a separate random number the purpose of this random number is to determine which q function we're going to update you know we're gonna be using one to calculate we're alternating between them because we have to eliminate the maximization bias right one is for finding the max action one is for finding the value of that action we alternate between episodes by way of this rain number in both cases you want to collect the you want to calculate the max action either q1 or q2 and use the update rule I showed you in the slides to update the estimates for q1 and q2 as you go at the end of the episode sorry at the end of the step excuse me you want to reset the old observation to the new one so that way you can get the state up here and at the end of the episode you want to go ahead and decrease Epsilon if you're not familiar with this epsilon greedy is just a strategy for dealing with the explore exploit dilemma so an agent always has some estimate of the future rewards based on this model the environment or its experience playing the game if it's model free or a model of problem right it can either explore or exploit its best known action so one way of dealing with the dilemma of how much time should you spend exploring versus how much time should you spend exploiting is to use something called epsilon greedy meaning that some percentage of time you explore or some percentage of the time you exploit and the way that you get it to settle on a greedy strategy is to gradually decrease that exploration parameter epsilon over time and that's what we're doing here and of course you want to keep track of the total rewards for that episode and recall in the carpol example the agent gets a reward of positive 1 every time the pole stays vertical and so every move that it doesn't flop over it gets one point and at the end you're gonna go ahead and plot your running averages so I'm gonna go ahead and run that and that'll take a minute while it's running I want to ask you guys a question so what type of material do you want to see from what I'm seeing in the data the the reinforcement learning stuff is immensely popular my other content not so much so I'm going to keep focusing on this type of stuff but are you happy seeing the Sutton and Bardot type introductory material or do you want to see more deep learning type material right there's a whole host of dozens of deep reinforcement learning algorithms we can cover but I'm actually quite content to cover this stuff because I believe that if you can't master the basics then the deep learning stuff isn't going to make sense anyway right because you have the complexity of deep learning on top of the complexity of the reinforcement learning material on top of it so if there's anything a particularly you guys want to see make sure to leave a comment below and hey if you haven't subscribed and you happen to like reinforcement learning in machine learning material please consider doing so if you liked the video make sure to leave a thumbs up if you thought it sucked go ahead and leave a thumbs down and tell me why I'm happy to answer the comments hence your objections and if you guys have suggestions for improvement I'm all ears and here we are it is finally finished with all hundred thousand episodes and you can see here the running average over the course of those games as you would expect the agent begins to learn fairly quickly balancing the cart pull more and more and more by about sixty thousand games it starts to hit thee consistently hit the 200 move threshold where it is able to balance the cart pull all 200 moves of the game now recall this was with a gamma of 1.0 I'm going to go ahead and rerun this with a gamma of 0.9 and see how it does so burn this image into your brain and I'm gonna go ahead and check it out with a gamma of 0.9 and see if we can do any better then we are back with the second run using a gamma of 0.9 and you can see something quite interesting here so it actually only kind of ever reaches the 200 mark just for a handful of games and then kind of stutters along actually decreasing in performance as it goes along so something funny is going on here and to be frank I off the top of my head and not entirely certain why so I invite you all to speculate however the what's also interesting is that I this is the second time of recording this I recorded it earlier and didn't scroll down the code so you ended up staring at the same chunk of stuff had to redo it and in that case I had a gamma is 0.9 as well and it seemed to work just fine so I suspect there's some significant variation here to do with the random number generator and I could just all be do to that right this is a complex space and it wanders around different portions this could happen potentially because it doesn't visit all areas of the parameter space enough times to get a reasonable estimate of the samples and there may be some type of bias on where it visits later on in the course of the episodes although that sounds kind of unlike to me but either way that is double q-learning you can see how the hyper parameters actually affect the model and it seems to have a fairly large effect as you might expect and the next video we're gonna be taking a look at double sarsa so if you are not subscribed I ask you to please consider doing so hit the notification icon so you can see when I release a video and look forward to seeing you all in the next video well I hope that was helpful everyone so what did we learn we learned about Q learning policy gradient methods sarsa double Q learning and even how to create our own reinforcement learning environments this is a very solid foundation in the topic of reinforcement learning and then you're pretty well prepared to go out and explore more advanced topics so what are those more advanced topics so right now the forefront are things like deep deterministic policy gradients which is as you might guess from the name a more advanced version of policy gradient methods they're also actor critic methods behavioural cloning there's all sorts of more advanced topics out there that you're now pretty well-equipped to go explore these are particularly useful in environments where you have continuous action spaces so all the environments we studied in this set of tutorials have a discrete action space meaning the agent only moves or takes some discrete set of actions other environments such as the bipedal Walker car racing things of that nature have continuous state spaces so excuse me a continuous action spaces which require different mechanisms to solve Q learning really can't handle it so you're now free to go ahead and check that stuff out if you made it this far please consider subscribing to my channel machine learning with phil and i hope this is helpful all of you leave a comment down below and make sure to share this and i'll see you all in the next video