Up to this point, we've only described the reinforcement learning problem: given an MDP, we want to find actions that maximize the sum of our rewards (i.e., the return). The process of deciding an action from a state is known as a policy, so in other words, we want to learn the best policy for a given task. There are several algorithms that do this, but one of the most straightforward, and the one we'll look at here, is known as Q-learning... Note the action selection process. Initially, our agent has no idea which actions are good. As such, we want it to explore broadly so that it gathers a diverse range of experience to build on. The method determines an exploration rate based on how far into training we are. As the agent becomes better trained, it takes random actions (i.e., explores) less often and takes the best available action more often. This is known as an "epsilon-greedy" policy. When we're done training, or evaluating our model, we..
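
To make the action selection concrete, here is a minimal sketch of an epsilon-greedy policy with a decaying exploration rate. It is not the post's own code: the function names `exploration_rate` and `select_action`, the linear decay schedule, and the default values `eps_start=1.0` and `eps_end=0.05` are all assumptions chosen for illustration. The Q-values are passed in as a plain list with one estimated value per action.

```python
import random


def exploration_rate(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Return epsilon for the current training step.

    Assumption: a linear schedule from eps_start down to eps_end.
    Other schedules (e.g. exponential decay) would work just as well.
    """
    # Fraction of training completed, clamped to [0, 1].
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Pick an action index using an epsilon-greedy rule.

    q_values: list of estimated values, one per action (hypothetical input).
    epsilon:  probability of taking a uniformly random action.
    """
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    q = [0.1, 0.9, 0.3]
    for step in (0, 500, 1000):
        eps = exploration_rate(step, total_steps=1000)
        print(step, round(eps, 3), select_action(q, eps))
```

Early in training epsilon is close to 1, so nearly every action is random. By the end it has dropped to the floor value, so the agent mostly takes the action with the highest estimated value. Passing `epsilon=0` makes the choice purely greedy.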
Domain
brandonmorris.dev
