
Introduction


Reinforcement Learning (RL) is a branch of Machine Learning in which decision making is driven by rewards. In nature, we can observe that both humans and animals take actions that yield rewards. In biological brains this reward signal is carried by the neurotransmitter dopamine, whereas in a machine we can simply store the reward as a variable in memory. This reward-based decision making is what we call Reinforcement Learning.

RL allows us to program an agent that makes decisions based on rewards, a pattern inspired by behaviourist psychology. We will use mathematical equations rather than pseudocode as our "programming language", but only to clarify things and make the explanations easier. No worries, the details will be provided.

Once the algorithm has been explained, you will be able to code the concepts in any language of your choice.

Formal Definitions

Consider an environment and an agent within that environment. The agent can be anything: a human, a robot, or a software agent in a virtual world. The agent has the capacity to act, and every action it takes may change the environment and hence affect its own future options. Because what happens next depends on the current state and the chosen action, the process is Markovian. The agent must therefore choose its actions carefully so that it can influence what happens in its future states.

How does the agent know which actions to choose? Which are better and which aren't? This evaluation is done by a reward. At each step the agent observes its state, performs an action, and the environment responds with a new observation and a reward. This reward signal lets the agent plan, reason, and decide how to make its next move.
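The loop described above can be sketched in a few lines of code. This is a minimal illustration, not part of the original text: the five-state "walk" environment, the `step` function, and the random policy are all hypothetical stand-ins for a real environment and agent.

```python
import random

# Toy environment (hypothetical): integer positions 0..4,
# where reaching position 4 yields a reward of +1.

def step(state, action):
    """Apply an action (-1 or +1); return (next_state, reward, done)."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def random_policy(state):
    """Placeholder policy: choose left or right uniformly at random."""
    return random.choice([-1, 1])

# The agent-environment interaction loop described in the text:
# observe state -> act -> receive reward and new state.
state, total_reward, done = 0, 0.0, False
while not done:
    action = random_policy(state)              # agent chooses an action
    state, reward, done = step(state, action)  # environment responds
    total_reward += reward                     # reward guides learning
```

A learning agent would replace `random_policy` with something that improves from the rewards it collects; here the loop only shows the flow of state, action, and reward.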

The way in which an agent maps states to behaviour is called its policy.
The estimated quality of a state, or of a state-action pair, is called the value function.
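In the simplest (tabular) setting, both objects are just lookup tables. The sketch below assumes the same hypothetical five-state walk environment; the names `policy`, `Q`, and `greedy` are illustrative, not from the original text.

```python
# Tabular sketch of a policy and an action-value function
# for a hypothetical 5-state environment with two actions.

states = range(5)
actions = [-1, +1]

# Policy: maps each state to the action the agent takes there.
policy = {s: +1 for s in states}  # e.g. always move right

# Action-value function Q(s, a): estimated quality of taking
# action a in state s (initialised to zero before learning).
Q = {(s, a): 0.0 for s in states for a in actions}

def greedy(state):
    """The greedy policy w.r.t. Q: pick the highest-valued action."""
    return max(actions, key=lambda a: Q[(state, a)])
```

Learning algorithms update the entries of `Q` from observed rewards, and the policy is then read off (for example greedily) from those values.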

On-policy methods estimate the value of a policy while using it for control.

Off-policy methods use one policy to generate behaviour, called the behaviour policy, which may be unrelated to the policy that is evaluated and improved, called the estimation policy. This separation allows an exploitation-based estimation policy and an exploration-based behaviour policy to operate at the same time.
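The on-policy/off-policy distinction shows up directly in the update targets of SARSA (on-policy) and Q-learning (off-policy). The sketch below contrasts just those two targets; all state names, values, and constants are made up for illustration.

```python
# Contrast of on-policy (SARSA) vs off-policy (Q-learning) targets.
# Q, the states "s"/"s2", and all numbers are hypothetical.
alpha, gamma = 0.1, 0.9

Q = {("s", "a"): 0.0, ("s2", "a"): 2.0, ("s2", "b"): 5.0}

s, a, s2, reward = "s", "a", "s2", 1.0

# SARSA (on-policy): bootstraps from the action a2 the behaviour
# policy actually chose in the next state.
a2 = "a"  # suppose the (exploring) behaviour policy picked this
sarsa_target = reward + gamma * Q[(s2, a2)]

# Q-learning (off-policy): bootstraps from the best next action,
# regardless of what the behaviour policy will actually do.
q_target = reward + gamma * max(Q[(s2, b)] for b in ("a", "b"))

q_sarsa = Q[(s, a)] + alpha * (sarsa_target - Q[(s, a)])
q_qlearn = Q[(s, a)] + alpha * (q_target - Q[(s, a)])
```

Because Q-learning evaluates the greedy (exploitation-based) policy while behaviour can keep exploring, its target here (5.5) exceeds SARSA's (2.8), which follows the exploratory choice.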
