Reinforcement learning is a technique largely used for training game-playing AI, like teaching a computer to win at Go (as AlphaGo does) or finish Super Mario Bros levels as fast as possible. Reinforcement learning algorithms, or agents, learn by interacting with their environment.
In our latest “In Plain English” blog post, we unpack what reinforcement learning is in more (digestible) detail and the potential it has with the rise of Generative AI.
Unpacking Reinforcement Learning
Many view reinforcement learning as an approach that falls between supervised and unsupervised learning. As a reminder, in supervised learning, the algorithm trains on labeled data. The labeled data acts as a teacher, providing the algorithm with examples of what the correct output should be.
On the other hand, we typically use unsupervised learning when we want to identify patterns and relationships in data, often with large datasets where labeling the data would be time-consuming or impractical.
According to popular data science, ML, and AI website KDnuggets, “Reinforcement learning is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error using feedback from its own actions and experiences. Though both supervised learning and reinforcement learning use mapping between input and output, unlike supervised learning where feedback provided to the agent is the correct set of actions for performing a task, reinforcement learning uses rewards and punishment as signals for positive and negative behavior.”
Essentially, reinforcement learning is not strictly “supervised” as it doesn’t rely exclusively on a set of labeled training data, but it’s not “unsupervised” because there’s a reward that we want the reinforcement learning agent to maximize.
As alluded to above, there are many examples of reinforcement learning, such as game-playing AI like Google’s AlphaGo, where an action is to place a piece on the board (and the environment is the layout of the board with all of its pieces), with the goal of winning the game.
Unlike supervised learning, reinforcement learning has no labels: you take certain actions and see what the outcome is. If you win the game, you “reinforce” the moves you made in that game. If you lose, you negatively reinforce them, meaning the next time you play, you are less likely to make those moves and more likely to repeat the ones that led to a victory.
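To make that “reinforce the winning moves” idea concrete, here is a minimal, hypothetical sketch in Python: it keeps a value per move, nudges those values toward +1 after a win and -1 after a loss, and then favors higher-valued moves in later games. The function names and learning rate are made up for illustration; a system like AlphaGo uses far more sophisticated policy and value networks.

```python
import random
from collections import defaultdict

# Hypothetical illustration: track a value for each move and nudge it up or
# down depending on whether the game that contained it was won or lost.
move_values = defaultdict(float)
learning_rate = 0.1

def update_after_game(moves_played, won):
    """Reinforce (or discourage) every move made during one game."""
    outcome = 1.0 if won else -1.0
    for move in moves_played:
        move_values[move] += learning_rate * (outcome - move_values[move])

def pick_move(legal_moves, exploration=0.2):
    """Mostly pick the highest-valued move, but sometimes explore."""
    if random.random() < exploration:
        return random.choice(legal_moves)
    return max(legal_moves, key=lambda m: move_values[m])
```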
Let’s take another example. Imagine you’re a javelin thrower. In one case (supervised learning), you have a coach who tells you at each moment: your arm should be like this, you should accelerate, you should look in this direction. In the other case (reinforcement learning), you only know the distance at which your javelins land.
You can improve yourself in both cases, but you’ll need many more trials in the second case (because it’ll be hard to know what precisely in your actions led to the more or less satisfying result). This emphasizes two key messages about reinforcement learning:
- Rewards can be enough to learn from, but learning this way is far less efficient than knowing the correct answer.
- It’s sometimes easier to measure rewards than to collect correct answers.
Reinforcement learning becomes interesting when rewards are so much easier and cheaper to measure than correct answers are to collect that this outweighs their lower value for learning.
What Are the Basic Elements of a Reinforcement Learning Problem?
Here are some helpful key terms to get to know when working with reinforcement learning:
- Environment: The physical world in which the agent operates
- State: The current situation of the agent
- Reward: The feedback from the environment
- Policy: The method to map the agent’s state to actions
- Value: The future reward that an agent would receive by taking an action in a particular state
In this example of PacMan from Towards Data Science, the goal of the agent (in this case, PacMan) is to eat the food in the grid while avoiding the ghosts along the way. The grid world is the interactive environment in which the agent acts.
The agent (PacMan) receives a reward for eating food and a punishment if it gets killed by a ghost (i.e., loses the game). The “states” are the agent’s (PacMan’s) locations in the grid world (the environment), and the total cumulative reward corresponds to the agent (PacMan) winning the game.
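To see how these pieces fit together, here is a minimal, hypothetical Q-learning sketch in Python for a tiny grid world in the spirit of the PacMan example. The grid layout, reward values, and hyperparameters are made up for illustration and are not taken from the Towards Data Science example.

```python
import random

# Illustrative grid world: the agent moves around a small grid, earns +10 for
# reaching the food and -10 for stepping onto a ghost. States are grid
# positions; the policy is derived from a table of Q-values (the estimated
# future reward of taking an action in a given state).
GRID_SIZE = 4
FOOD, GHOST, START = (3, 3), (1, 2), (0, 0)
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

q_table = {}                     # (state, action) -> estimated value
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def step(state, action):
    """Environment: apply an action, return the next state, reward, and done flag."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(state[1] + dy, 0), GRID_SIZE - 1)
    next_state = (x, y)
    if next_state == FOOD:
        return next_state, 10, True    # reward for eating the food
    if next_state == GHOST:
        return next_state, -10, True   # punishment for hitting the ghost
    return next_state, -1, False       # small cost for every move

def choose_action(state):
    """Policy: mostly greedy on Q-values, with occasional exploration."""
    if random.random() < epsilon:
        return random.choice(list(ACTIONS))
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))

for episode in range(500):
    state, done = START, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        best_next = max(q_table.get((next_state, a), 0.0) for a in ACTIONS)
        old = q_table.get((state, action), 0.0)
        # Value update: move the estimate toward reward + discounted future value.
        q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
        state = next_state
```

After enough episodes, picking the highest-valued action in each state should steer the agent toward the food while avoiding the ghost.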
Practical Applications of Reinforcement Learning
As we’ve mentioned, reinforcement learning is commonly used in building the AI systems that play computer and board games (i.e., Go with AlphaGo, chess, ATARI games, backgammon, etc.). Although many of these games have complete information, reinforcement learning can also work for games with incomplete information.
This will actually be the case more often than not: if you are using reinforcement learning to guide a robot from place A to place B, the robot only observes the environment through radar, images, sound, and so on, and must navigate based on this information because it’s all it has to capture the state of the environment. The same can be said for self-driving cars; they are learning as they go.
In the robotics and industrial automation space, reinforcement learning is used to give robots the ability to learn, adapt to, and improve at tasks whose constraints are constantly changing. The more robots learn using reinforcement learning, the more accurate they become and the faster they can complete previously time-consuming tasks (i.e., bin picking in a warehouse).
Where Generative AI Comes In
Now, how does reinforcement learning come into play with Generative AI (and Generative AI applications such as ChatGPT)? It’s well explained in this “State of GPT” talk by Andrej Karpathy, AI researcher and founding member of OpenAI. As Karpathy says, reinforcement learning is useful because “it’s easier to discriminate than to generate.”
When you use supervised learning in the context of a Large Language Model (LLM), you need to collect prompts with the corresponding good answers. But coming up with these good answers may be very challenging. To take an example from the talk above, if the prompt is "write a haiku about paperclips," it would take some effort to create a proper answer.
With reinforcement learning, you don't need a correct answer; you just need some reward. For example, you can have an LLM generate two answers and have a human annotator rank them. This feedback can serve as the reward (in practice it's a bit more involved: the feedback is used to train a reward model, which then estimates the rewards given to the reinforcement learning algorithm).
Depending on the task, moving from "you need to come up with the right answer" to "you need to rank these answers" can greatly reduce the data collection effort needed to train an LLM. This principle is called reinforcement learning from human feedback (RLHF): training a reward model directly from human feedback and using that model as a reward function to optimize the agent's policy with reinforcement learning.
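As a rough illustration of how ranked answers turn into a reward signal, here is a minimal, hypothetical PyTorch sketch of training a reward model from pairwise preferences. The feature vectors, network size, and training loop are placeholders; a real RLHF pipeline scores full text with a language-model backbone and then optimizes the LLM's policy against the learned reward with an RL algorithm such as PPO.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the reward-model step in RLHF: a human has ranked two
# answers to the same prompt, and we train a small model so that the preferred
# answer scores higher. Real systems embed text with a language-model backbone;
# here we stand in fixed-size feature vectors for simplicity.
EMBED_DIM = 128

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, features):
        # Returns a scalar reward estimate per answer.
        return self.net(features).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in features for (preferred answer, rejected answer) pairs.
preferred = torch.randn(32, EMBED_DIM)
rejected = torch.randn(32, EMBED_DIM)

for _ in range(100):
    r_preferred = model(preferred)
    r_rejected = model(rejected)
    # Pairwise (Bradley-Terry style) loss: push the preferred answer's
    # reward above the rejected answer's reward.
    loss = -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained model can then assign rewards to new answers, and those rewards
# drive the reinforcement learning step that updates the LLM's policy.
```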