What is Q-learning?

Sean Michael Kerner

Published: Nov 21, 2024

Q-learning is a machine learning approach that enables a model to iteratively learn and improve over time by taking actions and observing which ones lead to rewards. Q-learning is a type of reinforcement learning.

With reinforcement learning, a machine learning model is trained to mimic the way animals or children learn. Good actions are rewarded or reinforced, while bad actions are discouraged and penalized.

Some forms of reinforcement learning are model-based: The agent relies on a model of the environment to plan the right actions. Q-learning provides a model-free approach to reinforcement learning. There is no model of the environment to guide the reinforcement learning process. The agent -- which is the AI component that acts in the environment -- iteratively learns and makes predictions about the environment on its own.

Q-learning also takes an off-policy approach to reinforcement learning, in contrast to on-policy methods such as state-action-reward-state-action (SARSA). A Q-learning approach aims to determine the optimal action based on its current state. It learns the value of always taking the best-known action, even while the agent follows a different policy, such as random exploration, to gather experience. Because of this separation, the agent does not need to follow the policy it is learning about, and no predefined policy is needed.

The off-policy approach in Q-learning is achieved using Q-values -- also known as action values. A Q-value is the expected future reward for taking a given action in a given state, and the Q-values are stored in the Q-table.
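
The difference between off-policy and on-policy learning shows up in the value each method learns toward. The following is a minimal sketch, not a definitive implementation: the function names, the discount factor of 0.9 and the Q-values are assumptions made for illustration.

```python
GAMMA = 0.9  # discount factor (assumed value for illustration)

def q_learning_target(reward, next_q_values):
    """Off-policy: bootstrap from the best next action, whatever is taken next."""
    return reward + GAMMA * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action):
    """On-policy: bootstrap from the action the agent actually takes next."""
    return reward + GAMMA * next_q_values[next_action]

# Made-up Q-values for the two actions available in the next state.
next_q = [0.2, 1.0]

print(q_learning_target(0.0, next_q))            # built from 1.0, the maximum
print(sarsa_target(0.0, next_q, next_action=0))  # built from 0.2, the action taken
```

When the agent explores with a poor action, the on-policy target drops with it, while the Q-learning target still reflects the best-known action.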

Chris Watkins first discussed the foundations of Q-learning in a 1989 thesis at the University of Cambridge and further elaborated on them in a 1992 publication, co-authored with Peter Dayan, titled Q-learning.

How does Q-learning work?

Q-learning is an iterative process in which multiple components work together to train a model. The agent learns by exploring the environment and updates its estimates as the exploration continues. The components of Q-learning include the following:

Agents. The agent is the entity that acts and operates within an environment.
States. The state is a variable that identifies the agent's current position in the environment.
Actions. The action is the agent's operation when it is in a specific state.
Rewards. The reward is the positive or negative response the agent receives for its actions, a foundational concept within reinforcement learning.
Episodes. An episode is one complete run of the agent through the environment. It ends when the agent reaches a terminating state and can no longer take a new action.
Q-values. The Q-value is the metric used to measure how good an action is at a particular state.

Here are two related concepts used to determine the Q-value:

Temporal difference. The temporal difference formula updates the Q-value by comparing the current estimate for a state and action with the reward received plus the estimated value of the next state, then adjusting the estimate by a fraction of the difference.
Bellman's equation. Mathematician Richard Bellman invented this equation in 1957 as a recursive formula for optimal decision-making. In the Q-learning context, Bellman's equation is used to calculate the value of a given state as the immediate reward plus the discounted value of the best next state. The state with the highest value is considered the optimal state.

Q-learning models work through trial-and-error experiences to learn the optimal behavior for a task. The Q-learning process involves modeling optimal behavior by learning an optimal action value function, or Q-function. This function represents the optimal long-term value of taking action a in state s and then following optimal behavior in every subsequent state.

Bellman's equation

Q(s,a) = Q(s,a) + α * (r + γ * max(Q(s',a')) - Q(s,a))

The equation breaks down as follows:

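Q(s, a) represents the current estimate of the expected long-term reward for taking action a in state s.
The actual reward received for that action is referenced by r, while s' refers to the next state.
The learning rate is α, which controls how far each update moves the estimate, and γ is the discount factor, which controls how much future rewards count. Both are typically set between 0 and 1.
The highest expected reward for all possible actions a' in state s' is represented by max(Q(s', a')).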
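
A single update can be worked through by hand. The sketch below applies the equation above to one made-up transition; the function name q_update and all of the numbers are assumptions chosen for illustration.

```python
def q_update(q_sa, reward, max_q_next, alpha, gamma):
    """Return the new Q(s, a) after one Q-learning update."""
    td_target = reward + gamma * max_q_next   # r + γ * max(Q(s', a'))
    td_error = td_target - q_sa               # temporal difference: target minus old estimate
    return q_sa + alpha * td_error            # move a fraction α toward the target

# Old estimate 2.0, reward 1.0, best next-state value 5.0, α = 0.5, γ = 0.9.
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=5.0, alpha=0.5, gamma=0.9)
print(new_q)  # 3.75
```

Here the target is 1.0 + 0.9 * 5.0 = 5.5, the difference from the old estimate is 3.5, and half of that difference is added to give 3.75.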

What is a Q-table?

The Q-table includes columns and rows that hold a Q-value for every action in each state of a specific environment. A Q-table helps an agent understand what actions are likely to lead to positive outcomes in different situations.

The table rows represent different situations the agent might encounter, and the columns represent the actions it can take. As the agent interacts with the environment and receives feedback in the form of rewards or penalties, the values in the Q-table are updated to reflect what the model has learned.

The purpose of the Q-table is to gradually improve the agent's performance by helping it choose actions. With more feedback, the Q-table becomes more accurate so the agent can make better decisions and achieve optimal results.

The Q-table is directly related to the concept of the Q-function. The Q-function is a mathematical function that takes the current state of the environment and the action under consideration as inputs. The Q-function then outputs the expected future reward for that action in the specific state. The Q-table allows the agent to look up the expected future reward for any given state-action pair to move toward an optimized state.
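
As a rough illustration of the relationship between the two, the sketch below stores a Q-table as a list of rows and exposes the Q-function as a lookup. The three states, two actions and every value in the table are made up, and the names q_function and best_action are hypothetical.

```python
# Hypothetical problem with 3 states (rows) and 2 actions (columns).
ACTIONS = ["left", "right"]
q_table = [
    [0.1, 0.7],   # state 0
    [0.4, 0.2],   # state 1
    [0.0, 0.9],   # state 2
]

def q_function(state, action_index):
    """Look up the expected future reward for a state-action pair."""
    return q_table[state][action_index]

def best_action(state):
    """Return the index of the action with the highest Q-value in this state."""
    row = q_table[state]
    return max(range(len(row)), key=row.__getitem__)

print(q_function(1, 0))          # value of taking "left" in state 1
print(ACTIONS[best_action(2)])   # action with the highest value in state 2
```

For larger problems, the same table is often held in a NumPy array so that a whole row can be read and compared at once.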

What is the Q-learning algorithm process?

The Q-learning algorithm process is an iterative method where the agent learns by exploring the environment and updating the Q-table based on the rewards received.

The steps involved in the Q-learning algorithm process include the following:

Q-table initialization. The first step is to create the Q-table as a place to track the estimated value of each action in each state.
Observation. The agent needs to observe the current state of the environment.
Action. The agent chooses to act in the environment. Upon completion of the action, the agent observes the reward and the new state to see if the action was beneficial.
Update. After the action has been taken, it's time to update the Q-table with the results.
Repeat. Repeat the observation, action and update steps until the model reaches a termination state for a desired objective.
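
The steps above can be sketched in plain Python. The environment here is a hypothetical five-cell corridor with a reward of 1.0 for reaching the last cell; the step function, the learning rate, the discount factor and the episode count are all assumptions. For simplicity the agent acts purely at random, which is enough because Q-learning is off-policy.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor: states 0..4, goal at 4
ACTIONS = (-1, +1)             # action 0 moves left, action 1 moves right
ALPHA, GAMMA = 0.5, 0.9        # learning rate and discount factor (assumed)

def step(state, action):
    """Toy environment: return (next_state, reward, done)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

rng = random.Random(0)

# Q-table initialization: one row per state, one column per action.
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

for episode in range(200):
    state, done = 0, False                      # Observation: the start state
    while not done:
        action = rng.randrange(len(ACTIONS))    # Action: explore at random
        next_state, reward, done = step(state, action)
        # Update: move Q(s, a) toward r + γ * max(Q(s', a')).
        best_next = 0.0 if done else max(q_table[next_state])
        td_error = reward + GAMMA * best_next - q_table[state][action]
        q_table[state][action] += ALPHA * td_error
        state = next_state                      # Repeat from the new state

# Highest-valued action in each non-terminal state after training.
greedy = [row.index(max(row)) for row in q_table[:GOAL]]
print(greedy)
```

Even though the agent never follows its own estimates during training, the highest-valued action in each cell ends up pointing toward the goal.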

What are the advantages of Q-learning?

The Q-learning approach to reinforcement learning can potentially be advantageous for several reasons, including the following:

Model-free. The model-free approach is the foundation of Q-learning and one of the biggest potential advantages for some uses. Rather than requiring prior knowledge about an environment, the Q-learning agent can learn about the environment as it trains. The model-free approach is particularly beneficial for scenarios where the underlying dynamics of an environment are difficult to model or completely unknown.
Off-policy optimization. The model can optimize to get the best possible result without being strictly tethered to a policy that might not enable the same degree of optimization.
Flexibility. The model-free, off-policy approach gives Q-learning the flexibility to work across a variety of problems and environments.
Offline training. A Q-learning model can be trained on pre-collected, offline data sets.
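
Offline training can be illustrated with a fixed log of past transitions. This is a minimal sketch under stated assumptions: the logged tuples are made up, the layout (state, action, reward, next state, done) is one common convention, and the hyperparameter values are arbitrary.

```python
# Hypothetical log of (state, action, reward, next_state, done) tuples,
# for example recorded earlier by some other controller.
logged = [
    (0, 1, 0.0, 1, False),
    (1, 1, 0.0, 2, False),
    (2, 1, 1.0, 3, True),
    (1, 0, 0.0, 0, False),
]

ALPHA, GAMMA = 0.5, 0.9                    # assumed learning rate and discount
q_table = [[0.0, 0.0] for _ in range(4)]   # 4 states x 2 actions

for sweep in range(100):                   # replay the fixed data set repeatedly
    for s, a, r, s_next, done in logged:
        target = r if done else r + GAMMA * max(q_table[s_next])
        q_table[s][a] += ALPHA * (target - q_table[s][a])

print([round(max(row), 3) for row in q_table])
```

No new interaction with the environment takes place during the loop. The agent can only learn about state-action pairs that appear in the log, so pairs that were never recorded keep their initial value of zero.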

What are the disadvantages of Q-learning?

The Q-learning approach to reinforcement learning also has some disadvantages, such as the following:

Exploration vs. exploitation tradeoff. It can be hard for a Q-learning model to find the right balance between trying new actions and sticking with what's already known. It's a dilemma that is commonly referred to as the exploration vs. exploitation tradeoff for reinforcement learning.
Curse of dimensionality. Q-learning can potentially face a machine learning risk known as the curse of dimensionality. The curse of dimensionality is a problem with high-dimensional data where the amount of data required to represent the distribution increases exponentially. In Q-learning, the number of entries in the Q-table grows exponentially with the number of variables that describe the state. This can lead to computational challenges and decreased accuracy.
Overestimation. A Q-learning model can sometimes be too optimistic and overestimate how good a particular action or strategy is, because each update takes the maximum over estimates that might be noisy.
Performance. A Q-learning model can take a long time to converge on the best method if there are several ways to approach a problem.
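
The curse of dimensionality is easy to see with some arithmetic on the Q-table size. The sketch below assumes a hypothetical problem where each state variable can take 10 values and the agent has 4 actions; the function name q_table_entries is made up for illustration.

```python
from math import prod

def q_table_entries(values_per_variable, n_actions):
    """Number of Q-values a table-based agent must store."""
    return prod(values_per_variable) * n_actions

# Each state variable takes 10 possible values; the agent has 4 actions.
for n_vars in (2, 4, 8):
    print(n_vars, q_table_entries([10] * n_vars, n_actions=4))
```

Going from two state variables to eight raises the table from 400 entries to 400 million, and every one of those entries needs enough visits to be estimated well.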

What are some examples of Q-learning?

Q-learning models can improve processes in various scenarios. Here are a few examples of Q-learning uses:

Energy management. Q-learning models help manage energy for different resources such as electricity, gas and water utilities. A 2022 report from IEEE provides a precise approach for integrating a Q-learning model for energy management.
Finance. A Q-learning model can assist with decision-making, such as determining optimal moments to buy or sell assets.
Gaming. Q-learning models can train gaming systems to achieve an expert level of proficiency in playing a wide range of games as the model learns the optimal strategy to advance.
Recommendation systems. Q-learning models can help optimize recommendation systems, such as advertising platforms. For example, an ad system that recommends products commonly bought together can be optimized based on what users select.
Robotics. Q-learning models can help train robots to execute various tasks, such as object manipulation, obstacle avoidance and transportation.
Self-driving cars. Autonomous vehicles use many different models, and Q-learning can help train them to make driving decisions, such as when to switch lanes or stop.
Supply chain management. The flow of goods and services as part of supply chain management can be improved with Q-learning models to help find the optimized path for products to market.

Q-learning with Python

Python is one of the most common programming languages for machine learning. Beginners and experts commonly use Python to apply Q-learning models. For Q-learning and any data science operation in Python, users need a system with Python and the NumPy (Numerical Python) library, which provides support for the mathematical functions used with AI.

With Python and NumPy, Q-learning models are set up with a few basic steps:

Define the environment. Create variables for states and actions to define the environment.
Initialize the Q-table. The initial condition of the Q-table is set to zero.
Set hyperparameters. Set parameters in Python to define the number of episodes, the learning rate, the discount factor and the exploration rate.
Execute Q-learning algorithm. The agent selects an action either randomly or based on the highest Q-value for the current state. After the action is taken, the Q-table is updated with the results.
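
Put together, the four steps might look like the following NumPy sketch. It is one possible setup rather than a reference implementation: the 3x3 grid environment, the step and greedy_path helpers, and every hyperparameter value (including the decaying exploration rate) are assumptions made for illustration.

```python
import numpy as np

# 1. Define the environment: hypothetical 3x3 grid, start in cell 0, goal in cell 8.
N_ROWS, N_COLS = 3, 3
N_STATES, N_ACTIONS = N_ROWS * N_COLS, 4
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
START, GOAL = 0, 8

def step(state, action):
    """Move within the grid (edges block movement); reward 1.0 at the goal."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    next_state = row * N_COLS + col
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

# 2. Initialize the Q-table to zero.
q_table = np.zeros((N_STATES, N_ACTIONS))

# 3. Set hyperparameters (illustrative values, not tuned).
n_episodes, max_steps = 500, 100
alpha, gamma = 0.5, 0.9
epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.99

# 4. Execute the Q-learning algorithm.
rng = np.random.default_rng(0)
for episode in range(n_episodes):
    state = START
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))     # explore: random action
        else:
            action = int(np.argmax(q_table[state]))   # exploit: highest Q-value
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * np.max(q_table[next_state])
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        if done:
            break
    epsilon = max(epsilon_min, epsilon * epsilon_decay)   # explore less over time

def greedy_path(start=START, limit=N_STATES):
    """Follow the highest Q-value from the start and return the visited states."""
    path, state = [start], start
    for _ in range(limit):
        state, reward, done = step(state, int(np.argmax(q_table[state])))
        path.append(state)
        if done:
            break
    return path

print(greedy_path())
```

The exploration rate starts at 1.0 so that early episodes are fully random, then decays so that later episodes mostly follow the highest Q-values. Other schedules would work as well.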

Q-learning application

Before applying a Q-learning model, it's critical to first understand the problem and how Q-learning training can be applied to that problem.

Set up Q-learning in Python with a standard code editor or an integrated development environment to write the code. To apply and test a Q-learning model, use a machine learning tool, such as the Farama Foundation's Gymnasium, which provides environments for training agents. Other common tools include the open source PyTorch machine learning framework, which supports reinforcement learning workflows including Q-learning.