IBM Z Day on Nov. 21: Discover the ideal environment for modern, mission-critical workloads. Learn more

Train a software agent to behave rationally with reinforcement learning

Reinforcement learning is a subfield of machine learning that you can use to train a software agent to behave rationally in an environment. The agent is rewarded based on the actions it takes within the environment. One example of learning comes from 1992, when IBM’s Gerry Tesauro used reinforcement learning to build a self-learning backgammon player. This article explores reinforcement learning, some problem areas to which you can apply it, and demonstrates the technology by using a simple simulation.

Many algorithms in machine learning fall into one of two categories: unsupervised or supervised learning. In unsupervised learning, an algorithm segregates the unlabeled data into groups based on their underlying structure, which makes the data understandable. An example of unsupervised learning is the k-means clustering algorithm, which partitions data into clusters with the nearest mean. On the other side of the spectrum is supervised learning. In supervised learning, a set of input variables is mapped to a predefined set of output variables through a mapping function (with the goal of approximating new input data for prediction). This process is categorized as supervised because an algorithm uses the input data and output data to train a mapping function. Neural networks are an example of supervised learning.

Reinforcement learning is different from either of these learning approaches but sides more with the supervised end of the spectrum. In reinforcement learning, the mapping of state to action is learned through a cumulative reward or punishment for its actions. The mapping takes place online, through a balance of exploration (trying new actions for a given state) and exploitation (by using existing knowledge of the state/action mapping). The wanted result is an optimal state-to-action policy that maximizes some overall reward as seen below.

A single line showing the spectrum of learning types, with supervised learning on the left end, unsupervised learning on the right end, and reinforcement learning just left of center

Let’s begin with a quick history of reinforcement learning. From there, I’ll show a sample implementation of reinforcement learning—in particular, the Q-learning technique.

A history of reinforcement learning

Reinforcement learning has its roots in animal psychology: learning by trial and error. Early artificial intelligence (AI) researchers thought that this mechanism could be implemented within machines and used to learn how to map states (the environment) to actions. Marvin Minsky built the earliest example of reinforcement learning in 1951 to mimic a rat learning how to solve a maze (implemented with vacuum tubes that represented the 40 neurons of the simulated rat brain). As the robotic rat solved the maze, the synapses were reinforced based on its ability to escape.

Forty years later, reinforcement learning saw some success. In 1992, IBM researcher Gerald Tesauro developed a backgammon player called TD-Gammon by using reinforcement learning. Tesauro used temporal-difference learning (called TD lambda) to train a neural network that consisted of 80 hidden units. TD-Gammon learned the game of backgammon without knowledge of the game, building its expertise through self-play. TD-Gammon played at the level of top human players and identified new strategies for playing the game.

IBM applied reinforcement learning to IBM Watson®, a question-and-answer system that understands natural language and can respond in natural language. Specifically, IBM applied reinforcement learning to the strategy of IBM Watson playing Jeopardy!, such as whether to attempt an answer to a question (based on its formulated answers and degree of certainty), which square to select on the board, and how to wager in the game (in particular, the daily doubles). IBM Watson defeated former Jeopardy! champions in 2011.

In 2016, Google applied reinforcement learning to build a Go-playing program. Go is notoriously difficult to develop for given the scale of the possible moves (a 19×19 board with two types of stones). An opening move in chess has 20 possible options, but Go has 361 possible opening moves. Google’s AlphaGo learned by analyzing the moves of professional players, and then played itself to increase its expertise (although it’s now fully trained by self-play). Using deep reinforcement learning, AlphaGo defeated a Korean Go grandmaster.


Now, let’s look at a type of reinforcement learning called Q-learning, and then build a simple simulation to demonstrate it.

Consider an agent in a 2-dimensional world. The world is made up of cells that the agent can occupy (the states). In each state are actions that the agent can perform—namely, move in one of four directions to a new cell. When the agent moves, it can receive a reward (or a reward in the future that was received based on this initial move). Given this reward, it can assign a preference to the action for a state so that the state is preferred in the future (as the goal of the agent is to maximize its reward). The learning process is then identifying the optimal action for each state that provides the highest reward. Identifying the policies of actions for given states is the process of learning.

In Q-learning, each state-action pair is assigned a Q-value, which represents the sum of reinforcements that are calculated by a Q-value function. This work is performed as the agent explores the environment. When a particular action is performed in a given state, a reward is received, and it updates the Q-value for the state–action pair to remember this. As the agent wanders, the Q-values for state–action pairs are refined so that later (post-learning) the agent can apply the best action for a given state (given its Q-value). Note that during learning, the agent can probabilistically choose an action for a given state (as a function of each action’s Q-value in relation to the sum of Q-values). This process allows the agent to prefer an action for a given state (to use its learned knowledge) but sometimes choose a less optimal action for exploration as seen below.

Process diagram showing the process loop between the site/feedback and action for the agent and environment based on the environment, state, action, and Q-value

The Q-value function incorporates a couple of factors that you can use to tailor its operation. The first is a learning rate (alpha), which defines how much of a new Q-value will override the old one. A value of 0 means that the agent won’t learn anything (old information is all that’s important), where a value of 1 means that newly discovered information is the only thing that’s important. The next factor is called the discount factor (gamma), and it defines the importance of future rewards. A value of 0 means that only short-term rewards are considered, where a value of 1 puts more importance on long-term rewards. After a move is selected (action for a given state), the Q-value for that action is updated by using the current Q-value and the reward and max Q-value from the target state as seen below.

Q-learning equation showing the learning and discount factors, the reward, the new and current values, and the future value estimate

The state–action pair policies are created in episodes. When the agent reaches the final state, the episode ends and a new episode can begin. Training can also be bounded by epochs (steps by the agent in the environment). Note that states not entered and actions not tried are not recorded in the state–action table. If an agent doesn’t explore some portion of the state-space, it will have no idea of the benefit (or punishment) that awaits it there. In problems with massive state-spaces, it’s possible to find a solution without exploring the entire state-space. The simple flow in the following figure illustrates the learning process within Q-learning.

Flow chart showing the Q-learning process

From here, we’ll look at a simple implementation of Q-learning in which an agent learns to navigate an obstacle-filled environment to find a reward.

Sample implementation

This sample implementation can be found on GitHub. In this simple implementation, I create an environment of 20×20 cells. Each cell can contain an obstacle (such as a wall or impassible object), open space (with a reward value of 0), and the goal cell (with a reward value of 1). The agent begins at the upper-left cell, and then uses Q-learning to find an optimal path to the goal, which is displayed at the end of the learning process.

Let’s start with the important structures I used for this Q-learning implementation. The first is the state–action structure, which contains the Q-values for the actions of a given state. The MAX_ACTIONS symbol represents the four actions that an agent can take within any given state (0 = North, 1 = East, 2 = South, 3 = West). QVal represents the Q-values for each of the four actions, and QMax is the largest Q-value (used to determine the best path when the agent uses its learned knowledge). This structure is instantiated for each cell in the environment. An environment is also created, which is represented as characters (|, +, -, # as obstacles; ‘ ‘ as open space; and $ as the goal state):

typedef struct {
   double QVal[ MAX_ACTIONS ];  // Q‑value for each action.
   double QMax;                 // Maximum Q‑value.
} stateAction_t;
stateAction_t stateSpace[ Y_MAX ][ X_MAX ];
char environment [ Y_MAX ][ X_MAX ]={};

Let’s start from the top with the main function. This function implements the Q-learning loop, where the agent randomly tries actions; updates the Q-values; and, after some number of epochs, shows the best path found to the goal:

int main()
   pos_t agent = start;

   srand( time( NULL ) );

   // Init the state/action Q data
   initStateSpace( );

   // Iterate a maximum number of steps.
   for ( int epochs = 0 ; epochs < MAX_EPOCHS ; epochs++ )
      // Select an action for the agent based on the desired policy.
      int action = ChooseAgentAction( &agent, EXPLORE );

      // Update the agent based upon the action.
      UpdateAgent( &agent, action );

   // Show the agent's path
   ExecuteAgent( );

   return 0;

Next, the ChooseAgentAction function selects the next action for the agent based on the wanted policy (explore versus exploit). For the exploit policy, the function identifies the action with the largest Q-value and returns that action. This choice represents the best action for this state and mimics the agent exploiting its knowledge to move quickly toward the goal. For the explore policy, I simply get a random action, and then return this action if it’s a legal action (moves the agent to the goal or open space and not into an obstacle). Another approach to explore is to select the action probabilistically as a function of the QVal (using QVal as a probability over the sum of Q-values for the state):

int ChooseAgentAction( pos_t *agent, int actionSelection )
   int action;

   // Choose the best action (largest Q‑value)
   if ( actionSelection == EXPLOIT )
      for ( action = 0 ; action < MAX_ACTIONS ; action++ )
         if ( stateSpace[ agent‑>y ][ agent‑>x ].QVal[ action] ==
              stateSpace[ agent‑>y ][ agent‑>x ].QMax )
   // Choose a random action.
   else if ( actionSelection == EXPLORE )
        action = getRand( MAX_ACTIONS );
      } while ( !legalMove( agent‑>y, agent‑>x, action ) );

   return action;

The last function of this simulation I review is called UpdateAgent, and it implements the core of the Q-learning algorithm. Note that this function is provided with the wanted action (a direction to move), which I use to identify the next state to enter. The reward from this state is extracted from the environment, and then used to calculate the updated Q-value from the current state (and cache the QMax value, in case it has changed). If the new state entered is the goal state, I place the agent back at the initial state to restart the learning process:

void UpdateAgent( pos_t agent, int action )
   int newy = agent‑>y + dir[ action ].y;
   int newx = agent‑>x + dir[ action ].x;
   double reward = (double)getReward( environment[ newy ][ newx ] );

   // Evaluate Q value
   stateSpace[ agent‑>y ][ agent‑>x ].QVal[ action ] +=
      ( reward + ( DISCOUNT_RATE * stateSpace[ newy ][ newx ].QMax) ‑
                 stateSpace[ agent‑>y ][ agent‑>x ].QVal[ action ] );

   CalculateMaxQ( agent‑>y, agent‑>x );

   // Update the agent's position
   agent‑>x += dir[ action ].x;
   agent‑>y += dir[ action ].y;

   // If agent has reached the goal, move it back to the initial state
   if ( ( agent‑>x == goal.x ) && ( agent‑>y == goal.y ) )
      agent‑>x = start.x; agent‑>y = start.y;


This small amount of code (and a few others that you can review at GitHub) implements Q-learning for this demonstration. You can build the code by using make, and then run it by using the program name qlearn. The following illustration shows the output of the code and the path selected as . characters. I performed this in the context of the ExecuteAgent function (not shown), which uses the exploit policy to choose the best path:

$ ./qlearn 

+ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ + 
| . .             # # #   # # # # #   | 
|   .   # # #      # # #  # # # # #   | 
| # .     # # #           #   # #     | 
| # .       # # # # # # # #           | 
| # .       # # # # # # # #   # #     | 
|   .   # # #       # #       # # #   | 
|   . # # # # #      # #      # # #   | 
|   . # #                       # #   | 
|   . . . . . . . .   # # #   # # #   | 
|   #      # # # .   # # #    # # # # | 
| # # #    # # # . . . # #    # $ . . | 
| # # #  # # # #      . . . . # # # . | 
| # # #    # # #    # # # # . # # # . | 
| # # #      # #  # # #     . . # # . | 
| # #          #  #         # . # # . | 
| #        # #          # # # . # # . | 
|       # # # #         # # # . . . . | 
|       # # # # #                     | 
+ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ ‑ + 


Going further

Reinforcement learning, inspired by behavioral psychology, is a useful machine learning technique that you can use to identify actions for states within an environment. The approach can allow an agent to learn to interact in the environment for some cumulative reward. This article explored Q-learning, where the algorithm requires no model to understand the environment and builds an action selection policy. You can use this approach to solve a wide range of problems.

Another approach to reinforcement learning is State-Action-Reward-State-Action, which has many similarities to Q-learning. Both have similarities to learning classifier systems, which build context-dependent rules for complex solution spaces.

Temporal difference learning is a prediction method within reinforcement learning that also works on the idea that predictions can be learned from observations in an environment. But, variants of temporal difference learning can update prior state-action Q-values instead of just the current.

Temporal difference learning began with work from Arthur Samuel (who famously developed the AI for a checkers-playing program at IBM on the IBM® 701 platform) and continued in the games domain with TD-Gammon from Gerald Tesauro at IBM in 1992. Google continued this trend with DeepMind, using deep reinforcement learning to play and beat human players’ scores for 23 of 49 Atari 2600 games (such as Atari Breakout). In this model, the image of the video game that the player sees is applied to the machine learning algorithm, and the algorithm emits the next play (such as moving the player) to the game.

M. Tim Jones