Machine learning and gaming

Artificial intelligence has been applied to games since the very early days — from traditional games like checkers to modern-day real-time strategy games like StarCraft II. In this article, we’ll explore how AI has been applied to turn-based and real-time games and what’s on the cutting edge of applied machine learning for games.

Games have a long history in the co-evolution of AI and machine learning. The construction of synthetic worlds in games has served as a useful testbed for AI algorithms. Additionally, applying AI in games permits the construction of competitive opponents to challenge players.

Early game AI

The first known example of AI in gaming was inspired as a demonstration of computing at the Festival of Britain. The custom-built Nimrod computer by Ferranti was introduced in 1951 and used the game of Nim to demonstrate its mathematical capabilities. Nim is a two-player game where players alternate in removing one to three objects from a collection of objects. In one variation, the player that removes the last object is the winner. The Nimrod used 480 vacuum tubes, the precursor to transistors, for processing and display.

While Nimrod was custom-designed for playing Nim, Ferranti believed that building a machine to play a relatively complex game would translate into solving complex problems. There was no program in the Nimrod, but rather a hardwired set of logic to play. Nimrod lacked conventional AI, but its heuristics to play Nim made it competitive to play, which surprised and frightened players and onlookers.

Complex games

One of the earliest applications of traditional machine learning in games was implemented by Arthur Samuel of IBM (who coined the term “machine learning”). In 1956, Samuel took on the challenge of checkers, which includes simple play as well as complex strategy. His development on IBM’s first commercial scientific computer, the IBM 701, created two key concepts in machine learning for games. The first was the invention of alpha-beta pruning.

Patrick Winston described the basic idea of alpha-beta like this: “If you have an idea that is surely bad, don’t take the time to see how truly awful it is.”

Alpha-beta pruning is a tree-search optimization for the Minimax algorithm used to minimize the number of nodes evaluated within a tree — in this case, in the context of two-player games. In a nutshell, the Minimax algorithm attempts to maximize the minimum gain as the computer chooses a move (represented as a tree of the game state space, where each level of the tree is the player or computer’s move). Alpha-beta prunes away portions of the tree from evaluation given that they don’t improve a player’s reward. In the example shown in Figure 1, only the middle move will advance the X player toward a win (assuming both players play optimally).

minimax tree for tic-tac-toe

Figure 1. Minimax tree for Tic-Tac-Toe

The second key concept invented by Samuel was self-play with rote learning. Samuel pitted his program against itself to improve its play, as well as remembering each move it had encountered (dutifully stored on magnetic tape), along with its terminal “reward” value to optimize play. His program played checkers at the amateur level, but introduced concepts still in use today.

Forty years later, the alpha-beta search approach returned to the game of chess. In 1996, IBM’s Deep Blue defeated Garry Kasparov, a grandmaster in chess. Deep Blue performed searches of the game-state tree, pruned with alpha-beta pruning, but in parallel to increase the speed to identify the computer’s next move (searching 200 million positions per second up to 20 moves in the future).

In 2015, Google applied machine learning to the game of Go, using deep learning (neural network) for move selection and Monte Carlo search for applying previously learned moves. AlphaGo beat a human professional player in 2015 and the No. 1 ranked player in 2017. Interestingly, AlphaGo reapplied self-play to strengthen its play (created by Samuel for checkers almost 60 years before), as well as storing previously known moves.

Video game AI

Early video games lacked any concept of AI in their game opponents, relying on state machines for predictable movement (such as in Space Invaders). In 1980, Pac-Man added some complexity to the enemies for path-finding to the player (or away from the player in the case of escape). Each enemy also included some variability in personality to make less predictable.

Now let’s explore some of the traditional methods that are applied to video games.

Finite state machines

Finite state machines (FSMs), while not really part of AI, have been a part of games for a long time, providing the basis for what appears to be intelligent behavior. A finite state machine is a model for computation that permits changing states based upon some external input. In the context of games, FSMs provide varying behaviors for game entities. Figure 2 illustrates a non-player character (NPC) that can be assigned a task by the player. Once assigned, the NPC carries out the task based varying behaviors necessary to complete it.

finite state machine for an rts worker npc

Figure 2. Finite state machine for an RTS worker NPC

The FSM shown in Figure 2 represents a worker NPC that is assigned a resource gathering task by the player. After it’s assigned, the NPC finds the desired resource, gathers it, then delivers the resource to a collection point. This process continues until all resources are exhausted, where the NPC will become idle.


A common element of games is the ability for NPCs to navigate their environment. This capability is typically called path-finding and has a number of solutions. One of the most common is called A. The A algorithm is a variant of Dijkstra’s graph-based path-finding algorithm and views the locations an NPC can go as nodes in a graph (see Figure 3).

A* operates with two lists: an open list and a closed list. The open list contains all nodes not yet visited, and the closed list contains those nodes that have been visited. Starting with a node from the open list, the algorithm finds those nodes accessible from this node and adds them to the open list (unless they’ve already been added to the closed list). A score computed for the current node based upon a value for the node (some nodes could be more expensive to travel through than others) and a heuristic from the current node (minimum distance from the node to the goal). This process continues until the goal is reached. Once there, the algorithm simply follows the decreasing score of the node back to the starting node to determine the path.

implements graph search to identify the best path

Figure 3. A* implements graph search To identify the best path

A* is one option for path-finding, but other algorithms exist. An original path-finding algorithm in applications with large maps and small CPUs would cluster the environment into larger nodes. This approach, called hierarchical path-finding, resulted in fewer nodes, which minimized the search required.


Relic Entertainment’s Age of Empires included a rules-based system that drove the behavior of the opponent. Age of Empire is a real-time strategy game where the player builds a civilization, gathering resources, building technologies, building a military, and then defeating an opponent. Age of Empires developed a rules-based system to control the enemy civilizations.

Decision trees are a related approach to rules-based systems. Decision trees encode decisions (or tests) to consequences (such as the action an NPC takes in a game).

In a rules-based system, the rules make up the knowledge base that encodes an expert’s knowledge. The inference engine will scan the rules to identify those rules that can be satisfied and resolve any conflicts that emerge (if multiple rules can be executed, pick one using conflict resolution). Finally, the resolved action is executed. The following example shows a rule that identifies that if a castle can be built (the condition), then a castle should be built (the action).

  (can-build castle)
  (build castle)

The determination of whether a castle can be built is based upon the condition can-build, which validates that the necessary resources for a castle are available. If the condition is satisfied, the action is taken (using the form condition => action).

Procedural generation

Content within games like their environments are important for gameplay, and certain types of games (such as Rogue) enhance replayability by randomly creating the environment for each level. Games like Minecraft create their massively detailed worlds through procedural generation, and other games like Dwarf Fortress create not only their environment but also the history and lore for that instance of the game.

Approaches to procedural generation are wide and varied, but you can find fractals as well as use of noise sources, such as Perlin noise, to create more natural textures or landscapes. The use of procedural generation for landscapes has become so common games that it has its own name: procedural terrain generation.

An interesting recent twist on procedural terrain generation is in the use of style transfer. Using deep convolutional neural networks, deep learning CNNs, and a particular art style (such as Van Gogh’s “Starry Night”), a terrain can be automatically transferred to a new style. The terrain is used as input to the CNN, which was previously trained using Van Gogh’s style with the output being a new scene filtered from the original. This is a time-expensive process, so it’s not something that can be done in real time, but is used to render for a given style during game development.

Simple learning AI

Recall that machine learning is often used as a test-bed for algorithms. Let’s now look at how machine learning algorithms have been applied to simple video games.

Atari’s Pong appeared in 1972 and was one of the first successful video games simulating the game of pingpong. This simple game used paddles to “hit” the ball with the paddle split into eight segments to change the angle at which the ball was repelled.

Andrej Karpathy applied deep reinforcement learning (RL) to build a Pong player (as shown in Figure 4). In RL, the the algorithm senses the environment and performs an action in response (assuming a delayed reward). In this implementation, the RL’s environment is the differences between two snapshots of the game screen in time (pixel differences) which the neural network observes as motion. After each move, the reward from the environment is 1 if the opponent misses the ball, -1 if the AI player misses the ball, and otherwise the reward is 0.

The simple neural network includes a hidden layer of 200 units, which accepts the delta of the image (environment) input. The input is feed forward to the output layer, which then determines the probability for moving the paddle up or down. As the reward is captured, the network is updated in a supervised learning fashion using policy gradients.

building a pong player with a neural network

Figure 4. Building a Pong player with a neural network

A similar approach was applied to a more complicated game: Super Mario by Nintendo (see Figure 5). But rather than pool pixels down to a reduced version and feed the image deltas into a feed-forward neural network, Mario was solved with a deep convolutional neural network (CNN) and Q-learning. In addition to the current image of the video game screen, the previous four images were input, in addition to the prior controller inputs. The outputs represented the controller inputs to the game.

learning to play Super Mario with deep Q-learning

Figure 5. Learning to play Super Mario with Deep Q-learning

The AI was rewarded not only for finishing the game but also when it moved right (as opposed to left, which was negatively rewarded). The CNN in this example was very deep and constituted around 6.6 million parameters.

Real-time play complex AI

The final example we’ll explore is the development of ML players of the game Dota 2. Dota 2 is a real-time strategy game where two teams of five players each work to destroy a large structure called an ancient. Each team works to destroy the other team’s ancient while protecting their own. The game is made up of different heroes who have differing play styles (carry vs. support). Heroes level up to increase their ultimate skills and acquire gold to purchase items that have special properties. An interesting property of Dota 2 is that it requires significant teamwork and cooperation in order to win.

OpenAI Five developed a Dota 2 AI that was able to defeat the reigning Dota 2 e-sports world champions in April 2019. Their AI was created from scratch and included no inherent knowledge of the game or heuristics to guide its play. It learned to play the game through self-play, starting with random weights for its neural networks. The trained AIs (of which there are five for each hero) played 80 percent of the games against each other and 20 percent on past AIs. Using 128,000 CPU cores and 256 GPUs, the AIs collectively play 180 years’ worth of games every day. Each AI performs around 20,000 moves per game, meaning a long horizon to gather fitness information about an AI instance.

OpenAI Five is actually five different neural networks that coordinate through a hyperparameter called team spirit. Team spirit indicates whether the individual AIs should focus on their individual reward (such as killing other heroes and collecting treasure) or focus on the overall team reward.

The AI can be split into three sections: perception, processing, and action selection (see Figure 6). Perception includes the various inputs from the environment fed into an embedding that is max-pooled to feed into the processing center. The long short-term memory (LSTM) is a recurrent neural network that is Turing complete in that it can compute anything similar to a general-purpose compute. LSTMs are powerful at learning sequences and are, therefore, ideal in this application. The LSTM then outputs to a set of action networks that choose an action to perform (along with other pertinent information, such as target). This takes into consideration the actions that can be selected in addition to the attention of the particular hero.

Dota-2 hero AI for long horizon learning

Figure 6. Dota 2 hero AI for long horizon learning

The inputs to the network consist of around 20,000 real values, so the perception space of the agent is quite large. The actions that can be selected is also quite large, around 170,000 total unique actions (with about 1,000 that are typically possible, minimized due to cooldowns, items not held, and so on).

The OpenAI Five used proximal policy optimization for learning, which has been found to be simple to use and has good performance in production applications.

Looking forward

AI is changing the way we play games, as well as how games are made. The methods applied have evolved, but some approaches remain constant to this day, such as Arthur Samuel’s idea of self-play to building learning agents. From checkers and the IBM 701 to complex real-time games trained using distributed networks of CPUs and GPUs, machine learning is an integral part of games and a test-bed for future machine learning approaches.