Learning to Plan

Ignasi Andres
6 min readApr 18, 2022


Using Reinforcement Learning to play Warhammer 40K.

Photo by Artyom Kabajev on Unsplash


In this blog we have always tried to show different computer science and artificial intelligence algorithms to play wargames. We have seen how to use automated planning to play Warhammer 40K, and how to select the best army. Now, I would like to start a series of posts about using a different technique, Reinforcement Learning. I will also divide the topic in several posts, because this is a very fascinating subject, and I would like to give it full attention, while also creating a good example. As always with this blog, the idea is to first introduce a bit of theory, and in the next posts, use some wargame examples to illustrate how this technique works.

Since we have already presented the 40K problem, we will focus on how we approach this problem to better use Reinforcement Learning. First we will (briefly) describe the theoretical approach, to better understand how to model the wargame problem as a reinforcement learning problem.

Reinforcement Learning

Schema of an agent acting in an environment and receiving a reward
Figure 1: Agent acting in the environment and receiving a reward.

Reinforcement learning is an artificial intelligence technique, in which an agent uses rewards to learn how to act in an environment. Let us break this sentence in parts. First, we have an agent, that is an entity that will have to act in an environment, to perform a task. Second, we have the rewards, that indicate how good or bad can be certain actions in the environment. Rewards can be positive (rewards) or negative (costs), and if we set the goal to maximize the rewards in the end (or minimize the costs) the agent will execute the actions, and obtain the policies, learning which actions are better and which actions are not good for a situation. The learning function will create a policy, that is a mapping between states (where the agent is now) and which actions (what the agent must do) maximize the reward.

As Figure 1 shows, the agent must execute an action to obtain the reward. This is a difference between planners, because most planning algorithms choose an action before the execution already knowing its outcome. Even online planners, that compute a plan and then execute it until the plan is no longer feasible, they compute a whole plan with a fully defined reward function.

As the agent is situated in an environment, the agent do not know what the rewards are, and in some case, it might not even know how the actions will modify the environment. As Russell and Norvig wrote it in their book:

Imagine playing a game where you do not know the rules, and after several moves, the opponent says: you lose!”.

So one of the main problems in reinforcement learning comes from not knowing the environment (or the real underlying distribution), and having to figure it out by performing multiple trials, until the reward values converges. In other words, the agent must keep executing actions in the environment until the rewards do not change, and in this case the agent creates the policy by selecting the actions that maximize the reward.

Example of a problem of Reinforcement Learning using wargames.

Figure 2: (left) an agent deployed in the battlefield; (right) possible actions for the agent.
Figure 2: (left) an agent deployed in the battlefield; (right) possible actions for the agent.

To illustrate these concepts, let us consider the following mini skirmish game (a tiny-mini-wargame): an agent representing an stranded soldier on the battlefields of the far future, must take an action to secure and objective, survive, or escape its situation. The agent can fire, can sit on an objective or it can do nothing. As Figure 2 (left) shows, this soldier is located behind a wall, and has several options. However, having no direct orders from the high command, it is lost and do not know which action it is best. Assuming the agent knows its environments, that is, the problem is fully observable, the soldier has only four options (Figure 2 right):

  1. Move north to secure the objective (represented by a green arrow);
  2. Left cover and engage the unit at the north (orange arrow);
  3. Left cover and engage the lonely unit at the south (blue arrow); and
  4. Do nothing.

So, now the agent has decided to act, and it performs the action. In this case it only sits. Let us assume that it has decided the following, and reports it by radio to its sergeant or another superior officer:

  • Agent: I am in a position at X,Y.
  • Sergeant: Are there enemies near you?
  • Agent: I can sense that there are two enemy units near.
  • Sergeant: Is there an objective near?
  • Agent: There is one objective north of my position, clear of enemies.
  • Sergeant: And what have you decided to do?
  • Agent: I decided to do nothing (4th option).
  • Sergeant: . . .

Please, note that the agent is communicating the action a posteriori, that is, the agent has already performed the action. This is a difference with the planning approach, in which the agent just considers the possible actions evaluating its expected outcomes. In reinforcement learning, the agent acts first and then obtain its reward. The sergeant acts as an example of a reward function. And, according to the conversation, the agent took a neutral action, because the sergeant did not respond anything, neither good nor bad, that is, the action has a reward of 0. Let us break down what happened:

  • The agent is situated in a state (s), and the agent chooses an action (a): to do nothing
  • The agent executes the selected action and modifies the current situation, by entering another state (s’).
  • The agent obtains a reward (r(s’)), according to the outcome of the chosen action (a).
  • Start again.

Now let us consider other actions.

  • Agent: I am in position…bla bla…
  • Agent: I moved north and secure the objective
  • Sergeant: Good!

In this case, moving north to secure the objective is a good action. And it does not involve any possible bad outcome, since the enemy is not moving (yet). In this case, we must assume the reward is a positive one, let us say 1.

In the next example, the agent decides to move north, to engage the unit posted there (I am not posting the conversation again, but you get the idea). Executing this action may be a bad idea, since it is highly probable that the agent ends captured or worse, even if there is a (remote) possibility that the agent wins a high reward if he manages to wound the unit and escape. So this action yields a negative reward. And the same goes for the action to engage the lonely unit in the south, but in this case the probabilities should be better (is it a single space marine? or a captain? an assassin looking for menaces from afar?).

Vindicare assasin from Warhammer 40k.
Vindicare Assassin from Warhammer 40k. In the far future it is better not to approach figures that are sitting alone.

Once the agent performs its action, it finishes its turn, and the reward is computed (positive or negative), and the sergeant collects the info about the environment (or situation/state), the action the agent took and its result. Then, imagine that the sergeant sends another agent to the same situation (if there is one thing the Imperium has in the future is manpower to waste). This new agent has all the same information, plus the action and result of the previous one, and it could either follow the same actions, or try a new action to generate a new reward. This is, in a nutshell, reinforcement learning.

Before ending, please note that this is a didactic example. If we consider the soldier to be the agent, and it manages to get killed, it has learnt nothing! (maybe a bit if it gets captured). So, please bear in mind that I was referring to a situation in a game, and that the agent should be the player that is learning from the result of the actions of the soldier. After several games, the player should know which actions are better and which actions are to be avoided.

Final thought and conclusions

In this post I introduced the technique of reinforcement learning. I presented a short introduction and showed an example of a small problem. But, what if some apparently bad actions yield better rewards in the future? Is it better to sacrifice a winning move to verify if a new actions is potentially a better one? In future posts we will address the tradeoff between exploration vs exploitation, and we will encode this a small problem to reinforcement learning solution for the Warhammer 40K problem we all know and love.



Ignasi Andres

I am a Über nerd, interested in Robotics, Machine Learning and Computer Science in general, as well as Entrepreneurship.