Multi-armed bandits applied to order allocation among execution algorithms

Confused robot observing three one-armed slot machines in Picasso style. Source: DALL-E 2.

Making decisions under uncertainty is a common challenge faced by professionals in various fields, including data science and asset management. Asset managers face this problem when selecting among multiple execution algorithms to carry out their trades. The allocation of orders among algorithms resembles the multi-armed bandit problem that gamblers face when deciding which slot machines to play, as they must determine the number of times to play each machine, the order in which to play them, and whether to continue with the current machine or switch to another. In this article, we describe how an asset manager can best distribute orders among available algorithms based on realized execution cost.

Dummy example

For each order, we take an action a that allocates it to one of K algorithms:

Eq. 1: Set of possible actions to allocate an order to one of K algorithms.
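In standard bandit notation (the exact symbols in the original equation may differ), with the action simply indexing the chosen algorithm, this set can be written as:

$$a \in \{1, 2, \ldots, K\}$$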

The value of action a is the expected execution cost of the corresponding algorithm:

Eq. 2: (Unobserved) expected execution cost for action a, i.e. choosing a certain algorithm.
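One plausible way to write this, using $C_t$ for the realized execution cost of the order allocated at time $t$ and $A_t$ for the chosen action (notation assumed here rather than taken from the original), is:

$$q(a) = \mathbb{E}\left[\, C_t \mid A_t = a \,\right]$$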

Suppose that K = 3 and that the expected execution costs of the three algorithms are

Eq. 3: (Unobserved) expected execution cost for three algorithms.

If we knew the action values a priori, the problem would be trivial to solve: we would always select the algorithm with the lowest expected execution cost. Suppose now that we start allocating orders among the three algorithms as shown in Figure 1.

Figure 1: Example of order allocation among three algorithms and associated execution cost. Source: Author.

We still do not know the action values with certainty, but we do have estimates after some time t:

Eq. 4: (Observed) expected execution cost for action a conditional on the information up until time t.
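A natural sample-average estimator matching this description, with $N_t(a)$ counting how often algorithm $a$ has been used up to time $t$ (again, assumed notation), is:

$$Q_t(a) = \frac{1}{N_t(a)} \sum_{s \le t :\, A_s = a} C_s$$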

We can, for instance, construct the empirical distribution of the execution cost¹ for each algorithm, as shown in Figure 2.

Figure 2: Empirical distribution of execution cost per algorithm after some time t. Source: Author.

Allocating all orders to the algorithm with the lowest expected execution cost may appear to be the best approach. However, doing so would prevent us from gathering information on the performance of the other algorithms. This illustrates the classical multi-armed bandit dilemma:

  • Exploit the information that has already been learned
  • Explore to learn which actions give the best outcomes

The objective is to minimize the average execution cost after allocating N orders:

Eq. 5: Objective function for order allocation problem.
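Written out under the assumption that the realized costs of the N orders are simply averaged, the objective could read:

$$\min \; \mathbb{E}\left[ \frac{1}{N} \sum_{n=1}^{N} C_n \right]$$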

Solving the problem using policies

To solve the problem, we need an action selection policy that tells us how to allocate each order based on current information S. We can define a policy as a map from S to a:

Eq. 6: Definition of an action selection policy.
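A minimal way to express this mapping, with S the current information and a the chosen algorithm as above, is:

$$\pi : S \mapsto a \in \{1, \ldots, K\}$$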

We discuss the best-known policies² for the multi-armed bandit problem, which can be classified into the following categories:

  • Semi-uniform strategies: Greedy & ε-greedy
  • Uncertainty-driven strategies: Upper-Confidence-Bound (optimism in the face of uncertainty) & Thompson sampling (probability matching)

Greedy

The greedy approach allocates all orders to the action with the lowest estimated value. This policy always exploits current knowledge to minimize the immediate expected execution cost:

Eq. 7: Action selection policy for greedy approach.
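As an illustration, here is a short Python sketch of the greedy rule; the estimates below are hypothetical numbers, not values from the article:

```python
import numpy as np

def greedy_action(cost_estimates):
    """Pick the algorithm with the lowest estimated execution cost."""
    return int(np.argmin(cost_estimates))

q_hat = np.array([2.1, 1.7, 2.4])   # hypothetical cost estimates for K = 3 algorithms
print(greedy_action(q_hat))          # -> 1
```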

ε-Greedy

The ε-greedy approach behaves greedily most of the time, but with probability ε it selects uniformly at random among the non-greedy actions:

Eq. 8: Action selection policy for ε-greedy approach.

An advantage of this policy is that every action continues to be sampled, so the cost estimates converge to their true values and the greedy choice identifies the optimal action in the limit.
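A possible implementation of this rule, exploring uniformly among the non-greedy actions as described above (the value of ε is illustrative):

```python
import numpy as np

def eps_greedy_action(cost_estimates, eps=0.1, rng=None):
    """With probability 1 - eps exploit; otherwise pick a non-greedy action at random."""
    rng = rng or np.random.default_rng()
    best = int(np.argmin(cost_estimates))
    if rng.random() < eps and len(cost_estimates) > 1:
        others = [a for a in range(len(cost_estimates)) if a != best]
        return int(rng.choice(others))
    return best

q_hat = np.array([2.1, 1.7, 2.4])    # hypothetical cost estimates
print(eps_greedy_action(q_hat))
```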

Upper-Confidence-Bound

The Upper-Confidence-Bound (UCB) approach selects the action with the lowest estimated cost minus an exploration term that shrinks as the number of times the trading algorithm has been used, i.e. N_t(a), grows. The approach thus selects among the non-greedy actions according to their potential for actually being optimal, taking into account the uncertainty in their cost estimates:

Eq. 9: Action selection policy for Upper-Confidence-Bound (UCB) approach.
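A cost-minimization variant of UCB could look as follows; the exploration coefficient c and the initial round-robin step are modelling choices on my part, not taken from the article:

```python
import numpy as np

def ucb_action(cost_estimates, counts, t, c=2.0):
    """Lower-confidence-bound rule for cost minimization.

    cost_estimates: current average cost per algorithm, Q_t(a)
    counts:         orders allocated so far per algorithm, N_t(a)
    t:              total number of orders allocated so far
    """
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):                  # use every algorithm at least once
        return int(np.argmin(counts))
    bonus = c * np.sqrt(np.log(t) / counts)  # shrinks as N_t(a) grows
    return int(np.argmin(np.asarray(cost_estimates) - bonus))

print(ucb_action([2.1, 1.7, 2.4], counts=[10, 25, 5], t=40))
```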

Thompson Sampling

The Thompson Sampling approach, as proposed by Thompson (1933), places a prior distribution over the action values and updates it to a posterior after each order allocation³. The approach selects actions according to their posterior probability of being the best action:

Eq. 10: Action selection policy for Thompson sampling approach.
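A sketch of Thompson sampling with a Gaussian model for each algorithm's cost; this distributional choice and the posterior parameters below are illustrative assumptions:

```python
import numpy as np

def thompson_action(post_means, post_stds, rng=None):
    """Draw one cost per algorithm from its posterior and pick the lowest draw."""
    rng = rng or np.random.default_rng()
    samples = rng.normal(post_means, post_stds)
    return int(np.argmin(samples))

post_means = np.array([2.1, 1.7, 2.4])    # posterior mean cost per algorithm
post_stds = np.array([0.40, 0.15, 0.60])  # wider posterior = fewer observations so far
print(thompson_action(post_means, post_stds))
```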

Evaluating policies

In practice, policies are commonly evaluated on regret, which is the deviation from the optimal solution:

Eq. 11: Definition of regret as a function of a sequence of actions.

where μ* is the minimal expected execution cost:

Eq. 12: Expected execution cost for choosing the optimal action.
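Combining Eq. 11 and Eq. 12, one common way to write the regret of an action sequence $A_1, \ldots, A_N$ (the original notation may differ) is:

$$\rho_N = \mathbb{E}\left[ \sum_{t=1}^{N} q(A_t) \right] - N \mu^{*}, \qquad \mu^{*} = \min_{a} q(a)$$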

Actions are a direct consequence of the policy, and we can therefore also define regret as a function of the chosen policy:

Eq. 13: Definition of regret as a function of an action selection policy π.

In Figure 3, we simulate the regret for the aforementioned policies in the dummy example. We observe that the Upper-Confidence-Bound approach and Thompson sampling approach perform best.

Figure 3: Simulated regret for different action selection policies for dummy order allocation problem. Source: Author.
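For readers who want to recreate a figure of this kind, below is a compact simulation sketch. The cost distributions, horizon, and hyperparameters are invented for illustration and are not the dummy values used in the article:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([2.1, 1.7, 2.4])   # hypothetical expected cost per algorithm
noise_std, n_orders, eps, c = 0.5, 2000, 0.1, 2.0
mu_star = true_means.min()

def run(policy):
    q, n = np.zeros(3), np.zeros(3)        # cost estimates and allocation counts
    regret = 0.0
    for t in range(1, n_orders + 1):
        if np.any(n == 0):                 # allocate to each algorithm once first
            a = int(np.argmin(n))
        elif policy == "greedy":
            a = int(np.argmin(q))
        elif policy == "eps-greedy":       # uniform exploration for brevity
            a = int(rng.integers(3)) if rng.random() < eps else int(np.argmin(q))
        elif policy == "ucb":
            a = int(np.argmin(q - c * np.sqrt(np.log(t) / n)))
        else:                              # thompson: Gaussian posterior, known noise
            a = int(np.argmin(rng.normal(q, noise_std / np.sqrt(n))))
        cost = rng.normal(true_means[a], noise_std)
        n[a] += 1
        q[a] += (cost - q[a]) / n[a]       # incremental sample-average update
        regret += true_means[a] - mu_star  # expected (pseudo-)regret of the choice
    return regret

for p in ["greedy", "eps-greedy", "ucb", "thompson"]:
    print(f"{p:>10}: cumulative regret ~ {run(p):.1f}")
```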

Allocating orders? Embrace uncertainty!

The simulation results for the dummy example strongly suggest that relying solely on a greedy approach is unlikely to yield optimal outcomes. It is therefore crucial to measure and incorporate the uncertainty in the execution cost estimates when developing an order allocation strategy.

Footnotes

¹ To ensure comparability of the empirical distribution of the execution cost, we need to either allocate similar orders or use order-agnostic cost metrics for evaluation.

² In situations where an algorithm's execution cost depends on the order characteristics, contextual bandits are a more suitable option. For an introduction to this approach, we recommend Chapter 2.9 of Sutton & Barto (2018).

³ We strongly suggest Russo et al. (2018) as an outstanding resource to learn about Thompson sampling.

Additional resources

The following tutorials and lectures were very helpful for my own understanding of multi-armed bandit problems.

Industry

  1. Research Scientist Robert Schapire @ Microsoft
  2. Research Scientist Hado van Hasselt @ DeepMind

Academia

  1. Assistant Professor Christina Lee Yu @ Cornell
  2. Assistant Professor Emma Brunskill @ Stanford

References

[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

[2] Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.

[3] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.

[4] Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450–456.

[5] Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009.


If you're keen on reading more, see a selection of my articles below:

Cost decomposition for a VWAP execution algorithm: Buy-side perspective.

Introduction to Probabilistic Classification: A Machine Learning Perspective

Beyond traditional asset return modelling: Embracing thick tails.

Tags: Algorithmic Trading Machine Learning Multi Armed Bandit Optimal Execution Reinforcement Learning
