Mixture of Experts for PINNs (MoE-PINNs)

Physics-Informed Neural Networks (PINNs) [1] have emerged as a popular and promising approach for solving partial differential equations (PDEs). In our latest research, my colleague Michael Kraus and I introduce a new framework called Mixture of Experts for Physics-Informed Neural Networks (MoE-PINNs) [2], which shows great potential on PDEs with complex and varied patterns.
In this post, we discuss the benefits of MoE-PINNs and how they can be easily implemented to solve a wide range of PDE problems. The post is structured as follows:
- Introduction to MoE-PINNs
- Solving Burgers' PDE using MoE-PINNs
- Reducing Hyperparameter Search with Sparsity Regularisation
- Example of Sparsity Regularisation by solving the Poisson PDE on an L-shaped Domain
- Automated Architecture Search with Differentiable MoE-PINNs
To help you gain a better understanding of the concepts discussed in this article, we have provided accompanying notebooks that can be run directly on Colab:
PINNs use physical laws and automatic differentiation to solve Partial Differential Equations (PDEs) with just a few lines of code. However, they are also highly sensitive to their hyperparameters, such as the activation function or the initialisation of the weights, which makes training PINNs a difficult, laborious, and iterative process. The depth of the network and the choice of activation function can greatly impact the solution accuracy. For example, deep networks work well for complex PDEs with varying patterns and discontinuities, such as the Navier-Stokes equations, whereas shallow networks may suffice for PDEs with simpler patterns, like the Poisson equation on a square domain. The sine activation function, with its property of preserving shape under differentiation, may be ideal for high-order differentiation problems. On the other hand, activation functions like swish or softplus may perform better for problems with sharp discontinuities.
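To make this concrete, here is a minimal sketch (not taken from the accompanying notebooks) of how automatic differentiation turns a physical law into a training signal, using the 1D Poisson equation u_xx = f as an example and assuming the model maps collocation points x of shape (N, 1) to predictions u(x):

```python
import tensorflow as tf

def poisson_residual(model: tf.keras.Model, x: tf.Tensor, f: tf.Tensor) -> tf.Tensor:
    """Residual of the 1D Poisson equation u_xx = f at collocation points x."""
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            u = model(x)            # network prediction u(x)
        u_x = inner.gradient(u, x)  # first derivative via automatic differentiation
    u_xx = outer.gradient(u_x, x)   # second derivative via automatic differentiation
    # driving this residual to zero (e.g. with a mean-squared loss) enforces the PDE
    return u_xx - f
```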
But what if your problem requires a combination of these properties? How do we handle a PDE whose solution is smooth and repetitive in one part of the domain, and highly complex with sharp discontinuities in another? This is where the Mixture of Experts (MoE) framework for PINNs comes in. By leveraging multiple networks and a gate that divides the domain, each expert can specialise in a distinct part of the problem, allowing for improved accuracy and a better bias-variance trade-off.
Dividing the problem into smaller sub-problems has many advantages:
- By using several learners on distinct sub-domains, the complexity of the problem is reduced.
- The gate in MoE-PINNs is a continuous function, resulting in smooth transitions between the sub-domains. More complex regions of the domain can therefore be shared amongst several learners, whereas simpler regions can be attributed to a single expert.
- The gate can be any type of neural network, from a linear layer to a deep neural network, which allows it to adapt to different types of domains and divide them in an arbitrary way.
- MoE can be easily parallelised, since only the gate weights λ need to be shared amongst learners on different devices. In theory, each learner could be placed on a distinct GPU.
- By initialising a large number of PINNs with different architectures, the need for labor-intensive hyperparameter tuning is reduced.
MoE-PINN Architecture

Unlike traditional PINNs, which use a single neural network to make predictions, MoE-PINNs employ an ensemble of multiple PINNs that are combined using a gating network.
Just like the PINNs in the ensemble, the gating network is a fully-connected feed-forward network that takes in the spatial coordinates x and y (the exact inputs may differ from PDE to PDE). However, unlike the PINNs, its output is an m-dimensional vector of importances, where m is the number of PINNs in the ensemble. The importances are passed through a softmax function to convert them into a probability distribution, ensuring that they sum to 1.
The final prediction of the MoE-PINN is obtained by aggregating the predictions of each PINN in the ensemble, weighted by their respective importances.
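Writing u_i(x, y) for the prediction of the i-th expert, g_i(x, y) for the raw outputs of the gating network and λ_i(x, y) for the corresponding softmax importances, this aggregation reads:

$$\lambda_i(x, y) = \frac{e^{g_i(x, y)}}{\sum_{j=1}^{m} e^{g_j(x, y)}}, \qquad \hat{u}(x, y) = \sum_{i=1}^{m} \lambda_i(x, y)\, u_i(x, y), \qquad \sum_{i=1}^{m} \lambda_i(x, y) = 1.$$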
MoE-PINNs can be very concisely built in TensorFlow with just a few lines of code:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate
from typing import List

def build_moe_pinn(pinns: List[tf.keras.Model], n_layers: int, n_nodes: int) -> tf.keras.Model:
    x = Input((1,), name='x')
    y = Input((1,), name='y')
    # create predictions with each PINN in the ensemble, shape (batch, m)
    u = Concatenate()([pinn([x, y]) for pinn in pinns])
    # initialise the gating network with n_layers hidden layers of n_nodes each
    gate = tf.keras.models.Sequential(
        [Dense(n_nodes, activation='tanh') for _ in range(n_layers)] +
        [Dense(len(pinns), activation='softmax')],
        name='gate',
    )
    # receive importances from the gating network (fed with the concatenated
    # coordinates), multiply with the PINN predictions and sum up the results
    importances = gate(Concatenate()([x, y]))
    u = tf.reduce_sum(importances * u, axis=1, keepdims=True)
    return tf.keras.Model([x, y], u, name='moe_pinn')
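For completeness, here is a sketch of how the expert PINNs themselves might be built and passed to build_moe_pinn (build_pinn is an illustrative helper, not code from the paper, and the gate size is an arbitrary choice):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Concatenate

def build_pinn(n_layers: int, n_nodes: int, activation) -> tf.keras.Model:
    # a plain fully-connected PINN taking the coordinates (x, y)
    x, y = Input((1,)), Input((1,))
    h = Concatenate()([x, y])
    for _ in range(n_layers):
        h = Dense(n_nodes, activation=activation)(h)
    u = Dense(1)(h)
    return tf.keras.Model([x, y], u)

# a small, diverse ensemble combined by a two-layer gating network
experts = [
    build_pinn(2, 64, 'tanh'),
    build_pinn(2, 64, tf.sin),
    build_pinn(3, 128, 'tanh'),
    build_pinn(2, 256, 'swish'),
]
moe_pinn = build_moe_pinn(experts, n_layers=2, n_nodes=32)
```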
Burgers' Equation
To illustrate the effectiveness of MoE-PINNs, let us look at an example. Burgers' equation is a PDE used to model phenomena such as shock waves, gas dynamics, and traffic flow. In one spatial dimension, it takes the following form:

$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}$$

where ν is the viscosity coefficient.
Burgers' PDE presents an intriguing challenge: its solution transitions from a smooth sine-shaped profile to a sharp, step-like discontinuity over time. This property makes it a perfect benchmark for evaluating the performance of MoE-PINNs.
Training MoE-PINNs on Burgers' equation
Let us initialise a MoE-PINN with 5 PINNs as well as a gating network and train it on Burgers' equation. The experts have the following architectures:
- Expert 1: 2 layers with 64 nodes each and tanh activation
- Expert 2: 2 layers with 64 nodes each and sine activation
- Expert 3: 2 layers with 128 nodes each and tanh activation
- Expert 4: 3 layers with 128 nodes each and tanh activation
- Expert 5: 2 layers with 256 nodes each and swish activation
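For illustration, a training step on the PDE residual might look like the following sketch (assumptions: the ensemble takes the inputs x and t instead of x and y, ν = 0.01/π as in the standard Burgers benchmark, and the initial and boundary condition losses are omitted for brevity):

```python
import numpy as np
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-3)
nu = 0.01 / np.pi  # assumed viscosity, as in the standard Burgers benchmark

@tf.function
def train_step(moe_pinn: tf.keras.Model, x: tf.Tensor, t: tf.Tensor) -> tf.Tensor:
    with tf.GradientTape() as tape:                          # w.r.t. the network weights
        with tf.GradientTape() as outer:                     # for u_xx
            outer.watch(x)
            with tf.GradientTape(persistent=True) as inner:  # for u_x and u_t
                inner.watch([x, t])
                u = moe_pinn([x, t])
            u_x = inner.gradient(u, x)
            u_t = inner.gradient(u, t)
        u_xx = outer.gradient(u_x, x)
        residual = u_t + u * u_x - nu * u_xx                 # Burgers' equation residual
        loss = tf.reduce_mean(tf.square(residual))           # + initial/boundary losses
    grads = tape.gradient(loss, moe_pinn.trainable_variables)
    optimizer.apply_gradients(zip(grads, moe_pinn.trainable_variables))
    return loss

# collocation points on the (assumed) domain x in [-1, 1], t in [0, 1]
x = tf.random.uniform((2048, 1), -1.0, 1.0)
t = tf.random.uniform((2048, 1), 0.0, 1.0)
loss = train_step(moe_pinn, x, t)
```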

More interestingly, we can now inspect how the experts were distributed over the domain and what their individual predictions look like:

Observe how the gating network in MoE-PINNs was able to effectively allocate tasks to each expert based on their respective capacity for modelling different parts of the domain. The experts with fewer layers and nodes were assigned to the smoother regions, which are relatively easy to model, i.e. those close to the initial conditions. Meanwhile, the more complex experts, i.e. those with deeper and wider architectures, were utilised in the regions with discontinuities, where a more expressive model was necessary for an accurate representation. This is particularly evident in the case of expert 3, which was solely dedicated to capturing the discontinuity.
Sparsity Regularisation for reduced hyperparameter search
MoE-PINNs reduce the need for tuning multiple hyperparameters by allowing you to initialise a diverse group of experts. However, one important hyperparameter remains: the number of experts, m. To get the best results, m should be as high as possible while still fitting within memory. But a large number of experts also comes with increased computational costs. To balance these trade-offs, it is important to determine the minimum number of experts needed to divide the physical domain optimally. One way to achieve this is to add a regularisation term that encourages sparsity in the importances λ produced by the gating network.
The regularisation loss can be expressed as a batch-averaged penalty on the gate importances, for instance

$$\mathcal{L}_{\text{reg}} = \frac{1}{|B|} \sum_{(x, t) \in B} \sum_{i=1}^{m} \lambda_i(x, t)^p$$

where B is a batch of collocation points (x, t), and p is a hyperparameter that controls the strength of the regularisation. Values of p below 1 enforce sparsity, while values above 1 push the gate towards a more uniform distribution. To encourage sparsity, a good starting point for p is 0.5.
Finally, to make the training procedure even more efficient, we can use a heuristic to drop experts that the gating network deems unimportant. For example, an expert can be dropped if its average weight in a batch falls below a certain threshold.
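In code, the penalty and the dropping heuristic could look roughly as follows (the gate importances are assumed to be available as a tensor of shape (batch_size, m); the threshold of 0.01 is an arbitrary choice for illustration):

```python
import tensorflow as tf

def sparsity_loss(importances: tf.Tensor, p: float = 0.5) -> tf.Tensor:
    # importances: softmax outputs of the gating network for a batch B,
    # shape (batch_size, m); p < 1 encourages sparse gate weights
    return tf.reduce_mean(tf.reduce_sum(importances ** p, axis=1))

def experts_to_drop(importances: tf.Tensor, threshold: float = 0.01) -> tf.Tensor:
    # heuristic: flag experts whose average importance over the batch
    # falls below the threshold
    mean_importance = tf.reduce_mean(importances, axis=0)  # shape (m,)
    return tf.where(mean_importance < threshold)[:, 0]     # indices of droppable experts
```

In practice, the penalty is added to the physics loss with a weighting factor that has to be chosen alongside p.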
Let us have a look at another example to illustrate this procedure.
Poisson PDE on L-shaped Domain
The Poisson equation is a common tool for modelling physical processes in engineering and the natural sciences. For example, it can be used to solve the elastostatic problem of a rod under a torsion load. To test the sparsity regularisation, let us examine how well MoE-PINNs perform when solving the Poisson equation on a two-dimensional L-shaped domain Ω with homogeneous Dirichlet boundary conditions:

$$-\Delta u(x, y) = f(x, y) \quad \text{in } \Omega, \qquad u(x, y) = 0 \quad \text{on } \partial\Omega$$
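Before training, collocation points have to be sampled inside the domain. Here is a minimal sketch, assuming the L-shaped domain is [-1, 1]² with the upper-right quadrant removed (the actual coordinates used in the example may differ):

```python
import numpy as np

def sample_l_shaped(n_points: int, seed: int = 0) -> np.ndarray:
    # rejection sampling: draw points in [-1, 1]^2 and discard those that
    # fall into the removed upper-right quadrant
    rng = np.random.default_rng(seed)
    points = np.empty((0, 2))
    while len(points) < n_points:
        candidates = rng.uniform(-1.0, 1.0, size=(2 * n_points, 2))
        mask = ~((candidates[:, 0] > 0.0) & (candidates[:, 1] > 0.0))
        points = np.concatenate([points, candidates[mask]])
    return points[:n_points]

collocation_points = sample_l_shaped(10_000)
```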

If an engineer had to subdivide this domain and employ various models, an intuitive choice would be to use a different expert in each of the three quadrants of the L-shaped domain: one in the top left, one in the bottom left, and one in the bottom right corner. It will be interesting to see how the MoE-PINN decides to divide the domain and assign its experts.
Training sparse MoE-PINNs on the Poisson PDE
When initialising an ensemble with four identical experts, the results of the MoE-PINN look as follows:

But much more importantly, we can now inspect the importances that were attributed to the individual experts:

The figure shows that the gating network made the decision, under the influence of the sparsity regularisation, to almost completely drop expert 1 from the ensemble. This resulted in a more efficient and effective division of the domain amongst the remaining PINNs. The network assigned one dominant expert to each of the three quadrants, creating a symmetrical and intuitive distribution.
It is also worth noting that, due to the low average importance of expert 1, this expert could be dropped from the ensemble in a new training run, and the remaining experts could be fine-tuned as a smaller and thus more efficient ensemble.
Differentiable Architecture Search
Finally, we want to make use of the introduced concepts to reduce the time needed for tuning hyperparameters. MoE-PINNs allow us to initialise an ensemble of diverse experts and let the gating network decide, under the sparsity regularisation, which experts should be retained and which can be discarded.
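A diverse candidate pool spanning several depths, widths, and activation functions could, for example, be generated as follows (a sketch reusing the illustrative build_pinn helper from above; the actual grid of experts used in our experiments may differ):

```python
import itertools
import tensorflow as tf

# candidate grid: every combination of depth, width, and activation becomes one
# expert; under the sparsity regularisation, the gate then down-weights
# unsuitable candidates, effectively performing the architecture search
depths = [2, 3, 4]
widths = [32, 64, 128]
activations = ['tanh', tf.sin, 'swish', 'softplus']

experts = [
    build_pinn(n_layers, n_nodes, activation)
    for n_layers, n_nodes, activation in itertools.product(depths, widths, activations)
]
moe_pinn = build_moe_pinn(experts, n_layers=2, n_nodes=32)
```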

Surprisingly, when analysing the importance of experts in diverse ensembles with different activation functions, the gating network consistently discards networks with tanh activations, despite tanh being a commonly used activation in the PINN literature. Conversely, the gating network consistently favours experts with sine activations. This preference suggests that an ensemble of sine-activated networks may enhance PINN performance, which aligns with the idea behind the Fourier transform: that a wide class of functions can be represented as a combination of sine waves of different frequencies.

Looking at ensembles with experts of varying depth and width, a depth of two or three layers appears to be the best choice for PINNs on the Poisson equation, while wider networks seem to be preferable to narrow ones.
Conclusion
In conclusion, MoE-PINNs are a great extension for improving PINNs on PDEs exhibiting varying patterns and for reducing the time needed for hyperparameter tuning, by letting the gating network decide which architectures to utilise from a diverse set of experts.
If you want to try out MoE-PINNs yourself, have a look at the following notebooks:
Thank you for reading to the end of this article! If you found it helpful and would like to use MoE-PINNs or the notebooks in your own work, please use the following citation:
R. Bischof and M. A. Kraus, "Mixture-of-Experts-Ensemble Meta-Learning for Physics-Informed Neural Networks", Proceedings of 33. Forum Bauinformatik, 2022
You can find more information about my colleague Michael Kraus on mkrausai.com and about myself on rabischof.ch.
[1] M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations", Journal of Computational Physics 378 (2019), 686–707.
[2] R. Bischof and M. A. Kraus, "Mixture-of-Experts-Ensemble Meta-Learning for Physics-Informed Neural Networks", Proceedings of 33. Forum Bauinformatik, 2022