KAN: Why and How Does It Work? A Deep Dive

Last week, while we were at an AI & Physics conference (EuCAIFCon), many of the discussions centered on foundation models and whether it is possible to discover potentially new laws of physics using AI, and lo and behold: the KAN paper¹ recently came out on arXiv, discussing the possibility of discovering/rediscovering physics and mathematical models using neural networks.
I got some time over the weekend to go through parts of this fascinating paper and here we will take a deep dive into the math, formulations, and ideas behind constructing KANs. We will also install KAN to check one of the many amazing results presented in the paper by the MIT researchers.
We will go through the details to get accustomed to the mathematical notation and concepts, and to appreciate why this network is causing such a big fuss!
Let's begin.
KA Representation Theorem:
The KAN, or Kolmogorov-Arnold Network, is based on the representation theorem of the famous mathematicians Kolmogorov and Arnold. So first, let's take a few steps back to understand this theorem.
The KA theorem was introduced by these mathematicians while working on Hilbert's 13th problem: David Hilbert asked whether the solution of a general 7th-degree equation can be expressed as a composition of algebraic functions of two arguments. What does that mean?
Say we have a 7th-order polynomial as below:
x⁷ + ax³ + bx² + cx + 1 = 0   (Eq. 1)
Hilbert asked whether its solution x, considered as a function of the three variables a, b, c, can be expressed as the composition (h = g∘f, i.e., h(x) = g(f(x))) of a finite number of two-variable functions.
The KA representation theorem states that for any continuous function f: [0,1]ᵈ → ℝ, there exist univariate continuous functions g_q, ψ_{p,q} such that:
f(x₁, …, x_d) = ∑_{q=1}^{2d+1} g_q ( ∑_{p=1}^{d} ψ_{p,q}(x_p) )   (Eq. 2)
This means that the (2d+1)(d+1) univariate functions g_q, ψ_{p,q} (there are (2d+1) outer functions g_q and (2d+1)·d inner functions ψ_{p,q}) are enough for an exact representation of a d-variate function⁴. The underlying point here is that the only truly multivariate function is addition, since every other function can be written using univariate functions and sums¹.
Bear in mind that g_q and ψ_{p,q} are both univariate functions. So, any continuous function of several variables can be expressed as a composition of univariate functions and addition. Wow! I didn't know that. Cool! Let's proceed.
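For intuition, here is a toy example (my own, not from the paper): multiplication looks irreducibly two-variate, yet it can be written using only univariate functions and addition:
x·y = (1/4)(x + y)² + (−1/4)(x − y)²
The inner univariate functions are the identity and negation (x, y and x, −y), addition combines them, the outer univariate functions are u ↦ u²/4 and u ↦ −u²/4, and their outputs are again just added, which is exactly the shape of Eq. 2.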
Thinking About KA Theorem & MLPs:
The authors of the paper state that these 1D functions (defined above) can be non-smooth and even fractal, so they may not be learnable in practice. To build KANs, the authors go beyond the strict form of the KA representation theorem, but first let's think about MLPs.
Multi-Layer Perceptrons (MLPs) are based on the universal approximation theorem, which states that any continuous function f: [0,1]ᵈ → [0,1] can be approximated arbitrarily well by a neural network (weights, biases, and a non-linear activation function) with at least one hidden layer containing a finite number of neurons. During training via backpropagation, the network learns to optimize the weights and biases to act as a function approximator, while the activation functions remain fixed.
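To make the contrast concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper) of such an MLP: the weights and biases are the learnable parameters, while the ReLU activations sit on the nodes and stay fixed. The layer sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# A minimal MLP: learnable weights/biases, fixed ReLU activations on the nodes.
# Layer sizes here are illustrative, not taken from the paper.
mlp = nn.Sequential(
    nn.Linear(2, 16),   # learnable W, b
    nn.ReLU(),          # fixed non-linearity
    nn.Linear(16, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)

x = torch.rand(8, 2)    # a batch of 8 points in [0, 1]^2
y_hat = mlp(x)          # forward pass; backprop would update only W and b
print(y_hat.shape)      # torch.Size([8, 1])
```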
Now, can we build a neural-net architecture based on the KA representation theorem discussed above?
If we think about a supervised learning problem where, given {x_i, y_i} pairs, we want to find a function f such that y_i ≈ f(x_i), then the KA representation theorem tells us that all we need to find are the univariate functions g_q, ψ_{p,q} in Eq. 2.
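Before looking at the KAN construction, here is a tiny NumPy sketch of the structure Eq. 2 describes for d = 2. The specific ψ and g below are arbitrary smooth placeholders I picked for illustration (in a KAN they would be learnable); the point is only the "univariate functions + sums, stacked twice" shape.

```python
import numpy as np

# Toy illustration of Eq. 2 for d = 2 inputs:
# f(x1, x2) = sum_{q=1}^{2d+1} g_q( sum_{p=1}^{d} psi_{p,q}(x_p) )
# The psi's and g's are arbitrary univariate functions, not learned splines.

d = 2
n_terms = 2 * d + 1                        # 2d + 1 = 5 outer terms

rng = np.random.default_rng(0)
a = rng.normal(size=(d, n_terms))          # parameters of the inner functions psi_{p,q}
b = rng.normal(size=n_terms)               # parameters of the outer functions g_q

def psi(p, q, x):
    return np.sin(a[p, q] * x)             # inner univariate function psi_{p,q}

def g(q, z):
    return b[q] * np.tanh(z)               # outer univariate function g_q

def f(x):                                  # x has shape (d,)
    inner = [sum(psi(p, q, x[p]) for p in range(d)) for q in range(n_terms)]
    return sum(g(q, inner[q]) for q in range(n_terms))

print(f(np.array([0.3, 0.7])))             # a scalar output
```

Notice the shape: d inputs, a "middle layer" of 2d+1 summation nodes, and a single output node, with all the interesting (univariate) functions living on the edges.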
Here, the authors argue that since we only need to learn univariate functions, we can parametrize each 1D function as a B-spline curve (check below), with learnable coefficients of local B-spline basis functions. This leads to the prototype KAN, illustrated in Figure 1 (Model (shallow) (b)): with input dimension n=2, it appears as a two-layer neural network with activation functions placed on edges instead of nodes (simple summation is performed on the nodes), and with width 2n+1 in the middle layer. The construction of the network, i.e., the number of activations, nodes, etc., will become clear soon. Given the definition in Eq. 2, the original KA representation can be thought of as depth 2 with each layer having (2d+1) terms (sorry