<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Andrii Krutsylo | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/andrii-krutsylo/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/andrii-krutsylo/index.xml" rel="self" type="application/rss+xml"/><description>Andrii Krutsylo</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Andrii Krutsylo</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/andrii-krutsylo/</link></image><item><title>Wrapping Up KALLM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250830-dentonjc/</link><pubDate>Wed, 03 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250830-dentonjc/</guid><description>&lt;p>Large language models today look complicated, but if you peel back the layers, most of what you see is old technology: stacks of linear transformations. The Transformer architecture, the engine behind GPTs and their cousins, is often described as revolutionary. Yet the majority of its parameters are standard linear layers, the same kind of matrix multiplications you would find in a simple multilayer perceptron from the 1980s. For years these layers have gone unchallenged. They are fast, they scale, and they work. But maybe the time has come to ask: &lt;em>can we do better than linear?&lt;/em>&lt;/p>
&lt;p>This project explored exactly that. Instead of leaving those layers untouched, we tried replacing them with a more mathematically structured alternative: &lt;strong>Kolmogorov–Arnold Networks (KANs)&lt;/strong>. The result is a working language model—SmolLM2, a 135-million-parameter Transformer—where the final feedforward blocks no longer consist of brute-force linear weights, but of compact polynomial-based functions. And the striking fact is that performance remained within the baseline range. Smaller KANs managed to match larger linear layers, showing that smarter mathematics can stand shoulder to shoulder with the workhorse of deep learning.&lt;/p>
&lt;h2 id="transformers">Transformers&lt;/h2>
&lt;p>To understand the significance, let’s revisit what a Transformer actually is.&lt;br>
A Transformer block has two main components: attention and feedforward. The attention mechanism computes how each word in a sentence relates to every other word. That is the clever part, and it is what made Transformers famous. But once attention finishes its work, the output is passed into a feedforward network. And this feedforward network is essentially two large linear layers, stacked with a nonlinearity between them.&lt;/p>
&lt;p>Stack thirty such blocks and you get a complete model like SmolLM2. Look at the parameter counts and you see a pattern: attention is not the main consumer; it’s the feedforward layers. They dominate memory and computation, making them the primary target for efficiency gains.&lt;/p>
&lt;h2 id="what-are-kolmogorovarnold-networks">What Are Kolmogorov–Arnold Networks?&lt;/h2>
&lt;p>So what happens if, instead of a giant matrix multiplication, we try something more structured? Enter Kolmogorov–Arnold Networks.&lt;/p>
&lt;p>KANs are built on the Kolmogorov–Arnold representation theorem from the mid-20th century, which proved that any continuous multivariate function can be decomposed into sums of continuous univariate functions. Instead of mixing all inputs together at once, you treat each input dimension separately, applying a small nonlinear function, and then recombine. The beauty is that these univariate functions can be simple but expressive—like splines or polynomials—and yet, when summed, they approximate very complex mappings.&lt;/p>
&lt;p>Think of a KAN layer as a set of individual univariate modules. Each one takes a single variable, bends it according to a chosen basis (polynomials, splines, etc.), and then all those bent versions are added up to produce the output. The richness of the final function depends on two factors:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Choice of basis&lt;/strong>: You can bend with Chebyshev polynomials, with Legendre polynomials, with B-splines, or with other families.&lt;/li>
&lt;li>&lt;strong>Degree&lt;/strong>: This is how many bends you allow. A degree-1 polynomial is just a line. Degree-2 can capture curves. Higher degrees capture higher-order oscillatory components.&lt;/li>
&lt;/ul>
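&lt;p>As a concrete illustration, here is a minimal sketch of such a layer in PyTorch, assuming a Chebyshev (second kind) basis; the class name, the initialization scale, and the tanh input squashing are illustrative choices, not the project’s actual implementation:&lt;/p>

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Toy KAN layer: per-input Chebyshev expansion, then a learned sum."""
    def __init__(self, in_dim, out_dim, degree=2):
        super().__init__()
        self.degree = degree
        # one coefficient per (input, output, basis function) triple
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

    def forward(self, x):
        # squash inputs into [-1, 1], the natural domain of Chebyshev bases
        x = torch.tanh(x)
        # build U_0..U_degree with the recurrence U_{n+1} = 2x*U_n - U_{n-1}
        basis = [torch.ones_like(x), 2 * x]
        for _ in range(self.degree - 1):
            basis.append(2 * x * basis[-1] - basis[-2])
        b = torch.stack(basis[: self.degree + 1], dim=-1)  # (..., in_dim, deg+1)
        # sum the bent univariate pieces into each output
        return torch.einsum("...id,iod->...o", b, self.coef)
```

&lt;p>Each input feature gets its own (degree + 1)-coefficient polynomial per output, and the outputs are plain sums of the bent inputs: exactly the per-dimension-then-recombine recipe described above.&lt;/p>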
&lt;p>A Chebyshev polynomial of the second kind, degree 2, is one such basis: U_2(x) = 4x² − 1. What makes it special is not its shape but the family it belongs to: Chebyshev polynomials are orthogonal on [−1, 1], so low-degree members span function space efficiently without redundancy. This efficiency explains its favorable performance in our experiments: low degree means fewer parameters, but Chebyshev’s properties let the basis approximate more than you might expect from so few numbers.&lt;/p>
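&lt;p>The whole family can be generated from a two-term recurrence, U_0(x) = 1, U_1(x) = 2x, and U_{n+1}(x) = 2x·U_n(x) − U_{n−1}(x), which is also how such bases are typically built inside a KAN layer. A small self-contained sketch:&lt;/p>

```python
def cheby_u(n, x):
    """Chebyshev polynomial of the second kind U_n(x) via the recurrence."""
    u_prev, u_curr = 1.0, 2.0 * x  # U_0 and U_1
    if n == 0:
        return u_prev
    for _ in range(n - 1):
        u_prev, u_curr = u_curr, 2.0 * x * u_curr - u_prev
    return u_curr

# U_2(x) = 4x^2 - 1: three coefficients, roots at +/- 0.5
print(cheby_u(2, 0.5))   # 0.0
print(cheby_u(2, 1.0))   # 3.0
```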
&lt;h2 id="why-small-can-beat-big">Why Small Can Beat Big&lt;/h2>
&lt;p>Linear layers require many parameters because they treat every input–output mapping as arbitrary. KANs assume smoothness: each input passes through a compact polynomial basis before recombination. This structure captures useful patterns with fewer parameters.&lt;/p>
&lt;p>A degree-2 Chebyshev basis, for example, encodes curvature and oscillation efficiently. While a linear layer of the same size must spend parameters to approximate these effects, the polynomial basis includes them inherently. The result is comparable expressivity with fewer parameters. In language tasks where patterns are often smooth or compositional, this structured efficiency translates into competitive accuracy at lower cost.&lt;/p>
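&lt;p>The accounting is easy to check by hand. Per input–output edge, a degree-d KAN actually stores d + 1 coefficients versus a single linear weight; the savings come from being able to use narrower layers for the same expressivity. A toy comparison, with illustrative dimensions rather than SmolLM2’s actual ones:&lt;/p>

```python
def linear_params(n_in, n_out):
    # dense layer: one weight per input-output pair, plus a bias per output
    return n_in * n_out + n_out

def kan_params(n_in, n_out, degree):
    # one (degree + 1)-coefficient polynomial per input-output edge
    return n_in * n_out * (degree + 1)

# illustrative sizes only: a narrower degree-2 KAN undercuts a wide linear layer
print(linear_params(576, 1536))   # 886272
print(kan_params(384, 512, 2))    # 589824
```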
&lt;h2 id="baselines-modifications-and-comparisons">Baselines, Modifications, and Comparisons&lt;/h2>
&lt;p>Here’s what we actually tested, in plain language:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>The untouched baseline&lt;/strong>: a pretrained SmolLM2, with all thirty blocks intact.&lt;/li>
&lt;li>&lt;strong>Linear restart&lt;/strong>: the same pretrained model, but the last five feedforward modules were thrown away and replaced with freshly initialized linear ones. These then had to be trained again.&lt;/li>
&lt;li>&lt;strong>KAN replacement&lt;/strong>: again, take the pretrained model, cut off the last five feedforward modules, and put in new KAN modules instead—specifically, Chebyshev of the second kind, degree 2.&lt;/li>
&lt;/ol>
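&lt;p>In code, the surgery for variants 2 and 3 can be sketched as below. This is a hedged sketch: the &lt;code>model.model.layers&lt;/code> path, the &lt;code>gate_proj&lt;/code> attribute, and the factory interface are assumptions about a LLaMA-style checkpoint, not the project’s exact code:&lt;/p>

```python
import torch.nn as nn

def replace_tail_ffn(model, ffn_factory, n_tail=5):
    """Swap the feedforward module of the last n_tail blocks and
    freeze the pretrained backbone; only the new modules train."""
    blocks = model.model.layers            # assumed layout, LLaMA-style
    for p in model.parameters():
        p.requires_grad = False            # freeze everything first
    for block in blocks[-n_tail:]:
        hidden = block.mlp.gate_proj.in_features  # model width
        block.mlp = ffn_factory(hidden)    # new module maps hidden to hidden
        for p in block.mlp.parameters():
            p.requires_grad = True         # only the tail is trainable
    return model
```

&lt;p>Passing a factory that builds fresh linear modules gives the &amp;ldquo;linear restart&amp;rdquo; baseline; passing a KAN constructor gives the KAN replacement, with everything upstream frozen in both cases.&lt;/p>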
&lt;p>In all three cases, the backbone of the model—the embeddings, the attention layers, and the first twenty-five blocks—was left untouched. Only the tail was modified. This design allowed us to test transfer learning: would the pretrained parts of the model still play nicely with the new pieces? The answer is yes. The attention layers and other linear projections adapted seamlessly, proving that KANs can be swapped in without destabilizing the whole system.&lt;/p>
&lt;p>Training was done on the &lt;strong>smol-smoltalk&lt;/strong> dataset, a small-scale dialogue corpus used for both pretraining and fine-tuning. After training, all models were evaluated on the same subset of &lt;strong>BIG-Bench Hard&lt;/strong> tasks.&lt;/p>
&lt;!--
## How Evaluation Works
The evaluation was not a casual check; it mirrored the structure of the Open LLM Leaderboard, but focused only on the core tasks relevant to dialogue and reasoning (since our training data did not include STEM problems, for example). Each task consists of prompts with clear expected answers: multiple choice, classification, or reasoning questions.
To evaluate, each model is placed in a simulated conversation. The prompt is fed in, the model generates a response, and that response is parsed. If the task is multiple choice, the evaluation extracts which option the model chose, checking against the gold label. If the task is free-form, the answer is normalized—lowercased, punctuation stripped, articles removed—so that small differences in formatting do not count as mistakes. The comparison is then exact. Each task contributes an accuracy score, and the final performance is the average over all selected tasks.
This method ensures consistency. Models cannot bluff with verbose text. They must produce the right short answer. That makes the results directly comparable, and it stresses the reasoning capabilities rather than the stylistic flourishes of the model.
-->
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The baseline was the pretrained SmolLM2 without modification. It achieved an average accuracy of 22.5%, using 134M parameters. This experiment has a single measurement because no training was applied. The rest of the experiments were run with 3 random seeds.&lt;/p>
&lt;p>When retrained with linear replacements, the model reached an average accuracy of 43.8%, with 46M trainable parameters (only the last 5 blocks are trained) and 5.87 GB total VRAM usage.&lt;/p>
&lt;p>Replacing the last five feedforward blocks with Kolmogorov–Arnold Networks produced an average accuracy of 44.1%, with 39M trainable parameters and 5.86 GB VRAM usage. The memory consumption of KAN layers remains a target for further optimization.&lt;/p>
&lt;p>In short, KANs matched or slightly exceeded the reinitialized linear baseline in accuracy, while using fewer parameters and slightly less memory. This demonstrates that structured polynomial layers can substitute for large linear layers without degrading reasoning performance.&lt;/p>
&lt;h2 id="why-transfer-learning-works-so-well">Why Transfer Learning Works So Well&lt;/h2>
&lt;p>One of the surprising outcomes is how cleanly the pretrained Transformer integrates with KANs. Remember: only the feedforward modules in the last five blocks were replaced. All the other linear layers—embedding projections, attention queries, keys, and values, output heads—remained untouched. They continue to function as before. The new KAN blocks slot right in, adapt during training, and the system as a whole behaves coherently.&lt;/p>
&lt;p>That tells us something important. The standard Transformer does not depend on linearity per se in those positions. What it depends on is a nonlinear transformation with enough expressive power. KANs provide that power, just in a different mathematical form. Which means: any pretrained Transformer can, in principle, be retrofitted with KANs in the feedforward slots, with no need to start from scratch.&lt;/p>
&lt;h2 id="looking-ahead-mixing-polynomial-bases">Looking Ahead: Mixing Polynomial Bases&lt;/h2>
&lt;p>So far we only tested one family, Chebyshev-2. But the architecture is more general. Each KAN block can in fact host multiple polynomial families in &lt;strong>parallel&lt;/strong>, or stack them in &lt;strong>sequence&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Parallel&lt;/strong>: imagine splitting the input across several channels, each processed by a different basis. The outputs are then recombined. This way, one basis covers the smooth global structure, while another captures edge effects or oscillations.&lt;/li>
&lt;li>&lt;strong>Sequential&lt;/strong>: here, the output of one polynomial transformation becomes the input of another. You can think of it as layering function approximations, where the second basis corrects the limitations of the first. For example, a spline might give you piecewise smoothness, then a Chebyshev layer on top could adjust the global shape.&lt;/li>
&lt;/ul>
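&lt;p>Both wiring patterns reduce to a few lines of PyTorch. The sketch below uses generic branch modules (any KAN layer, or even plain linear stand-ins, will do); the class name and the equal-split chunking are illustrative assumptions:&lt;/p>

```python
import torch
import torch.nn as nn

class ParallelKAN(nn.Module):
    """Route equal slices of the input through different basis modules,
    then concatenate the results."""
    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)

    def forward(self, x):
        chunks = torch.chunk(x, len(self.branches), dim=-1)
        outs = [branch(c) for branch, c in zip(self.branches, chunks)]
        return torch.cat(outs, dim=-1)

# sequential mixing is just composition: the second basis refines the first
def sequential_kan(first, second):
    return nn.Sequential(first, second)
```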
&lt;p>Both strategies were implemented and promise to extract more expressivity per parameter. Instead of simply making the networks bigger, we can make them &lt;em>smarter&lt;/em>, combining the strengths of different mathematical families. That will be the focus of future work.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The main lesson is this: language models do not need to be built entirely from massive linear matrices. By replacing just a handful of those matrices with compact Kolmogorov–Arnold modules, we achieved the same reasoning accuracy with fewer parameters and less memory. Transfer learning works cleanly. The architecture adapts. And the door is now open to rethink what belongs inside a Transformer block.&lt;/p>
&lt;p>KANs are not just a theoretical curiosity. They are practical, efficient, and compatible with modern large language models. This project showed that replacing linear with polynomial is not only possible, it is competitive. The next step is to push combinations, explore scaling, and see just how far this mathematical alternative can take us.&lt;/p>
&lt;!--
## Bonus 1: Testing KANs as LoRA Replacement (Negative result)
We wanted to test if a degree-2 Chebyshev KAN could replace LoRA’s A and B or act as a nonlinear adapter and still match or beat LoRA at the same or smaller rank. In LoRA, A and B are the two small matrices that make the low-rank update: A maps input features down to a rank-r space, B maps that rank-r space back to output features, and their product BA is a constant ΔW that you add to the frozen weight and can merge at inference. We compared five variants at rank 2 and tracked best accuracy: standard LoRA on the final layer (lora_fc 32.03%), a mergeable KAN that generates A and B from Chebyshev bases over indices (11.69%), a linear low-rank conv adapter (81.86%), the same conv adapter with a SiLU in between (84.80%), and a KAN conv adapter (10.76%). The mergeable KAN lost to standard LoRA, and the KAN conv adapter lagged well behind both the linear and the SiLU controls. Conclusion: in this setup KAN did not deliver LoRA-level accuracy at smaller rank.
## Bonus 2: KAN Layers with LoRA for Domain Transfer (Negative result)
We tested if KANs (Cheby2, degree-2) adapt with LoRA as well as a plain linear MLP when moving from CIFAR-10 to CIFAR-100. We first trained both models on CIFAR-10, then froze them, added LoRA (rank 8) on the hidden layers and a new classifier, and matched the number of trainable parameters (~96k) for a fair test. Both models improved on CIFAR-100, but the linear MLP learned faster and reached higher accuracy (23.10% vs 18.02%). So, in this setup LoRA is less effective for KAN layers than for linear layers.
--></description></item><item><title>Midterm Report: KAN Integration into LLMs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250718-dentonjc/</link><pubDate>Fri, 18 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250718-dentonjc/</guid><description>&lt;p>Imagine if we could make neural networks that are not just more efficient, but smarter in how they learn. That’s the promise behind &lt;strong>Kolmogorov–Arnold Networks&lt;/strong> (KANs)—a fascinating new architecture that replaces the usual &amp;ldquo;weighted sums and activation functions&amp;rdquo; with more mathematical finesse. Instead of processing all inputs in one big lump, KANs treat each input dimension individually, transforming them with elegant functions like B-splines or simpler polynomials. The idea is simple but powerful: do more with less.&lt;/p>
&lt;p>For my project, I set out to explore what happens when we integrate these KAN layers into a lightweight language model called &lt;strong>SmolLM2&lt;/strong>, training and testing it on the &lt;strong>smol-smoltalk&lt;/strong> dataset.&lt;/p>
&lt;h2 id="setting-the-stage-smollm2-meets-kan">Setting the Stage: SmolLM2 Meets KAN&lt;/h2>
&lt;p>The original SmolLM2 has 135 million parameters and 30 transformer blocks—plenty of moving parts. To keep things manageable during the initial phase, I created a mini version of the model with just 3 blocks and a trimmed-down vocabulary. This setup let me test dozens of KAN variations quickly, using a simple text classification task (AGNews) as a playground before moving on to full-scale language modeling.&lt;/p>
&lt;p>Despite starting with a simplified model, I also successfully trained a full 30-block KAN-based SmolLM2. That model even passed challenging language benchmarks with flying colors—matching the performance of the original, linear-layer version. That&amp;rsquo;s a big win.&lt;/p>
&lt;h2 id="what-worked-and-what-didnt">What Worked (and What Didn&amp;rsquo;t)&lt;/h2>
&lt;p>Along the way, I tried out a variety of KAN flavors: spline-based, radial basis functions (RBF), rational functions, and no fewer than eight types of orthogonal polynomials—like Chebyshev, Legendre, and Hermite. Each one brings its own quirks, strengths, and training times.&lt;/p>
&lt;p>Some key takeaways:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Chebyshev (second kind)&lt;/strong> with a low polynomial degree (just 2!) delivered the best speed/accuracy trade-off.&lt;/li>
&lt;li>&lt;strong>Jacobi&lt;/strong> and &lt;strong>Gegenbauer&lt;/strong> polynomials edged slightly ahead in raw accuracy but required much longer training times.&lt;/li>
&lt;li>Replacing each linear layer with a KAN version (keeping parameter count similar) worked fine—but layering them in parallel or sequence didn’t add much.&lt;/li>
&lt;li>A baseline with regular linear layers still performed slightly better (60.8% vs. 60.3%), but KANs showed they can come close with room for optimization.&lt;/li>
&lt;/ul>
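&lt;p>Readers who want to poke at these polynomial families directly can use NumPy’s built-in evaluators (note that &lt;code>numpy.polynomial.chebyshev&lt;/code> implements the first kind, not the second kind used in the final model); a quick comparison of the degree-2 member of each family:&lt;/p>

```python
import numpy as np
from numpy.polynomial import chebyshev, legendre, hermite

x = np.linspace(-1.0, 1.0, 5)
# the coefficient vector [0, 0, 1] selects the degree-2 member of each family
for name, evaluate in [("Chebyshev (1st kind)", chebyshev.chebval),
                       ("Legendre", legendre.legval),
                       ("Hermite", hermite.hermval)]:
    print(name, evaluate(x, [0, 0, 1]))
```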
&lt;h2 id="why-this-matters">Why This Matters&lt;/h2>
&lt;p>What’s compelling is not just that KANs &lt;em>can&lt;/em> work, but that they bring some appealing properties:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Parameter efficiency&lt;/strong>: Good performance with fewer or similarly-sized layers.&lt;/li>
&lt;li>&lt;strong>Flexibility&lt;/strong>: They adapt well to existing hyperparameters—less fine-tuning needed.&lt;/li>
&lt;li>&lt;strong>Stability&lt;/strong>: They run smoothly in fp16 (a lower-precision format), which is critical for efficient training.&lt;/li>
&lt;li>&lt;strong>Potential for richer activations&lt;/strong>: Some existing projects still rely on activations like ReLU or SiLU alongside KANs. But I found KANs alone could learn well without them, opening up more dynamic architectures in the future.&lt;/li>
&lt;/ul>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>With the heavy lifting done (code written, models trained, ideas tested), the remainder of the project focuses on refinement. That means more training on generative tasks, better tuning of polynomial degrees, smarter initialization strategies, and potentially making KAN-based layers more plug-and-play.&lt;/p>
&lt;p>The fact that a fully KAN-powered SmolLM2 can hold its own on tough language benchmarks is more than just a proof of concept. It’s a hint that we might not have to keep scaling models indefinitely to get better performance. Instead, we can get more from each parameter, by changing how the model &lt;em>thinks&lt;/em>.&lt;/p></description></item><item><title>Kolmogorov-Arnold-based Transformer for LLMs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250615-dentonjc/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250615-dentonjc/</guid><description>&lt;p>Project: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/UNL/KALLM">KALLM&lt;/a>&lt;/p>
&lt;p>Proposal: &lt;a href="https://krutsylo.neocities.org/share/pdf/KALLM_Public.pdf" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;p>Mentors:&lt;/p>
&lt;ul>
&lt;li>Sai Suman Lamba Karanam&lt;/li>
&lt;li>Prof. Zahmeeth Sakkaff&lt;/li>
&lt;/ul>
&lt;p>I am modifying existing large language models to make them more efficient by replacing some of their layers with Kolmogorov-Arnold Network (KAN) modules. These KAN layers use compact univariate polynomial approximations, which can reduce parameter count and improve interpretability. The project explores how to integrate these layers into Transformers, and how far we can push this idea by combining or stacking KAN modules with different polynomial bases. The goal is to keep performance competitive while lowering computational costs.&lt;/p>
&lt;p>Beyond just speeding up training, I am exploring several other promising directions. One is testing whether transfer learning remains effective when replacing the linear layers of a pretrained LLM with KAN modules, or when swapping between different KAN configurations. I am also considering curriculum learning strategies that gradually increase KAN complexity during training. I have studied all major KAN implementations, and early experiments with a custom Transformer architecture show encouraging results. However, I have found that most LLMs rely on functional-style activation definitions in PyTorch, which makes it difficult to build a universal wrapper. Because of this, KAN-based models will likely need to be integrated manually on a case-by-case basis.&lt;/p>