Google DeepMind Just Made Small Models Irrelevant with RRTs

Small language models are great for efficiency, but what about their compromised performance?
Illustration by Nalini Nirad

While there’s no shortage of efforts to build powerful LLMs that solve all sorts of challenges, Google DeepMind’s approach to reducing the cost, compute, and resources required to run an LLM is a ray of hope for various environmental and sustainability concerns.

Google DeepMind, in collaboration with KAIST AI, has proposed a method called Relaxed Recursive Transformers, or RRTs.

It allows LLMs to be converted to behave like small language models while still outperforming many of the standard SLMs in the ecosystem today.

Less is More

At the core of RRTs is a technique Google DeepMind calls layer tying. Instead of processing the input through a large stack of distinct layers, the model passes it through a small block of shared layers recursively, approximating the effect of the full stack. This cuts down memory requirements and significantly reduces computational resources.
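To make the idea concrete, here is a minimal PyTorch-style sketch of layer tying. The class, parameter names and sizes are illustrative assumptions, not from the paper; the point is simply that a block of three shared layers applied six times gives an effective depth of 18 while storing only three layers’ worth of weights.

```python
import torch.nn as nn

class RecursiveTransformer(nn.Module):
    """Illustrative layer tying: a small block of shared layers is looped
    several times instead of stacking many distinct layers."""
    def __init__(self, d_model=512, n_heads=8, shared_layers=3, num_loops=6):
        super().__init__()
        # Only `shared_layers` unique layers are stored in memory...
        self.block = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(shared_layers)
        ])
        self.num_loops = num_loops  # ...but they are applied `num_loops` times.

    def forward(self, x):
        for _ in range(self.num_loops):
            for layer in self.block:
                x = layer(x)
        # Effective depth: shared_layers * num_loops; stored layers: shared_layers.
        return x
```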

The “relaxed” part comes from LoRA, or low-rank adaptation. Low-rank matrices are set up to adjust the shared weights, introducing a slight amount of variation so that each repeated pass processes the input somewhat differently.
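A rough sketch of that relaxation, again with made-up names: the full-rank weight is shared across all loops, and each loop index adds its own small low-rank correction, which is cheap to store but lets the tied layers diverge slightly in behaviour.

```python
import torch
import torch.nn as nn

class RelaxedSharedLinear(nn.Module):
    """One shared full-rank weight, plus a small LoRA delta per loop index."""
    def __init__(self, d_in, d_out, num_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)                          # tied across loops
        self.lora_A = nn.Parameter(torch.randn(num_loops, rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_loops, d_out, rank))  # zero init: no delta at start

    def forward(self, x, loop_idx):
        # Effective weight for this loop = W_shared + B_i @ A_i (low-rank, cheap to store).
        delta = self.lora_B[loop_idx] @ self.lora_A[loop_idx]          # (d_out, d_in)
        return self.shared(x) + x @ delta.T
```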

The model is also uptrained, meaning the recursive layers are further trained on additional data so that the shared weights recover the performance lost when the original layers were collapsed.

And that’s not all. RRTs use continuous depth-wise batching, in which multiple inputs are processed simultaneously even when they are at different points in the layer-looping structure: one part of the batch can be in its first loop while another is in its second or third.

If a satisfactory output is generated before an input completes the set number of loops, it can exit the model early, saving computational resources.

“Consequently, batched inference suffers from reduced efficiency, hindering its widespread adoption in practical applications,” said Bae, one of the researchers, explaining why conventional early-exit methods have struggled in batched settings.
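A toy scheduler illustrates how depth-wise batching and early exit can fit together. Everything here (`model.step`, the `confident` predicate, the queue format) is a hypothetical interface, not the paper’s implementation; the point is that sequences at different loop depths share a batch, and an exiting sequence immediately frees its slot.

```python
def depthwise_batched_decode(model, queue, max_loops, confident, max_batch=8):
    """Toy scheduler: sequences in one batch can be at different loop depths.
    `model.step` and `confident` are hypothetical, for illustration only."""
    active, outputs = [], []                  # active holds (state, loop_depth) pairs
    while queue or active:
        # Fill free slots with waiting sequences, which start at loop depth 0.
        while queue and len(active) < max_batch:
            active.append((queue.pop(0), 0))
        # One pass through the shared block advances every active sequence by one loop.
        new_states = model.step([state for state, _ in active])
        still_active = []
        for (state, depth), new_state in zip(active, new_states):
            depth += 1
            if confident(new_state) or depth >= max_loops:
                outputs.append(new_state)     # early exit: the slot is freed immediately
            else:
                still_active.append((new_state, depth))
        active = still_active
    return outputs
```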

Numbers Don’t Lie

The authors compared a large language model with recursive layers to a small language model of similar parameter count. For an uptrained recursive Gemma 1B model converted from a pre-trained Gemma 2B, a 13.5 percentage point absolute accuracy improvement (22% error reduction) was observed on few-shot tasks compared to a non-recursive Gemma 1B model pre-trained from scratch.

The recursive Gemma model, uptrained on just 60 billion tokens, achieved performance parity with the full-size Gemma model trained on 3 trillion tokens.

Like most research, RRTs have their fair share of challenges. Bae said, “Further research is needed to determine the uptraining cost associated with scaling to larger models.”

Currently, the research presents only hypothetical speedup estimates based on an oracle-exiting algorithm, without an actual implementation of an early-exiting algorithm.

“Future work will focus on achieving practical speedup and inference optimisation with real-world early-exiting algorithms,” added Bae. Once these challenges are resolved, it may not take long before RRTs are up and running in real-world applications.

Bae remains optimistic about scaling the approach. “We believe this approach is scalable to significantly larger models. With effective engineering of continuous depth-wise batching, we foresee substantial inference speed improvements in real-world deployments,” he said.

Beyond Meta’s Quantization and Layer Skip

Google DeepMind isn’t the only one exploring ideas to scale down LLMs without compromising on performance. A few days ago, Meta made its own mark by introducing quantised LLMs that can run on devices with less memory.

Both quantisation and RRTs increase the efficiency of LLMs and use LoRA to compensate for performance losses, but they’re not quite the same thing.

Quantisation focuses on reducing the precision of weights to shrink the space and memory the model occupies. For RRTs, the focus is on increasing throughput: the speed at which inputs are processed and outputs are generated.
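A toy example makes the contrast concrete. The symmetric 8-bit scheme below is a generic illustration of quantisation, not Meta’s exact recipe: it trades weight precision for memory, whereas RRTs keep full-precision weights and save by reusing them.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Generic symmetric 8-bit quantisation: store int8 weights plus one scale."""
    scale = w.abs().max().item() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)              # roughly 67 MB in float32
q, scale = quantize_int8(w)              # roughly 17 MB in int8, plus one scalar
print((dequantize(q, scale) - w).abs().mean())   # small reconstruction error
```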

Another of Meta’s efforts is Layer Skip, which skips layers inside an LLM during both training and inference and employs an ‘early exit loss’ to ensure performance isn’t affected.
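A highly simplified sketch of those two ingredients, not Meta’s exact recipe: layers are randomly dropped during a training step, and an auxiliary loss encourages a shared prediction head to decode well even from shallow, early-exit states.

```python
import random

def train_step_with_layer_skip(layers, head, x, target, loss_fn, skip_prob=0.2):
    """Simplified illustration only: randomly skip layers during training and
    add an early-exit loss at every layer that runs, so a shared head learns
    to predict from intermediate states as well as the final one."""
    total_loss, n_terms = 0.0, 0
    for layer in layers:
        if random.random() < skip_prob:
            continue                                       # skip this layer this step
        x = layer(x)
        total_loss = total_loss + loss_fn(head(x), target)  # early-exit loss term
        n_terms += 1
    return total_loss / max(n_terms, 1)
```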

Unlike Layer Skip and quantisation, RRTs rely on sharing parameters and reusing them recursively.

Bae also mentioned that, after optimisation, Layer Skip and quantisation can be applied alongside RRTs.

“We anticipate significant synergy with speculative decoding methods. Specifically, continuous batching in RRTs enables real-time verification of draft tokens, promising substantial speed improvements,” he said. 

Efficient Language Models

We’ve seen a rise in small language models over the past few months, and they can greatly help in applications that do not demand high output accuracy.

With further research and development focusing on improving the performance of SLMs and optimising LLMs, will we reach a point where standard large parameter models seem redundant for most applications? 

Meta’s quantised models, Microsoft’s Phi, HuggingFace’s SmolLM and OpenAI’s GPT Mini indicate strong efforts to build efficient, small models. The Indian AI ecosystem was quick to turn towards SLMs as well. Recently, Infosys and Sarvam AI collaborated to develop small language models for banking and IT applications.

Soon, we’ll certainly also see rising interest in techniques and frameworks that optimise LLMs.

Supreeth Koundinya

Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.