Google DeepMind Just Made Small Models Irrelevant with RRTs

Small language models are great for efficiency, but what about their compromised performance?
Illustration by Nalini Nirad

While there’s no shortage of efforts to build powerful LLMs that solve all sorts of challenges, Google DeepMind’s approach to reducing the cost, compute, and resources required to run an LLM is a ray of hope for various environmental and sustainability concerns.

Google DeepMind, in collaboration with KAIST AI, has proposed a method called Relaxed Recursive Transformers, or RRTs.

It allows LLMs to be converted to behave like small language models while still outperforming many of the standard SLMs in the ecosystem today.

Less is More

At the core of RRTs is a technique Google DeepMind calls layer tying. Instead of processing the input through a large stack of distinct layers, the model passes it through a small block of shared layers recursively, approximating the effect of the full stack. This cuts down memory requirements and significantly reduces computational resources.
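To make the idea concrete, here is a minimal PyTorch-style sketch of layer tying. The class, parameter names and sizes are illustrative assumptions, not from the paper; the point is simply that a block of three shared layers applied six times gives an effective depth of 18 while storing only three layers’ worth of weights.

```python
import torch.nn as nn

class RecursiveTransformer(nn.Module):
    """Illustrative layer tying: a small block of shared layers is looped
    several times instead of stacking many distinct layers."""
    def __init__(self, d_model=512, n_heads=8, shared_layers=3, num_loops=6):
        super().__init__()
        # Only `shared_layers` unique layers are stored in memory...
        self.block = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(shared_layers)
        ])
        self.num_loops = num_loops  # ...but they are applied `num_loops` times.

    def forward(self, x):
        for _ in range(self.num_loops):
            for layer in self.block:
                x = layer(x)
        # Effective depth: shared_layers * num_loops; stored layers: shared_layers.
        return x
```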

The “relaxed” part comes from LoRA, or low-rank adaptation. Low-rank matrices are set up to adjust the shared weights, introducing a slight amount of variation so that each repeated pass processes the input somewhat differently.
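A rough sketch of that relaxation, again with made-up names: the full-rank weight is shared across all loops, and each loop index adds its own small low-rank correction, which is cheap to store but lets the tied layers diverge slightly in behaviour.

```python
import torch
import torch.nn as nn

class RelaxedSharedLinear(nn.Module):
    """One shared full-rank weight, plus a small LoRA delta per loop index."""
    def __init__(self, d_in, d_out, num_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)                          # tied across loops
        self.lora_A = nn.Parameter(torch.randn(num_loops, rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_loops, d_out, rank))  # zero init: no delta at start

    def forward(self, x, loop_idx):
        # Effective weight for this loop = W_shared + B_i @ A_i (low-rank, cheap to store).
        delta = self.lora_B[loop_idx] @ self.lora_A[loop_idx]          # (d_out, d_in)
        return self.shared(x) + x @ delta.T
```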

The model is also uptrained, meaning the recursive layers are further trained on additional data so that the shared weights recover the performance lost when the original layers were collapsed.

And that’s not all. RRTs use continuous depth-wise batching, in which multiple inputs are processed simultaneously even when they are at different points in the layer-looping structure: one part of the batch can be in its first loop while another is in its second or third.

If a satisfactory output is generated before an input completes the set number of loops, it can exit the model early, saving computational resources.

“Consequently, batched inference suffers from reduced efficiency, hindering its widespread adoption in practical applications,” said Bae, one of the researchers, explaining why conventional early-exit methods have struggled in batched settings.
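A toy scheduler illustrates how depth-wise batching and early exit can fit together. Everything here (`model.step`, the `confident` predicate, the queue format) is a hypothetical interface, not the paper’s implementation; the point is that sequences at different loop depths share a batch, and an exiting sequence immediately frees its slot.

```python
def depthwise_batched_decode(model, queue, max_loops, confident, max_batch=8):
    """Toy scheduler: sequences in one batch can be at different loop depths.
    `model.step` and `confident` are hypothetical, for illustration only."""
    active, outputs = [], []                  # active holds (state, loop_depth) pairs
    while queue or active:
        # Fill free slots with waiting sequences, which start at loop depth 0.
        while queue and len(active) < max_batch:
            active.append((queue.pop(0), 0))
        # One pass through the shared block advances every active sequence by one loop.
        new_states = model.step([state for state, _ in active])
        still_active = []
        for (state, depth), new_state in zip(active, new_states):
            depth += 1
            if confident(new_state) or depth >= max_loops:
                outputs.append(new_state)     # early exit: the slot is freed immediately
            else:
                still_active.append((new_state, depth))
        active = still_active
    return outputs
```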

Numbers Don’t Lie

The authors compared a large language model with recursive layers to a small language model of similar parameter count. For an uptrained recursive Gemma 1B model converted from a pre-trained Gemma 2B, a 13.5 percentage point absolute accuracy improvement (22% error reduction) was observed on few-shot tasks compared to a non-recursive Gemma 1B model pre-trained from scratch.

The recursive Gemma model, uptrained on just 60 billion tokens, achieved performance parity with the full-size Gemma model trained on 3 trillion tokens.

Like most research, RRTs have their fair share of challenges. Bae said, “Further research is needed to determine the uptraining cost associated with scaling to larger models.”

Currently, the research presents only hypothetical speedup estimates based on an oracle-exiting algorithm, without an actual implementation of an early-exiting algorithm.

“Future work will focus on achieving practical speedup and inference optimisation with real-world early-exiting algorithms,” added Bae. Once these challenges are resolved, it may not take long before RRTs are up and running in real-world applications.

Bae remains optimistic about scaling the approach. “We believe this approach is scalable to significantly larger models. With effective engineering of continuous depth-wise batching, we foresee substantial inference speed improvements in real-world deployments,” he said.

Beyond Meta’s Quantization and Layer Skip

Google DeepMind isn’t the only one exploring ideas to scale down LLMs without compromising on performance. A few days ago, Meta made its own mark by introducing quantised LLMs that can run on devices with less memory.

Both quantisation and RRTs increase the efficiency of LLMs and use LoRA to compensate for performance losses, but they’re not quite the same thing.

Quantisation focuses on reducing the precision of weights to shrink the space and memory the model occupies. For RRTs, the focus is on increasing throughput: the speed at which inputs are processed and outputs are generated.
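A toy example makes the contrast concrete. The symmetric 8-bit scheme below is a generic illustration of quantisation, not Meta’s exact recipe: it trades weight precision for memory, whereas RRTs keep full-precision weights and save by reusing them.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Generic symmetric 8-bit quantisation: store int8 weights plus one scale."""
    scale = w.abs().max().item() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)              # roughly 67 MB in float32
q, scale = quantize_int8(w)              # roughly 17 MB in int8, plus one scalar
print((dequantize(q, scale) - w).abs().mean())   # small reconstruction error
```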

Another of Meta’s efforts is Layer Skip, which skips layers inside an LLM during both training and inference and employs an ‘early exit loss’ to ensure performance isn’t affected.
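A highly simplified sketch of those two ingredients, not Meta’s exact recipe: layers are randomly dropped during a training step, and an auxiliary loss encourages a shared prediction head to decode well even from shallow, early-exit states.

```python
import random

def train_step_with_layer_skip(layers, head, x, target, loss_fn, skip_prob=0.2):
    """Simplified illustration only: randomly skip layers during training and
    add an early-exit loss at every layer that runs, so a shared head learns
    to predict from intermediate states as well as the final one."""
    total_loss, n_terms = 0.0, 0
    for layer in layers:
        if random.random() < skip_prob:
            continue                                       # skip this layer this step
        x = layer(x)
        total_loss = total_loss + loss_fn(head(x), target)  # early-exit loss term
        n_terms += 1
    return total_loss / max(n_terms, 1)
```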

Unlike Layer Skip and quantisation, RRTs rely on sharing parameters and reusing them recursively.

Bae also mentioned that, after optimisation, Layer Skip and quantisation can be applied alongside RRTs.

“We anticipate significant synergy with speculative decoding methods. Specifically, continuous batching in RRTs enables real-time verification of draft tokens, promising substantial speed improvements,” he said. 

Efficient Language Models

We’ve seen a rise in small language models over the past few months, and they can greatly help in applications that do not demand high output accuracy.

With further research and development focusing on improving the performance of SLMs and optimising LLMs, will we reach a point where standard large parameter models seem redundant for most applications? 

Meta’s quantised models, Microsoft’s Phi, HuggingFace’s SmolLM and OpenAI’s GPT Mini indicate strong efforts to build efficient, small models. The Indian AI ecosystem was quick to turn towards SLMs as well. Recently, Infosys and Sarvam AI collaborated to develop small language models for banking and IT applications.

Soon, we’ll certainly also see rising interest in techniques and frameworks that optimise LLMs.

Supreeth Koundinya

Supreeth is an engineering graduate who is curious about the world of artificial intelligence and loves to write stories on how it is solving problems and shaping the future of humanity.