Not long ago, running a large language model (LLM) meant relying on massive graphics processing units (GPUs) and expensive hardware. Now, however, things are starting to change. A new wave of smaller, more efficient LLMs is emerging, capable of running on a single GPU without compromising on performance. These models are making high-end AI more accessible, reducing dependency on large-scale infrastructure, and reshaping how AI is deployed.
As Bojan Tunguz, former NVIDIA senior software system engineer, quipped, “Blessed are the GPU poor, for they shall inherit the AGI.”
The past week has seen a flurry of AI announcements. Mistral’s latest model, Small 3.1, Google’s Gemma 3, and Cohere’s Command A all claim to match the performance of proprietary models while requiring fewer compute resources.
These models enable developers, small businesses, and even hobbyists with consumer-grade hardware (e.g., a single NVIDIA RTX card) to run advanced AI models locally.
Moreover, running LLMs locally on a single GPU reduces reliance on cloud providers like AWS or Google Cloud, giving businesses more control over their data and privacy. This is critical for industries handling sensitive information and regions with limited internet access.
What Makes Them Special?
Mistral Small 3.1 features improved text performance, multimodal understanding, and an expanded context window of up to 128k tokens. The company said the model outperforms comparable models like Google’s latest release, Gemma 3, and GPT-4o mini while delivering inference speeds of 150 tokens per second.
However, one of the most notable features of the model is that it can run on a single RTX 4090 or a Mac with 32 GB RAM, making it a great fit for on-device use cases. The company said that the model can be fine-tuned to specialise in specific domains, creating accurate subject matter experts. This is particularly useful in fields like legal advice, medical diagnostics, and technical support.
On the other hand, Google claims that Gemma 3 outperforms Llama 3-405B, DeepSeek-V3, and o3-mini in preliminary human preference evaluations on the LMArena leaderboard. Like Mistral Small 3.1, it can also run on a single GPU or a tensor processing unit (TPU).
“Compare that to Mistral Large or Llama 3 405B, needing up to 32 GPUs—Gemma 3 slashes costs and opens doors for creators,” said a user on X. Notably, a single NVIDIA RTX or H100 GPU is far more affordable than multi-GPU clusters, making AI viable for startups and individual developers.
Gemma 3 27B achieves its efficiency by running on a single NVIDIA H100 GPU at reduced precision, using 16-bit floating-point (FP16) operations, a common optimisation in modern AI models.
LLMs typically use 32-bit floating-point (FP32) representations for weights and activations, requiring huge memory and compute power. Quantisation reduces this precision to 16-bit (FP16), 8-bit (INT8), or even 4-bit (INT4), significantly reducing model size and accelerating inference on GPUs and edge devices.
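To make the numbers concrete, here is a minimal PyTorch sketch of symmetric INT8 quantisation; the tensor size and scaling scheme are purely illustrative, not any particular model’s production pipeline.

```python
import torch

def quantise_int8(weights: torch.Tensor):
    """Symmetric per-tensor INT8 quantisation: store int8 values plus one FP32 scale."""
    scale = weights.abs().max() / 127.0                  # map the largest magnitude to 127
    q = torch.clamp((weights / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale                   # approximate reconstruction

w = torch.randn(4096, 4096)              # FP32 weight matrix: 4 bytes per value, ~64 MB
q, scale = quantise_int8(w)              # INT8 copy: 1 byte per value, ~16 MB
print(w.element_size(), q.element_size())                # 4 vs 1 bytes per value
print((dequantise(q, scale) - w).abs().max())            # small rounding error
```

Halving precision to FP16 (`w.half()` in PyTorch) works the same way in memory terms, which is roughly what running Gemma 3 27B at 16-bit on a single H100 amounts to.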
Regarding the architecture, Gemma 3 employs a tied language model (LM) head: its output projection reuses the input word embedding matrix rather than storing a separate set of weights, as indicated by its linear layer configuration.
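Weight tying is simple to express in code. The following PyTorch sketch uses hypothetical dimensions and is not Gemma 3’s actual implementation, but it shows how the output projection and the embedding table share one parameter tensor.

```python
import torch.nn as nn

class TinyTiedLM(nn.Module):
    """Toy decoder illustrating a tied LM head."""
    def __init__(self, vocab_size: int = 32000, d_model: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # tie: both layers share one weight matrix

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, d_model) produced by the transformer body (omitted)
        return self.lm_head(hidden_states)        # logits over the vocabulary
```

Tying the two saves a vocab_size × d_model block of parameters, which at large vocabulary sizes is a meaningful share of a small model’s memory footprint.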
Similarly, Cohere recently launched Command A, a model that delivers top performance with lower hardware costs than leading proprietary and open-weight models like GPT-4o and DeepSeek-V3.
According to the company, it is well-suited for private deployments, excelling in business-critical agentic and multilingual tasks while running on just two GPUs, whereas other models often require up to 32.
“With a serving footprint of just two A100s or H100s, it requires far less compute than other comparable models on the market. This is especially important for private deployments,” the company said in its blog post.
It offers a 256k context length—twice that of most leading models—allowing it to process much longer enterprise documents. Other key features include Cohere’s advanced retrieval-augmented generation (RAG) with verifiable citations, agentic tool use, enterprise-grade security, and strong multilingual performance.
Microsoft recently launched Phi-4-multimodal and Phi-4-mini, the latest additions to its Phi family of small language models (SLMs). These models are integrated into Microsoft’s ecosystem, including Windows applications and Copilot+ PCs.
Earlier this year, NVIDIA launched a compact supercomputer called DIGITS for AI researchers, data scientists, and students worldwide. It can run LLMs with up to 200 billion parameters locally, and with two units linked together, models twice the size can be supported, according to NVIDIA.
Moreover, open-source frameworks facilitate running LLMs on a single GPU. Predibase’s open-source project, LoRAX, allows users to serve thousands of fine-tuned models on a single GPU, cutting costs without compromising speed or performance.
LoRAX supports a number of LLMs as base models, including Llama (and Code Llama), Mistral (and Zephyr), and Qwen.
It features dynamic adapter loading, instantly merging multiple adapters per request to create powerful ensembles without blocking concurrent requests. Heterogeneous continuous batching packs requests using different adapters into the same batch, ensuring low latency and stable throughput.
Adapter exchange scheduling optimises memory management by asynchronously preloading and offloading adapters between GPU and CPU memory. High-performance inference optimisations, including tensor parallelism, pre-compiled CUDA kernels, quantisation, and token streaming, further improve speed and efficiency.
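As a rough illustration of how this looks from the client side, the sketch below assumes a LoRAX server already running locally; the endpoint, JSON fields, and adapter name follow the text-generation-inference-style API that LoRAX is built around, but they should be treated as assumptions rather than a verified reference.

```python
import requests

# Hypothetical local LoRAX deployment (the URL and adapter name are placeholders).
LORAX_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, adapter_id: str | None = None) -> str:
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 64}}
    if adapter_id:
        # Each request can name a different fine-tuned LoRA adapter; LoRAX loads it
        # on demand and batches it alongside requests for other adapters.
        payload["parameters"]["adapter_id"] = adapter_id
    resp = requests.post(LORAX_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# A base-model request and an adapter-specific request against the same server.
print(generate("Summarise this contract clause:"))
print(generate("Summarise this contract clause:", adapter_id="my-org/legal-summariser"))
```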
Running LLMs Without a GPU?
A few days ago, AIM spoke to John Leimgruber, a software engineer from the United States with two years of engineering experience, who managed to run the 671-billion-parameter DeepSeek-R1 model without GPUs. He achieved this by running a quantised version of the model on a fast NVM Express (NVMe) SSD.
Leimgruber used a quantised, non-distilled version of the model, developed by Unsloth AI—a 2.51 bits-per-parameter model, which he said retained good quality despite being compressed to just 212 GB.
However, the model is natively trained in 8-bit precision, which makes it comparatively efficient by default.
Leimgruber ran the model on his gaming rig, which has 96 GB of RAM and 24 GB of VRAM, after disabling its NVIDIA RTX 3090 Ti GPU.
He explained that the “secret trick” is to load only the KV cache into RAM while allowing llama.cpp to handle the model files using its default behaviour—memory-mapping (mmap) them directly from a fast NVMe SSD. “The rest of your system RAM acts as disk cache for the active weights,” he said.
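A minimal sketch of that kind of setup using the llama-cpp-python bindings is shown below; the GGUF file path and context size are illustrative, and memory-mapping is llama.cpp’s default behaviour, made explicit here for clarity.

```python
from llama_cpp import Llama

# Hypothetical path to a quantised GGUF file stored on a fast NVMe drive.
MODEL_PATH = "/nvme/models/deepseek-r1-ud-q2_k_xl.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=0,      # CPU-only: no layers offloaded to a GPU
    n_ctx=8192,          # the KV cache for this context window lives in system RAM
    use_mmap=True,       # memory-map the weights straight from the SSD (the default)
    use_mlock=False,     # don't pin the whole model in RAM; let the OS page cache manage it
)

out = llm("Explain memory-mapped model loading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

With this arrangement, only the weights that are actually touched get paged into RAM, and the operating system’s disk cache does the rest, which is essentially the behaviour Leimgruber describes.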
With LLMs now running on a single GPU—or even without one—AI is becoming more practical for everyone. As hardware improves and new techniques emerge, AI will become even more accessible, affordable, and powerful in the years ahead.