Sarvam AI Leaves Everyone at Cypher 2024 Speechless

A few months ago, when we asked Sarvam AI chief Vivek Raghavan if it was too soon to call the company the OpenAI of India, he humbly brushed off the comparison. However, with the recent groundbreaking announcements and demos at India’s biggest AI conference, Cypher 2024, it’s clear that the company has truly arrived. 

OpenAI recently launched its Advanced Voice Mode on ChatGPT, and since then, people have been experimenting with it. Deedy Das from Menlo Ventures used it for a dramatic reenactment, in Hindi, of a scene from the Bollywood movie Dangal.

Another user posted a video on X where ChatGPT was singing a duet with him. The possibilities with the voice feature of ChatGPT are endless. 

While OpenAI’s advanced voice features are impressive, India’s very own Sarvam AI is on a similar track, but with an Indian twist. It is developing voice-based models for the country that understand and speak local languages and dialects the way locals do.

“India is a country that loves to talk and so, fundamentally, we believe that our solutions need to be voice-led, and voice-led in Indian languages,” said Vivek Raghavan, co-founder of Sarvam AI, while speaking at Cypher.

Raghavan previously worked with AI4Bharat, a research initiative based at IIT Madras that focuses on open-source Indian language AI. He also has over a decade of experience at UIDAI, the entity overseeing the Aadhaar identity system in India.

The company recently launched a range of products, including voice-based agents that are accessible via telephone and WhatsApp. The voice-based agents can also be integrated into an app, allowing users to communicate verbally whenever they choose.

At Cypher, Raghavan provided an exclusive demo of the telephonic agent, which functioned as a customer care representative for a shampoo company. The agent engaged in a smooth conversation in ಕನ್ನಡ (Kannada), effectively managing interruptions from the human on the other end and effortlessly understanding colloquial accents.

In another demo, the agent spoke in हिन्दी (Hindi) and was able to book a dental appointment effortlessly. Raghavan added that anyone can build their own voice agent. 

“You can actually create your own agent, and it becomes available in all languages. You can go natural for any kind of campaign or whatever you want to do. You can listen at scale and unlock new ways of engagement. And that’s really what we want to achieve with these voice agents,” he said.

Regarding the costs, Raghavan said that these agents start at INR 1 per minute. He explained that the price of these agents is low because they work with small models. “We can do this because we are using small models, which allow us to keep the latency low. Essentially, we are focused on optimising our work with these models,” he said.
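To put the starting price in perspective, here is a minimal back-of-the-envelope sketch. It assumes simple linear billing at the quoted starting rate of INR 1 per minute; the function name and the campaign figures are hypothetical, and actual pricing tiers or minimums may differ.

```python
def estimate_call_cost_inr(minutes: float, rate_per_minute_inr: float = 1.0) -> float:
    """Estimate the cost in INR for a number of call minutes.

    Assumes plain per-minute billing at the quoted starting rate.
    """
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * rate_per_minute_inr


# Hypothetical campaign: 10,000 calls averaging 3 minutes each.
calls = 10_000
avg_minutes_per_call = 3
total = estimate_call_cost_inr(calls * avg_minutes_per_call)
print(f"Estimated cost: INR {total:,.0f}")  # Estimated cost: INR 30,000
```

At that rate, cost scales directly with talk time, so a campaign's budget comes down to call volume multiplied by average call length.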

Moreover, he explained that these agents allow customers to do interesting things on WhatsApp as well. “You can generate UI on WhatsApp and similar platforms. Not only can you have linear conversations, but also engage in complex discussions to understand products and deliver various content, including voice notes, images, and text. Everything is possible,” he said.

He further added that users will be able to make voice-based payments on WhatsApp, as well as create their own workflows, including multimodal, text, image, and UI options. This functionality can also be integrated into other apps, not just WhatsApp.

Focus on Open Source

Besides launching the voice-based agent APIs, the company has also launched Sarvam 2B.

According to Raghavan, Sarvam 2B belongs to a class of small language models (SLMs) that includes Microsoft’s Phi series, Llama 3 8B, and Google’s Gemma models. He also discussed the trade-off between using large models, like those from OpenAI and Anthropic, and smaller, more efficient models for specific use cases.

“If you want to do something a million times a day, and it’s a very narrow thing that you want to do, a small model is always the answer. It is more accurate, more efficient, dissolves latency, and probably also causes less global warming,” said Raghavan.

The model, available on Hugging Face, is well suited for Indic language tasks such as translation, summarisation and understanding colloquial statements. The startup is open-sourcing the model to facilitate further research and development and to support the creation of applications built on it.

Built by an Indian company using computing resources in India, the model was trained on an internal dataset of 4 trillion tokens and provides efficient representation for 10 Indian languages.

“The model is extremely good at summarisation in Indian languages. If you want to do any NLP task in Indian languages, it can beat models that are much bigger, including Llama 3.1,” said Raghavan.

He explained that Sarvam 2B’s Indic tokenisation is about three times more efficient than anything else. “Actually, compared to Llama 3, it is six times more efficient.”
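Tokenisation efficiency is commonly measured as "fertility", the average number of tokens a tokenizer produces per word; lower is better, since fewer tokens mean less compute per sentence. The sketch below illustrates the metric only. It assumes that this is the sense of "efficient" meant here, and the two toy tokenizers are hypothetical stand-ins, not Sarvam 2B's or Llama 3's actual tokenizers.

```python
from typing import Callable, List


def fertility(tokenize: Callable[[str], List[str]], text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text must contain at least one word")
    return len(tokenize(text)) / len(words)


def byte_tokenize(text: str) -> List[str]:
    # Toy stand-in for a tokenizer with poor Indic coverage:
    # one token per UTF-8 byte of every non-space character.
    return [str(b) for ch in text if not ch.isspace() for b in ch.encode("utf-8")]


def char_tokenize(text: str) -> List[str]:
    # Toy stand-in for a tokenizer with better Indic coverage:
    # one token per non-space Unicode code point.
    return [ch for ch in text if not ch.isspace()]


sample = "भारत एक देश है"  # Hindi: "India is a country"

byte_fertility = fertility(byte_tokenize, sample)
char_fertility = fertility(char_tokenize, sample)
ratio = byte_fertility / char_fertility

print(f"byte-level fertility: {byte_fertility:.2f} tokens/word")
print(f"char-level fertility: {char_fertility:.2f} tokens/word")
print(f"ratio: {ratio:.1f}x")
```

The ratio printed here is purely an artefact of UTF-8, which encodes each Devanagari code point in three bytes; it says nothing about any real model. To compare real tokenizers, the same `fertility` function could be given each model's own tokenize function and a representative corpus for each language.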

There’s No Stopping Sarvam

The startup also announced Shuka 1.0, India’s first open-source audio language model. The model is an audio extension of Llama 8B that supports Indian language voice in and text out and, according to the company, is more accurate than frontier models.

“The audio serves as the input to the LLM, with audio tokens being the key component here. This approach is notably unique. It’s somewhat similar to GPT-4o introduced by OpenAI a couple of months ago,” Raghavan said.

According to Sarvam, the model is 6x faster than Whisper + Llama 3. At the same time, its accuracy across the ten languages is higher than that of Whisper + Llama 3.

This is just the beginning. Sarvam offers a range of other AI models that are available as APIs for developers. These include Mayura for translation, Bulbul for text-to-speech conversion, Saaras for speech-to-text with translation, and Saarika, which focuses on speech-to-text capabilities.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.