OpenAI Releases New Audio Models to Power Voice Agents

The company said these advancements stem from reinforcement learning techniques and extensive training with diverse audio datasets.

OpenAI has launched new speech-to-text and text-to-speech models in its API, providing developers with tools to build advanced voice agents. These models improve transcription accuracy and introduce customisation options for generated speech.

The new speech-to-text models, gpt-4o-transcribe and gpt-4o-mini-transcribe, achieve lower word error rates and better language recognition than the earlier Whisper models.
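For developers, a transcription call could look roughly like the sketch below. It uses only the Python standard library and builds a multipart request for OpenAI's `/v1/audio/transcriptions` endpoint. The helper names (`build_transcription_request`, `transcribe`), the generic `application/octet-stream` content type, and the choice of the mini model as the default are assumptions made for illustration, not details from OpenAI's announcement.

```python
import json
import os
import urllib.request
import uuid

TRANSCRIBE_URL = "https://api.openai.com/v1/audio/transcriptions"


def build_transcription_request(audio_bytes, filename, api_key,
                                model="gpt-4o-mini-transcribe"):
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    # Form field that selects the speech-to-text model.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")
    # Form field that carries the raw audio file.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = model_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        TRANSCRIBE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(path):
    """Send the audio file at `path` and return the transcribed text."""
    with open(path, "rb") as audio_file:
        request = build_transcription_request(
            audio_file.read(),
            os.path.basename(path),
            os.environ["OPENAI_API_KEY"],  # assumes the key is set in the environment
        )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["text"]


# Example (needs network access and a valid key):
# print(transcribe("meeting.wav"))
```

Keeping request construction separate from sending makes the payload easy to inspect without touching the network. In practice, most developers would probably reach for the official SDK instead.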

In its blog post, OpenAI said these advancements stem from reinforcement learning techniques and extensive training with diverse audio datasets. The models aim to improve transcription reliability in noisy environments, varying speech speeds, and different accents.

“Our latest speech-to-text models achieve lower word error rates across established benchmarks, reflecting improvements in transcription accuracy and language coverage,” OpenAI said.
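Word error rate, the metric OpenAI cites, is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below illustrates that standard definition. The function name and the simple lowercase-and-split normalisation are assumptions, and published benchmarks typically apply more careful text normalisation than this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row of the table.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # word missing from hypothesis
                current[j - 1] + 1,                   # extra word in hypothesis
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# One dropped word out of six reference words gives a WER of 1/6.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower value means fewer mistakes per reference word. A perfect transcript scores 0.0.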

Developers can now also control how the text-to-speech model speaks. The gpt-4o-mini-tts model allows developers to instruct the model to adopt different speaking styles, such as mimicking a customer service agent. This feature expands use cases in customer interactions and creative storytelling. However, OpenAI clarified that these models are limited to synthetic preset voices.
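As a rough sketch of how that steering might be used, the snippet below builds a request to OpenAI's `/v1/audio/speech` endpoint with an `instructions` field describing the desired speaking style. The helper names, the `coral` preset voice, and the sample instruction text are assumptions chosen for illustration. Only the model name and the idea of instructing the voice come from the announcement.

```python
import json
import os
import urllib.request

SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, instructions, api_key, voice="coral"):
    """Build (but do not send) a text-to-speech request with style instructions."""
    payload = {
        "model": "gpt-4o-mini-tts",
        "input": text,                 # the words to be spoken
        "voice": voice,                # one of the synthetic preset voices (assumed default)
        "instructions": instructions,  # how the model should speak
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, instructions, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to `out_path`."""
    request = build_speech_request(
        text,
        instructions,
        os.environ["OPENAI_API_KEY"],  # assumes the key is set in the environment
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())


# Example (needs network access and a valid key):
# synthesize(
#     "Thanks for calling. How can I help you today?",
#     "Speak like a calm, sympathetic customer service agent.",
# )
```

The same input text can be paired with different instructions, for example a bedtime-story narrator, to get a different delivery from the same preset voice.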

The company credits improvements in its audio models to pretraining with authentic datasets, advanced distillation methodologies, and reinforcement learning. Distillation techniques have enabled smaller models to retain conversational quality while reducing computational costs.

The new models are available to all developers through OpenAI’s API. OpenAI has also integrated these models with its Agents SDK to simplify development. For real-time, low-latency speech-to-speech applications, OpenAI recommends using its Realtime API.

Looking ahead, OpenAI plans to enhance the intelligence and accuracy of its audio models and explore custom voice options. The company is also engaging with policymakers, researchers, and developers on the implications of synthetic voices. Moreover, OpenAI intends to expand into video, enabling multimodal agentic experiences.

Siddharth Jindal
