Researchers Unveil AudioX—AI Model That Converts Anything to Audio, Music

A new research paper introduces AudioX as a diffusion transformer model that enables various audio generation capabilities.

Researchers from the Hong Kong University of Science and Technology and Moonshot AI have teased a new AI model called AudioX that generates audio and music from multimodal inputs.

AudioX is described as a unified model offering flexible natural language control and seamless processing of inputs that include text, video, image, music, and audio. This differs from the standard domain-specific models that typically focus on a single modality or a limited set of input conditions.

The research paper mentioned use cases like text-to-audio, text-and-video-to-audio, and video-to-audio with AudioX. Notably, the AI model also lets one refine existing audio through a text prompt, improve unprocessed music, and generate music from scratch.

Netizens seem excited about the demo shared on the model’s GitHub repo, highlighting interesting use cases such as generating audio for a tennis video.

The researchers mentioned that they aim to address the scarcity of high-quality multi-modal data, which has been a major bottleneck in the development of versatile audio generation systems. To tackle this, they curated two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset.

“Extensive experimental results show that AudioX not only excels in intra-modal tasks but also significantly improves inter-modal performance, highlighting its potential to advance the field of multi-modal audio generation,” the research paper stated.

Currently, the code for the model is not available. The researchers mentioned it would be available on the GitHub page without specifying a timeframe or licence details.

There are various text-to-music models and some text-to-speech models available, which have seen creative use cases in the AI space. It remains to be seen how AudioX opens up more possibilities.

Ankush Das

I am a tech aficionado and a computer science graduate with a keen interest in AI, Open Source, and Cybersecurity.