Published on March 20, 2025

Researchers Unveil AudioX—AI Model That Converts Anything to Audio, Music

A new research paper introduces AudioX as a diffusion transformer model that enables various audio generation capabilities.

by Ankush Das

Researchers from the Hong Kong University of Science and Technology and Moonshot AI have teased a new AI model called AudioX that generates audio and music from multimodal inputs.

AudioX is described as a unified model offering flexible natural language control and seamless processing of inputs that include text, video, image, music, and audio. This differs from the standard domain-specific models that typically focus on a single modality or a limited set of input conditions.

The research paper mentioned use cases like text-to-audio, text-and-video-to-audio, and video-to-audio with AudioX. Notably, the AI model also lets one refine existing audio through a text prompt, improve unprocessed music, and generate music from scratch.

Netizens seem excited about the demo shared on the model’s GitHub repo, highlighting use cases such as generating audio for a tennis video:

AudioX : Anything-to-Audio Generation

Mindblowing, I could not believe that tennis example it was just too good. pic.twitter.com/EA8clWlqmF
— AshutoshShrivastava (@ai_for_success) March 19, 2025

The researchers mentioned that they aim to address the scarcity of high-quality multi-modal data, which has been a major bottleneck in the development of versatile audio generation systems. To tackle this, they curated two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset.

“Extensive experimental results show that AudioX not only excels in intra-modal tasks but also significantly improves inter-modal performance, highlighting its potential to advance the field of multi-modal audio generation,” the research paper stated.

The code for the model is not yet available. The researchers said it would be released on the GitHub page, but did not specify a timeframe or licence details.

There are various text-to-music models and some text-to-speech models available, which have seen creative use cases in the AI space. It remains to be seen how AudioX opens up more possibilities.


