Researchers from the Hong Kong University of Science and Technology and Moonshot AI have teased AudioX, a new AI model that generates audio and music from multimodal inputs.
AudioX is described as a unified model offering flexible natural language control and seamless processing of inputs that include text, video, image, music, and audio. This differs from the standard domain-specific models that typically focus on a single modality or a limited set of input conditions.

The research paper lists use cases for AudioX such as text-to-audio, video-to-audio, and text-and-video-to-audio generation. Notably, the model can also refine existing audio through a text prompt, improve unprocessed music, and generate music from scratch.
Netizens seem excited about the demo shared on the model’s GitHub repo, highlighting use cases such as generating audio for a tennis video:
"AudioX : Anything-to-Audio Generation. Mindblowing, I could not believe that tennis example it was just too good." pic.twitter.com/EA8clWlqmF
(AshutoshShrivastava, @ai_for_success, March 19, 2025)
The researchers mentioned that they aim to address the scarcity of high-quality multi-modal data, which has been a major bottleneck in the development of versatile audio generation systems. To tackle this, they curated two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset.
“Extensive experimental results show that AudioX not only excels in intra-modal tasks but also significantly improves inter-modal performance, highlighting its potential to advance the field of multi-modal audio generation,” the research paper stated.
The code for the model is not yet available. The researchers said it would be released on the GitHub page, but did not specify a timeframe or licence details.
Various text-to-music and text-to-speech models are already available and have seen creative uses in the AI space. It remains to be seen what further possibilities AudioX opens up.