Art2Music: AI Harmonizes Images and Text into Melodies

In the rapidly evolving world of AI-generated content, creating music that feels naturally aligned with visual and textual inputs is a significant challenge. Traditional methods often depend on explicit emotion labels, which can be expensive and time-consuming to annotate. To address this, researchers Jiaying Hong, Ting Zhu, Thanet Markchom, and Huizhi Liang have developed a novel approach called Art2Music, which generates music from artistic images and user comments without relying on costly annotations.

The team constructed ArtiCaps, a pseudo feeling-aligned image-music-text dataset, by semantically matching descriptions from ArtEmis and MusicCaps. This dataset serves as the foundation for their lightweight cross-modal framework, Art2Music. The framework operates in two stages. In the first stage, images and text are encoded using OpenCLIP and fused using a gated residual module. The fused representation is then decoded by a bidirectional LSTM into Mel-spectrograms, with a frequency-weighted L1 loss applied to enhance high-frequency fidelity. In the second stage, a fine-tuned HiFi-GAN vocoder reconstructs high-quality audio waveforms.
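To make the first stage concrete, the sketch below shows one plausible way to implement a gated residual fusion of image and text embeddings and a frequency-weighted L1 loss over Mel-spectrograms. The class and function names, the gating layout, and the linear frequency-weight ramp are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GatedResidualFusion(nn.Module):
    """Fuse image and text embeddings with a learned gate plus a residual path (assumed design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([img_emb, txt_emb], dim=-1)
        g = self.gate(joint)            # per-dimension gate in (0, 1)
        fused = self.proj(joint)
        # Residual connection keeps the image embedding as the default signal.
        return img_emb + g * fused

def freq_weighted_l1(pred_mel: torch.Tensor, target_mel: torch.Tensor) -> torch.Tensor:
    """L1 loss with larger weights on higher Mel bins (linear ramp is an assumption)."""
    n_mels = pred_mel.shape[-1]
    weights = torch.linspace(1.0, 2.0, n_mels, device=pred_mel.device)
    return (weights * (pred_mel - target_mel).abs()).mean()

# Example: fuse OpenCLIP-style 512-d embeddings and score a predicted spectrogram.
fusion = GatedResidualFusion(dim=512)
img, txt = torch.randn(4, 512), torch.randn(4, 512)
fused = fusion(img, txt)                              # shape (4, 512)
pred, target = torch.randn(4, 256, 80), torch.randn(4, 256, 80)
loss = freq_weighted_l1(pred, target)
```

In a setup like this, the fused vector would condition the bidirectional LSTM decoder, and the predicted Mel-spectrogram would then be passed to the fine-tuned HiFi-GAN vocoder for waveform reconstruction.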

The researchers evaluated Art2Music on the ArtiCaps dataset and observed improvements across several metrics, including Mel-Cepstral Distortion, Fréchet Audio Distance, Log-Spectral Distance, and cosine similarity. These results indicate enhanced perceptual naturalness, spectral fidelity, and semantic consistency. Additionally, a small LLM-based rating study confirmed consistent cross-modal feeling alignment and provided interpretable explanations for matches and mismatches across modalities.
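For reference, two of these metrics are straightforward to compute from spectra and embeddings. The snippet below is only a formula sketch; the exact spectral settings and embedding models used in the paper's evaluation are not specified here.

```python
import numpy as np

def log_spectral_distance(ref_spec: np.ndarray, gen_spec: np.ndarray, eps: float = 1e-8) -> float:
    """Log-Spectral Distance: RMS difference of log power spectra, averaged over frames."""
    log_diff = 10 * np.log10(ref_spec + eps) - 10 * np.log10(gen_spec + eps)
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=-1))))

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (e.g., audio vs. text)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Example with random stand-ins for (frames x freq-bins) power spectra and 512-d embeddings.
ref, gen = np.abs(np.random.randn(100, 513)) ** 2, np.abs(np.random.randn(100, 513)) ** 2
print(log_spectral_distance(ref, gen))
print(cosine_similarity(np.random.randn(512), np.random.randn(512)))
```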

One of the standout features of Art2Music is its robustness. The framework maintains strong performance even when trained on only 50k samples, making it a data-efficient and scalable solution for feeling-aligned creative audio generation. This advancement opens up exciting possibilities for interactive art, personalized soundscapes, and digital art exhibitions. By enabling the creation of music that resonates with visual and textual inputs, Art2Music could change the way we experience and interact with multimedia content.