V2A-MAPPER

Updated: 23 days ago
ID: 52909838/1
In this project, we propose a lightweight solution to vision-to-audio generation by leveraging foundation models (FMs), specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP model and the auditory CLAP model. We then propose a simple yet effective mapper mechanism, V2A-Mapper, to bridge this gap by translating the visual input from the CLIP space to the CLAP space. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is used to generate the corresponding audio. Our lightweight method only requires training the V2A-Mapper to bridge the domain gap between the vision representation FM CLIP and the audio generative FM AudioLDM; the V2A-Mapper is supervised by the audio representation FM CLAP to learn the translation from the visual space to the auditory space. By leveraging the generalization and knowledge-transfer ability of foundation models, the V2A-Mapper is trained with the same modestly sized dataset, yet the overall system achieves much better performance.
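As a rough illustration of the idea, the following is a minimal, hypothetical PyTorch sketch of such a mapper: an MLP that translates a frozen CLIP image embedding into the CLAP embedding space, supervised by the CLAP embedding of the paired ground-truth audio. The architecture, embedding dimensions, loss, and all names below are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of a V2A-Mapper-style translation module.
# Assumptions: CLIP and CLAP embeddings are 512-d, the mapper is a small MLP,
# and supervision is an MSE + cosine objective against the CLAP audio embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class V2AMapper(nn.Module):
    """Maps a visual CLIP embedding into the CLAP audio embedding space."""
    def __init__(self, clip_dim: int = 512, clap_dim: int = 512, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, clap_dim),
        )

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        return self.net(clip_emb)

def mapper_loss(pred_clap: torch.Tensor, target_clap: torch.Tensor) -> torch.Tensor:
    # Supervision from the CLAP FM: pull the translated embedding toward the
    # CLAP embedding of the paired audio (MSE plus a cosine-alignment term).
    mse = F.mse_loss(pred_clap, target_clap)
    cos = 1.0 - F.cosine_similarity(pred_clap, target_clap, dim=-1).mean()
    return mse + cos

# Hypothetical training step: clip_emb would come from a frozen CLIP image
# encoder, target_clap from a frozen CLAP audio encoder on the paired audio.
mapper = V2AMapper()
optimizer = torch.optim.AdamW(mapper.parameters(), lr=1e-4)
clip_emb = torch.randn(8, 512)     # placeholder batch of visual embeddings
target_clap = torch.randn(8, 512)  # placeholder batch of audio embeddings
loss = mapper_loss(mapper(clip_emb), target_clap)
loss.backward()
optimizer.step()
```

At inference time, the translated CLAP embedding would condition the pretrained AudioLDM generator in place of a text-derived CLAP embedding; that step is omitted here since it depends on the specific AudioLDM interface.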
Interest Score: 3
HIT Score: 0.00
Domain: v2a-mapper.github.io
Actual: v2a-mapper.github.io
IP: 185.199.108.153, 185.199.109.153, 185.199.110.153, 185.199.111.153
Status: OK
Category: Company