Best LVLM and LM designed for sound generation

platapus100 · 10 months ago

Best LVLM and LM designed for sound generation

Dead_Internet_Theory · 10 months ago

The thing is, as far as I’m aware, “sound generation” is always a separate TTS model cobbled on afterwards, and even “vision” is a separate component that describes the image for the language model.

This 13B model (LLaVA) is probably still state of the art in the vision department for open models. A few others crop up now and again, but they didn’t surprise me much.
https://llava-vl.github.io/

If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.

You probably won’t find it all wrapped into one neat package like ChatGPT+ right now, but I’d love to be proven wrong.
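For what it's worth, "cobbled together" usually just means chaining three separate stages: speech-to-text, the language model, then text-to-speech. Below is a rough sketch of that wiring, not anyone's official implementation. `VoicePipeline`, the `fake_*` functions and `whisper_transcriber` are names I made up for illustration. The only real library calls are `whisper.load_model` and `model.transcribe` from the openai-whisper package, and that adapter assumes you have installed it yourself.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain function. These type aliases are hypothetical names.
Transcriber = Callable[[str], str]    # audio file path -> recognized text
ChatModel = Callable[[str], str]      # prompt text -> reply text
Synthesizer = Callable[[str], bytes]  # reply text -> audio bytes


@dataclass
class VoicePipeline:
    """Glue for three independent models; none of them know about each other."""
    transcribe: Transcriber
    chat: ChatModel
    synthesize: Synthesizer

    def run(self, audio_path: str) -> bytes:
        heard = self.transcribe(audio_path)  # e.g. Whisper
        reply = self.chat(heard)             # e.g. a local LLM
        return self.synthesize(reply)        # e.g. Bark or Silero


def whisper_transcriber(model_name: str = "base") -> Transcriber:
    """Optional adapter around openai-whisper (assumed to be installed)."""
    import whisper  # imported lazily so the rest runs without the package
    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path)["text"]


# Stand-in stages so the wiring can be tried without downloading any models.
def fake_transcribe(path: str) -> str:
    return f"transcript of {path}"


def fake_chat(prompt: str) -> str:
    return prompt.upper()


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_chat, fake_synthesize)
audio_out = pipeline.run("question.wav")
```

To try it with real speech recognition, swap `fake_transcribe` for `whisper_transcriber()`. The chat and TTS stages would need their own adapters, which depend on whichever model and voice library you pick.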