I’m pretty new here, so apologies ahead of time if the request comes off as green.
I’m looking for the best options for running an LVLM (any LLM with visual recognition capabilities, like being able to take an image as input) locally. Bonus points for anything that can also help with video / GIF generation.
Also, are there any LMs (if they exist at all) that work with sound / voice recognition and can be run locally?
The thing is, as far as I’m aware, “sound generation” is always a separate TTS thing cobbled together, and even “vision” is a separate thing that describes the image for the AI.
This 13B model (LLaVA) is probably still state of the art in the vision department for open models. A few others crop up now and again, but they didn’t surprise me much.
https://llava-vl.github.io/
If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
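For the audio recognition side, here is a rough sketch of what local transcription with Faster-Whisper might look like. It assumes you have installed the `faster-whisper` package and are fine running on CPU with int8 quantization. The `Segment` dataclass, the function names, and the `"small"` model size are just my illustrative choices, not anything from the project itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Stand-in with the fields used below (start, end, text)."""
    start: float
    end: float
    text: str


def format_transcript(segments: Iterable) -> str:
    """Turn timed segments into one readable line per segment."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text.strip()}")
    return "\n".join(lines)


def transcribe_file(path: str, model_size: str = "small") -> str:
    """Transcribe an audio file locally (assumes faster-whisper is installed)."""
    # Imported lazily so the formatting helper works without the package.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 weights; a GPU setup would use different values.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path, beam_size=5)
    return format_transcript(segments)
```

Calling `transcribe_file("my_recording.mp3")` (a hypothetical file name) should download the model on first use and return the timestamped text.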
You probably won’t find it all wrapped into one neat package like ChatGPT+ right now, but I’d love to be proven wrong.
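Until something like that shows up, the "cobbled together" approach is mostly glue code. Below is a minimal sketch of the idea, where a vision model describes the image and a text-only LLM answers based on that description. The `describe_image` and `ask_llm` callables are hypothetical placeholders for whatever local models you end up running, and the prompt format is only an assumption.

```python
from typing import Callable


def answer_about_image(
    image_path: str,
    question: str,
    describe_image: Callable[[str], str],
    ask_llm: Callable[[str], str],
) -> str:
    """Chain a captioning model and a text LLM (both supplied by the caller)."""
    # Step 1: the vision component turns the image into plain text.
    caption = describe_image(image_path)
    # Step 2: the text-only LLM sees just the description, never the pixels.
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return ask_llm(prompt)
```

A text-to-speech step could be appended the same way, by passing the returned string to whichever voice tool you pick.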