I’m pretty new here, so apologies in advance if I’m coming off green with this request.

I’m looking for the best options for running an LVLM (any LLM with visual recognition capabilities, like supplying it an image, etc.) locally. Bonus points for anything that can also help with video / GIF generation.

Also, are there any LMs (if any exist at all) that work with sound / voice recognition and can be run locally?

  • Dead_Internet_TheoryB
    10 months ago

    The thing is, as far as I’m aware, “sound generation” is always a separate TTS thing cobbled together, and even “vision” is a separate thing that describes the image for the AI.

    This 13B model is probably still state of the art in the vision department for open models. A few others crop up now and again, but they didn’t surprise me much.
    https://llava-vl.github.io/

    If you need to recognize audio, check Whisper, or Faster-Whisper, or anything developed from that. If you need to generate voice, check Bark, maybe Silero, RVC, etc.
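
    For the audio recognition side, here is a rough sketch of what local transcription with Faster-Whisper can look like. It assumes the `faster-whisper` package is installed and that CPU inference with int8 weights is acceptable; the `"small"` model size and the helper names `format_segments` and `transcribe` are illustrative choices, not anything mandated by the library.

```python
from typing import Iterable, Tuple


def format_segments(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Join (start, end, text) triples into timestamped transcript lines."""
    return "\n".join(
        f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"
        for start, end, text in segments
    )


def transcribe(audio_path: str, model_size: str = "small") -> str:
    """Transcribe a local audio file (hypothetical helper, assumed settings)."""
    # Imported lazily so format_segments works without the package installed.
    from faster_whisper import WhisperModel  # pip install faster-whisper

    # Assumption: CPU with int8 quantization; use device="cuda" if a GPU is available.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return format_segments((s.start, s.end, s.text) for s in segments)
```

    Calling `transcribe("clip.mp3")` (a placeholder file name) would download the model on first use and return one timestamped line per segment. The resulting text can then be pasted or piped into whatever local LLM you are running, which matches the "separate pieces cobbled together" setup described above.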

    You probably won’t find it all wrapped into one neat package like ChatGPT+ right now, but I’d love to be proven wrong.