I've been thinking about this for a while. Does anyone know how feasible this is? Basically, just applying some sort of “LoRA” on top of models to give them vision capabilities, making them multimodal.