This style of captioning could be amazing for text-to-image datasets, and I wouldn’t be surprised to see them take a jump in quality as well.
This looks good. Imagine this thing quantized. Pretty please, u/The-Bloke, make it possible.
This is kinda nuts (first time I've tried an LLM + vision).
Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy and it did, precisely.