ShareGPT4V - New multi-modal model, improves on LLaVA

Cradawx · 2 years ago

metalman123 · 2 years ago

This style of captioning could be amazing for text-to-image datasets, and I wouldn’t be surprised to see them take a jump in quality as well.

StraightChemistry629 · 2 years ago

This looks good. Imagine this thing quantized. Pretty please, u/The-Bloke, make it possible.

GeraltOfRiga · 2 years ago

This is kinda nuts (first time I've tried an LLM + vision).

Tried it with a first-person shooter screenshot with an enemy on screen. Asked it to give me the 2D coordinates of the enemy, and it did, precisely.