As I was asking above, I’ve been looking at the Fuyu-8b model, and I’ve been able to break it down to the following (with a rough sketch of the flow after the list):

  • the model takes in text the regular way: text -> tokens -> embeddings
  • it also takes the image straight to embeddings: image -> patches -> linear projection -> embeddings
  • it has a vanilla decoder, so only text comes out; they add special tokens around the image patches, so I’m assuming the decoder never produces image tokens
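To check that I’ve got the flow right, here’s a toy sketch of it in PyTorch. All the sizes and names here are made up by me, and I’m using an encoder stack as a stand-in for the causal decoder, so this is not Fuyu’s actual implementation:

```python
import torch
import torch.nn as nn

# Toy sizes purely for illustration -- not Fuyu's real config
d_model, vocab_size, patch_dim = 64, 1000, 3 * 30 * 30

tok_embed = nn.Embedding(vocab_size, d_model)   # text -> tokens -> embeddings
patch_proj = nn.Linear(patch_dim, d_model)      # flattened patches -> embeddings

# Stand-in for the decoder stack (the real thing is a causal transformer decoder)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, vocab_size, (1, 8))   # fake text: 8 token ids
patches = torch.rand(1, 16, patch_dim)          # fake image: 16 flattened patches

# One interleaved sequence goes into a single decoder; only text logits come out
seq = torch.cat([patch_proj(patches), tok_embed(tokens)], dim=1)
hidden = decoder(seq)
print(hidden.shape)  # torch.Size([1, 24, 64])
```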

So, from what I know, nn.Linear takes in a tensor and projects it to an embedding of whatever size you choose. I’m not really sure about everything else, though.
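For example, here’s the part I do understand (sizes made up for illustration; 2700 would be a flattened 30x30x3 patch):

```python
import torch
import torch.nn as nn

proj = nn.Linear(2700, 4096)   # flattened 30x30x3 patch -> a 4096-dim embedding
patch = torch.rand(2700)       # any tensor whose last dimension is 2700
emb = proj(patch)
print(emb.shape)               # torch.Size([4096])
```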

  • Since the linear layer just makes embeddings, does the image encoder even need training?
  • nn.Linear takes tensors as input, and they split the image into patches, so I’m assuming those patches are turned into tensors. How do you turn an image into a tensor? A code snippet going image -> embedding -> image would be nice if possible (my rough attempt is after this list)
  • While Fuyu does not output images, wouldn’t the model’s hidden states contain image or image-like embeddings? Could you generate images if you attached an image decoder?
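And here’s my rough attempt at the image -> tensor -> embedding direction from the second bullet, plus a toy projection going back the other way. The 30x30 patch size is from the Fuyu write-up, but everything else (the file path, the 4096 size, the unfold-based patchify) is my own guess, and nothing here is trained:

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

P = 30                                              # Fuyu reportedly uses 30x30 patches
img = pil_to_tensor(Image.open("cat.png").convert("RGB")).float() / 255  # (3, H, W)

C, H, W = img.shape
img = img[:, : H - H % P, : W - W % P]              # crop so the patch grid fits exactly

# Cut the image into non-overlapping PxP patches and flatten each one
patches = img.unfold(1, P, P).unfold(2, P, P)       # (3, H//P, W//P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)  # (num_patches, 2700)

proj = nn.Linear(C * P * P, 4096)                   # untrained here; learned in the real model
image_embeddings = proj(patches)                    # (num_patches, 4096)

# A toy "image decoder" would just be another projection back to pixel space --
# untrained, so this produces noise, but it shows the round trip in principle
deproj = nn.Linear(4096, C * P * P)
recon = deproj(image_embeddings).reshape(-1, C, P, P)  # (num_patches, 3, P, P)
print(image_embeddings.shape, recon.shape)
```

If that’s roughly right, then I’d guess a learned deproj (trained with some reconstruction objective) is what an actual image decoder would need, which is really what my third bullet is asking.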