As I was asking above, I’ve been looking at the Fuyu-8b model, and I’ve been able to break it down to the following (with a rough sketch of the flow after the list):

  • the model takes in text the regular way: text -> tokens -> embeddings
  • it also takes the image straight to embeddings: image -> patches -> linear projection -> embeddings
  • it has a vanilla decoder, so only text comes out; they add special tokens around the image patches, so I’m assuming the decoder never produces image tokens
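To check that I’ve got the flow right, here’s a toy sketch of it in PyTorch. All the sizes and names here are made up by me, and I’m using an encoder stack as a stand-in for the causal decoder, so this is not Fuyu’s actual implementation:

```python
import torch
import torch.nn as nn

# Toy sizes purely for illustration -- not Fuyu's real config
d_model, vocab_size, patch_dim = 64, 1000, 3 * 30 * 30

tok_embed = nn.Embedding(vocab_size, d_model)   # text -> tokens -> embeddings
patch_proj = nn.Linear(patch_dim, d_model)      # flattened patches -> embeddings

# Stand-in for the decoder stack (the real thing is a causal transformer decoder)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, vocab_size, (1, 8))   # fake text: 8 token ids
patches = torch.rand(1, 16, patch_dim)          # fake image: 16 flattened patches

# One interleaved sequence goes into a single decoder; only text logits come out
seq = torch.cat([patch_proj(patches), tok_embed(tokens)], dim=1)
hidden = decoder(seq)
print(hidden.shape)  # torch.Size([1, 24, 64])
```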

So, from what I know, nn.Linear takes in a tensor and projects it to an embedding of whatever size you choose. I’m not really sure about everything else, though.
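For example, here’s the part I do understand (sizes made up for illustration; 2700 would be a flattened 30x30x3 patch):

```python
import torch
import torch.nn as nn

proj = nn.Linear(2700, 4096)   # flattened 30x30x3 patch -> a 4096-dim embedding
patch = torch.rand(2700)       # any tensor whose last dimension is 2700
emb = proj(patch)
print(emb.shape)               # torch.Size([4096])
```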

  • Since the linear layer just makes embeddings, does the image encoder even need training?
  • nn.Linear takes tensors as input, and they split the image into patches, so I’m assuming those patches are turned into tensors. How do you turn an image into a tensor? A code snippet going image -> embedding -> image would be nice if possible (my rough attempt is after this list)
  • While Fuyu does not output images, wouldn’t the model’s hidden states contain image or image-like embeddings? Could you generate images if you attached an image decoder?
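And here’s my rough attempt at the image -> tensor -> embedding direction from the second bullet, plus a toy projection going back the other way. The 30x30 patch size is from the Fuyu write-up, but everything else (the file path, the 4096 size, the unfold-based patchify) is my own guess, and nothing here is trained:

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision.transforms.functional import pil_to_tensor

P = 30                                              # Fuyu reportedly uses 30x30 patches
img = pil_to_tensor(Image.open("cat.png").convert("RGB")).float() / 255  # (3, H, W)

C, H, W = img.shape
img = img[:, : H - H % P, : W - W % P]              # crop so the patch grid fits exactly

# Cut the image into non-overlapping PxP patches and flatten each one
patches = img.unfold(1, P, P).unfold(2, P, P)       # (3, H//P, W//P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)  # (num_patches, 2700)

proj = nn.Linear(C * P * P, 4096)                   # untrained here; learned in the real model
image_embeddings = proj(patches)                    # (num_patches, 4096)

# A toy "image decoder" would just be another projection back to pixel space --
# untrained, so this produces noise, but it shows the round trip in principle
deproj = nn.Linear(4096, C * P * P)
recon = deproj(image_embeddings).reshape(-1, C, P, P)  # (num_patches, 3, P, P)
print(image_embeddings.shape, recon.shape)
```

If that’s roughly right, then I’d guess a learned deproj (trained with some reconstruction objective) is what an actual image decoder would need, which is really what my third bullet is asking.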