I spun up a simple project (home surveillance system) to play around with ShareGPT4V-7B and made quite a bit of progress over the last few days. However, I’m having a really hard time figuring out how to send a simple prompt along with the image-to-text request. Here is the relevant code:

document.getElementById('send-chat').addEventListener('click', async () => {
  const message = document.getElementById('chat-input').value;
  appendUserMessage(message);
  document.getElementById('chat-input').value = '';
  const imageElement = document.getElementById('frame-display');
  // Strip the surrounding url("...") wrapper to get the raw image URL
  const imageUrl = imageElement.style.backgroundImage.slice(5, -2);

  try {
    const imageBlob = await fetch(imageUrl).then(res => res.blob());
    const reader = new FileReader();
    reader.onloadend = async () => {
      const base64data = reader.result.split(',')[1];

      const imageData = {
        data: base64data,
        id: 1
      };

      const payload = {
        prompt: message,
        image_data: [imageData],
        n_predict: 256,
        top_p: 0.5,
        temp: 0.2
      };

      const response = await fetch("http://localhost:8080/completion", {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(payload)
      });

      const data = await response.json();
      console.log(data);
      appendAiResponse(data.content);
    };
    reader.readAsDataURL(imageBlob);
  } catch (error) {
    console.error('Error encoding image or sending request:', error);
  }
});

The only thing that works is sending a single space (or occasionally a question mark), which gets me a general description of the image. What I really want is to instruct the model so it knows what to look for. Is that currently possible? Basically, system prompting the vision model.

  • paryska99B · 1 year ago

    Doesn’t the LlamaCpp server host a GUI for multimodal? You could potentially visit it, open the developer panel in your browser, and observe the HTTP requests being sent.

    • LyPretoOPB · 1 year ago

      I ended up scrutinizing the server code to understand it better and found that the prompt needs to follow a very specific format, or else it won’t work well:

      prompt: `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.\nUSER:[img-12]${message}\nASSISTANT:`
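      A minimal sketch of how the payload from the click handler above could be adapted to this template. The key detail is that the `[img-N]` tag embedded in the prompt must match the `id` field in `image_data`; the endpoint and sampling parameters are the ones from the original code, and `buildVisionPayload` is just an illustrative helper name:

      ```javascript
      // Build a llama.cpp /completion payload whose prompt embeds the image
      // via an [img-N] placeholder that matches the id in image_data.
      function buildVisionPayload(message, base64data) {
        const imageId = 12; // must agree with the [img-12] tag in the prompt

        const prompt =
          'A chat between a curious human and an artificial intelligence assistant. ' +
          "The assistant gives helpful, detailed, and polite answers to the human's questions." +
          `\nUSER:[img-${imageId}]${message}\nASSISTANT:`;

        return {
          prompt,
          image_data: [{ data: base64data, id: imageId }],
          n_predict: 256,
          top_p: 0.5,
          temp: 0.2
        };
      }
      ```

      The `USER:` line is where your instruction goes, so "what to look for" can simply be written there (e.g. "List any people or vehicles visible in this frame").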