I spun up a simple project (home surveillance system) to play around with ShareGPT4V-7B and made quite a bit of progress over the last few days. However, I’m having a really hard time figuring out how to send a simple prompt along with the image-to-text request. Here is the relevant code:
document.getElementById('send-chat').addEventListener('click', async () => {
    const message = document.getElementById('chat-input').value;
    appendUserMessage(message);
    document.getElementById('chat-input').value = '';
    const imageElement = document.getElementById('frame-display');
    const imageUrl = imageElement.style.backgroundImage.slice(5, -2);
    try {
        const imageBlob = await fetch(imageUrl).then(res => res.blob());
        const reader = new FileReader();
        reader.onloadend = async () => {
            const base64data = reader.result.split(',')[1];
            const imageData = {
                data: base64data,
                id: 1
            };
            const payload = {
                prompt: message,
                image_data: [imageData],
                n_predict: 256,
                top_p: 0.5,
                temp: 0.2
            };
            const response = await fetch("http://localhost:8080/completion", {
                method: 'POST',
                headers: { 'Content-Type': 'application/json' },
                body: JSON.stringify(payload)
            });
            const data = await response.json();
            console.log(data);
            appendAiResponse(data.content);
        };
        reader.readAsDataURL(imageBlob);
    } catch (error) {
        console.error('Error encoding image or sending request:', error);
    }
});
The only thing that works is sending a single space, or sometimes a question mark, which gets me a general interpretation of the image. What I really want is to instruct the model so it knows what to look for. Is that currently possible? Basically, system-prompting the vision model.
Doesn’t the LlamaCpp server host a GUI for multimodal? You could potentially visit it, open the developer panel in your browser, and observe the HTTP requests being sent.
I ended up just scrutinizing the server code to understand it better and found that the prompt needs to follow a very specific format or else it won’t work well:
prompt: `A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human’s questions.\nUSER:[img-12]${message}\nASSISTANT:`
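For anyone landing here later, this is roughly how that format could be wired into the request from the first post. Treat it as a sketch, not the server's official client: `buildVisionPayload` and `describeImage` are names I made up for illustration. Two assumptions worth checking against your server version: as far as I can tell from the server README, the number in `[img-12]` has to match the `id` of the entry in `image_data` (the original snippet sent `id: 1`, which would not line up with `[img-12]`), and the sampling field is spelled `temperature`, not `temp`.

```javascript
// Fixed preamble from the prompt format above (plain apostrophe used here).
const SYSTEM_PROMPT =
  "A chat between a curious human and an artificial intelligence assistant. " +
  "The assistant gives helpful, detailed, and polite answers to the human's questions.";

// Hypothetical helper: builds the JSON body for the /completion endpoint.
// imageId is arbitrary, but the same number is used in both the [img-N] tag
// and the image_data entry (assumption: the server matches them by id).
function buildVisionPayload(message, base64Image, imageId = 12) {
  return {
    prompt: `${SYSTEM_PROMPT}\nUSER:[img-${imageId}]${message}\nASSISTANT:`,
    image_data: [{ data: base64Image, id: imageId }],
    n_predict: 256,
    top_p: 0.5,
    temperature: 0.2, // assumption: "temperature" is the field the server reads
  };
}

// Hypothetical helper: posts the payload and returns the generated text.
// The URL is the one from the original post.
async function describeImage(message, base64Image,
                             serverUrl = "http://localhost:8080/completion") {
  const response = await fetch(serverUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildVisionPayload(message, base64Image)),
  });
  if (!response.ok) {
    throw new Error(`Server responded with HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.content;
}
```

Inside the `reader.onloadend` handler you would then call something like `appendAiResponse(await describeImage(message, base64data));` in place of building the payload by hand.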