I’d like to fine-tune a text-to-image model to know certain faces of people I’m working with. I’ve been experimenting a bit, and I can get images that are reminiscent of a person but don’t really look like them. I also need to provide more detail in the prompt than I would expect.
For example, one person is a big guy with a mustache and glasses. I fine-tuned on a few images of him, using his actual name as the caption in the training dataset.
When I generate images with his name as the subject, none of the faces have a mustache or glasses. If I prompt “Mark Smith with mustache and glasses doing xyz”, it does look slightly more reminiscent of him, but still not quite right.
What should my strategy be to improve this? Do I need more images of him? Should I hash his name (or something similar) into a unique token for the captions, so that the model’s existing associations with common words don’t interfere? (A rough sketch of what I mean is below.) Other ideas?
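To make that idea concrete, here’s a minimal sketch of what I have in mind. The `sks` prefix, the paths, the class noun, and the one-caption-file-per-image layout (a convention used by kohya-style trainers) are just assumptions on my part, not anything from my actual setup:

```python
import hashlib
from pathlib import Path

def rare_token(name: str) -> str:
    """Map a real name to a short token the text encoder has
    (hopefully) never seen, so pretrained associations with the
    name itself can't interfere."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:6]
    return f"sks{digest}"  # a short, unlikely token: "sks" + 6 hex chars

def write_captions(image_dir: str, name: str, class_noun: str = "man") -> None:
    """Write one .txt caption per image (an assumption about tooling).
    The caption deliberately does NOT mention the mustache or glasses,
    so those traits bind to the token rather than being factored out
    as separate prompt-controlled attributes."""
    token = rare_token(name)
    for img in Path(image_dir).glob("*.jpg"):
        img.with_suffix(".txt").write_text(f"a photo of {token} {class_noun}")

# Hypothetical directory and name, for illustration only.
write_captions("data/mark_smith", "Mark Smith")
print(rare_token("Mark Smith"))
```

The idea being that at generation time I’d prompt with the same token (whatever `rare_token("Mark Smith")` prints) in place of his real name.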
I realize I could just experiment, but fine-tuning runs are expensive and I don’t want to go in the wrong direction too many times.