What is the best approach to achieving multimodality with an instruction fine-tuned model?