Hello all, I’d love your help thinking through this problem. I’ll briefly describe it, followed by the solutions I’m considering. Would love any thoughts/feedback.
Problem: I have a bunch of keywords about a product that I want to classify into a fixed set of categories. I can provide a description of the product and examples for each category. Specifically, I want to identify the irrelevant keywords.
Now, I have a lot of products (let’s say 500) and 100,000 keywords per product.
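To give a sense of the scale, here is a rough back-of-envelope sketch. The product and keyword counts are the ones above; the batch size of 50 keywords per request is purely an assumption for illustration, not something I’ve measured.

```python
import math

NUM_PRODUCTS = 500                # from the numbers above
KEYWORDS_PER_PRODUCT = 100_000    # from the numbers above
BATCH_SIZE = 50                   # assumption: keywords packed into one LLM request

# Total number of keyword classifications needed across all products.
total_keywords = NUM_PRODUCTS * KEYWORDS_PER_PRODUCT

# Number of requests if keywords are batched per product.
requests_per_product = math.ceil(KEYWORDS_PER_PRODUCT / BATCH_SIZE)
total_requests = NUM_PRODUCTS * requests_per_product

print(f"total keywords:       {total_keywords:,}")
print(f"requests per product: {requests_per_product:,}")
print(f"total requests:       {total_requests:,}")
```

Even with batching, every request still carries the product description and the few-shot examples, which is why prompt size matters so much here.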
Solutions I’m considering:
- Fancy prompt engineering with GPT-4, using function calling or output parsing plus few-shot examples. I feel it can become expensive to pass a large prompt for every keyword (so I might need to pass several keywords at a time)
- Use cosine distance between embeddings of the keywords and of the category examples to classify keywords
- Fine-tune a smaller open-source model so that I end up with a “keyword in, label out” classifier
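To make the embeddings idea concrete, here is a minimal sketch of what I have in mind. Everything in it is hypothetical: `embed` is a toy character-trigram stand-in for a real embedding model, the categories and keywords are made up, and the `min_similarity` threshold of 0.2 is an arbitrary assumption that would need tuning on labelled data.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(keyword, category_examples, min_similarity=0.2):
    """Return (label, score); label is 'irrelevant' if nothing is close enough."""
    kw_vec = embed(keyword)
    best_label, best_score = "irrelevant", 0.0
    for label, examples in category_examples.items():
        # Score a category by its most similar example keyword.
        score = max(cosine(kw_vec, embed(ex)) for ex in examples)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:  # assumed threshold, needs tuning
        return "irrelevant", best_score
    return best_label, best_score


# Hypothetical categories for a single product.
category_examples = {
    "running shoes": ["running shoes", "trail running shoe"],
    "accessories": ["shoe laces", "insoles"],
}

for kw in ["mens running shoes", "pizza oven"]:
    print(kw, "->", classify(kw, category_examples))
```

With a real embedding model, the category example vectors could be computed once per product and reused for all of its keywords, so the per-keyword cost is one embedding plus a few dot products.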
If the 3rd option is suitable, I’d love some direction, such as:
- which model and size is best to fine-tune?
- do I train a separate model for each product, or will one model generalise well across products?
- what dataset size would I need (i.e. how many keyword <> label pairs)?
- which resources/libraries/tools should I refer to?
TIA!
Yes, something like this works, but the prompt is very large to run on 1000s of keywords. Hence I’m looking for something better.