Hello all, I’d love your help thinking through this problem. I’ll briefly describe it, then list the solutions I’m considering. Thoughts and feedback are very welcome.

Problem: For a given product, I have a bunch of keywords that I want to classify into a fixed set of categories. I can provide a description of the product and examples for every category. Specifically, I want to identify the irrelevant keywords.

Now, I have a lot of products (let’s say 500) and 100,000 keywords per product, so roughly 50 million keywords in total.
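
To make the scale concrete, here is a quick back-of-the-envelope calculation. The batch size of 50 keywords per request is a made-up number for illustration, not something I have tested, and the function names are just mine.

```python
import math

NUM_PRODUCTS = 500
KEYWORDS_PER_PRODUCT = 100_000
# Hypothetical: how many keywords would be sent together in one request.
BATCH_SIZE = 50


def total_keywords(num_products: int, keywords_per_product: int) -> int:
    """Total number of keywords across all products."""
    return num_products * keywords_per_product


def requests_needed(num_products: int, keywords_per_product: int, batch_size: int) -> int:
    """Number of requests if keywords are batched per product.

    Batches are counted per product because each request would carry
    one product description alongside its keywords.
    """
    return num_products * math.ceil(keywords_per_product / batch_size)


print(total_keywords(NUM_PRODUCTS, KEYWORDS_PER_PRODUCT))               # 50000000
print(requests_needed(NUM_PRODUCTS, KEYWORDS_PER_PRODUCT, BATCH_SIZE))  # 1000000
```

So even with batching, anything that needs one model call per batch still means on the order of a million calls under these assumptions.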

Solutions I’m considering:

  1. Fancy prompt engineering with GPT-4, using function calling or output parsing plus few-shot examples. I feel it could become expensive to send a large prompt for every keyword, so I might need to pass several keywords at a time.
  2. Use cosine distance between embeddings to classify the keywords.
  3. Fine-tune a smaller open-source model so that I end up with a simple “keyword in, label out” classifier.
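
For option 2, this is roughly what I have in mind. It is only a minimal sketch under several assumptions: `toy_embed` is a placeholder (character trigram counts) standing in for a real embedding model, the category names and example keywords are invented, and the 0.2 threshold is arbitrary and would need tuning on a labelled sample. Treating "not similar enough to any category" as "irrelevant" is also my own assumption about how to find the irrelevant keywords.

```python
import math
from collections import Counter


def toy_embed(text: str, n: int = 3) -> Counter:
    """Placeholder embedding: counts of character n-grams.

    ASSUMPTION: a real setup would call an embedding model here instead.
    """
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(keyword: str, category_examples: dict, threshold: float = 0.2):
    """Return (label, score) for the most similar category.

    Falls back to "irrelevant" when the best score is below the
    (hypothetical) threshold.
    """
    kw_vec = toy_embed(keyword)
    best_label, best_score = "irrelevant", 0.0
    for label, examples in category_examples.items():
        # Score a category by its closest example keyword.
        score = max(
            (cosine_similarity(kw_vec, toy_embed(e)) for e in examples),
            default=0.0,
        )
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return "irrelevant", best_score
    return best_label, best_score


# Invented categories and examples for a hypothetical running-shoe product.
categories = {
    "brand": ["nike running shoes"],
    "feature": ["lightweight running shoes", "waterproof trail shoes"],
}

for kw in ["waterproof trail shoes", "xyzzy qqq"]:
    print(kw, "->", classify(kw, categories))
```

The appeal is that embeddings can be computed once per keyword and reused, so the per-keyword cost is a vector comparison rather than a model call.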

If the third option is suitable, I would love some direction, such as:

  • Which model and size would be best to fine-tune?
  • Do I need to train a model per product, or will one model generalise well across products?
  • What dataset size would I need (i.e. how many keyword/label pairs)?
  • Which resources, libraries, or tools should I look at?

TIA!