oinkyDoinkyDoink


Hello all, I’d love your help thinking through this problem. I’ll briefly describe it, followed by the solutions I’m considering. Would love your thoughts/feedback.

Problem: I have a bunch of keywords about a product that I want to classify into a fixed set of categories. I can provide a description of the product and examples for every category. Specifically, I want to identify the irrelevant keywords.

Now, I have a lot of products (let’s say 500) and about 100,000 keywords per product.
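To put the scale in numbers, here is a quick back-of-the-envelope sketch. The product and keyword counts are the ones from the post; the batch sizes are purely hypothetical values, assuming keywords were sent to some classifier several at a time.

```python
import math

NUM_PRODUCTS = 500               # from the post
KEYWORDS_PER_PRODUCT = 100_000   # from the post

# Total number of keywords that need a label.
total_keywords = NUM_PRODUCTS * KEYWORDS_PER_PRODUCT


def calls_needed(total: int, batch_size: int) -> int:
    """Number of requests if `batch_size` keywords are classified per request."""
    return math.ceil(total / batch_size)


# Hypothetical batch sizes, only to show how the request count scales.
for batch_size in (1, 50, 200):
    print(f"batch_size={batch_size:>3}: {calls_needed(total_keywords, batch_size):,} calls")
```

That is 50 million keywords in total, so per-keyword cost and latency matter a lot for whichever approach is used.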

Solutions I’m considering:

1. Fancy prompt engineering with GPT-4, using function calling or output parsing plus few-shot examples. I feel it could become expensive to pass a large prompt for every keyword, so I might need to pass several keywords at a time.
2. Use cosine distance between embeddings to help me classify keywords.
3. Fine-tune a smaller open-source model on this, so that I end up with “keyword in, label out”.
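For the embeddings idea, a minimal sketch of what I have in mind, assuming the keyword and category embeddings are already computed by some embedding model. The 3-dimensional vectors, the category names, and the `min_similarity` threshold are all made up for illustration; a real threshold would have to be tuned on labelled examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(keyword_vec, category_vecs, min_similarity=0.5):
    """Return (label, score) for the most similar category.

    If even the best category scores below `min_similarity`
    (a hypothetical threshold), the keyword is labelled "irrelevant".
    """
    best_label, best_score = "irrelevant", -1.0
    for label, vec in category_vecs.items():
        score = cosine_similarity(keyword_vec, vec)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < min_similarity:
        return "irrelevant", best_score
    return best_label, best_score


# Toy stand-ins for real embeddings (hypothetical categories and vectors).
category_vecs = {
    "features": [1.0, 0.0, 0.0],
    "pricing": [0.0, 1.0, 0.0],
}

label_a = classify([0.9, 0.1, 0.0], category_vecs)  # close to "features"
label_b = classify([0.0, 0.0, 1.0], category_vecs)  # close to nothing
print(label_a[0], label_b[0])
```

Each category vector here could be the average embedding of that category's example keywords, which is one way to use the examples I already have.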

If the third option is suitable, I’d love some direction, such as:

Which model and size would be best to fine-tune?
Do I train a model per product, or will one model generalise well across products?
What dataset size would I need (i.e. how many keyword <> label pairs)?
Which resources/libraries/tools should I refer to?
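To make the keyword <> label pairs concrete, here is a rough sketch of how the training data could be laid out as JSONL. The field names (`text`, `label`), the label names, and the example product and keywords are all hypothetical. Putting the product description into the input is just one possible way to let a single model be shared across products; whether that actually generalises is exactly what I’m unsure about.

```python
import json


def make_example(product_description, keyword, label):
    """Build one training record; the field names are hypothetical."""
    return {
        "text": f"Product: {product_description}\nKeyword: {keyword}",
        "label": label,
    }


def to_jsonl(examples):
    """Serialise records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(e, ensure_ascii=False) for e in examples)


# Made-up product and keywords, for illustration only.
examples = [
    make_example("Wireless noise-cancelling headphones", "bluetooth headphones", "relevant"),
    make_example("Wireless noise-cancelling headphones", "garden hose", "irrelevant"),
]

jsonl = to_jsonl(examples)
print(jsonl)
```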

TIA!

Keyword Labeling/Classification System
