I tried some code generation models on Hugging Face, but the responses I got were really poor even though I clearly explained what I needed in the prompt. My assumption is that this is because my question relates to a niche framework, and those models were trained on large datasets covering a wide variety of languages and may never have come across the framework I'm working with. I'm not looking for a general model but one that is specific to this not-so-popular framework, so I'm guessing I'll have to build a custom dataset.
I also don't need the model to know that many languages. If I can get it to generate just Python, JavaScript, Golang, and C, that alone would be great, and I could make do with even fewer. Does that mean I'll end up with a smaller model that's suitable for inference on an RTX 4090?
How will the model understand what I'm asking it? Do I also need to scrape Stack Overflow and some forums for the specific tags I'm interested in?
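For the Stack Overflow part, this sketch is roughly what I had in mind: pulling questions for one tag through the Stack Exchange API. The tag name and parameters here are just my guesses and I haven't dealt with pagination or quotas yet, so is this even the right direction?

```python
import requests

# Rough sketch: fetch questions for a single tag from the Stack Exchange API.
# The tag ("python") and parameters are placeholders for what I'd actually use.
resp = requests.get(
    "https://api.stackexchange.com/2.3/questions",
    params={
        "order": "desc",
        "sort": "votes",
        "tagged": "python",       # would swap in my framework's tag here
        "site": "stackoverflow",
        "filter": "withbody",     # include the question body text
        "pagesize": 50,
    },
    timeout=30,
)
for q in resp.json().get("items", []):
    print(q["title"])
```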
How do I go about creating such a dataset? I can scrape from multiple sources, but what format am I supposed to put it all in for training?
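From the fine-tuning examples I've seen, the data often ends up as JSONL with one prompt/completion pair per line, so my guess looks like the sketch below. The field names and the example content are placeholders; I don't know what a given training script actually expects.

```python
import json

# My guess at the training format: one JSON object per line,
# each holding a prompt and the expected completion.
# "prompt"/"completion" are placeholder field names.
examples = [
    {
        "prompt": "Write a handler in <my niche framework> that returns JSON.",
        "completion": "def handler(req):\n    return {\"ok\": True}",
    },
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Is that the kind of structure I should be aiming for, or does the format depend entirely on the training framework I pick?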
I am doing this for the first time.