I have about 10k vulnerabilities found across roughly 100 C++ projects. As a learning exercise, I would like to try training an LLM to highlight the vulnerabilities in a given file. Each vulnerability report contains:

  • a title and a description
  • a link to either a file or a particular line of the file (or more!)

I’m only at the thinking stage, but I wonder how I would build the dataset. Ideally I would pair each report with the file it concerns. But as far as I understand, the context window won’t let me fit a 300ish-line file together with a 1k-character vulnerability report. Even if the context window weren’t an issue, there would still be the problem that multiple vulnerability reports can point to the same file.
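To sanity-check the context window concern, here is a rough back-of-the-envelope sketch in Python. The 4-characters-per-token ratio, the 40-character average line length, and the 4096-token window are all assumptions for illustration, not measured values; a real check would use the actual tokenizer of whichever model gets picked.

```python
def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; chars_per_token is an assumed heuristic."""
    return int(num_chars / chars_per_token)


def fits_in_context(file_lines: int,
                    report_chars: int,
                    avg_line_chars: int = 40,    # assumed average line length
                    context_tokens: int = 4096,  # assumed model window
                    ) -> bool:
    """Return True if file plus report fit in the assumed context window."""
    total_chars = file_lines * avg_line_chars + report_chars
    return estimate_tokens(total_chars) <= context_tokens


# A 300-line file with a 1k-character report, under the assumptions above.
print(fits_in_context(file_lines=300, report_chars=1000))
```

The answer depends entirely on the assumed numbers, so it is worth rerunning with the real average line length and the real window size.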

So maybe pairing one file with a list of vulnerability summaries and their line numbers would do the trick.
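As a minimal sketch of that idea in Python, the reports can be grouped by file so that each training example holds one file and all of its findings. The `Report` fields, the `build_examples` helper, and the sample paths and titles are hypothetical names made up for illustration; the real report format will differ.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Report:
    # Hypothetical report shape; the field names are assumptions.
    title: str
    file_path: str
    lines: list[int] = field(default_factory=list)  # empty means whole file


def build_examples(reports: list[Report], sources: dict[str, str]) -> list[dict]:
    """Group reports by file: one example per file, many findings per example."""
    by_file: dict[str, list[Report]] = defaultdict(list)
    for report in reports:
        by_file[report.file_path].append(report)

    examples = []
    for path, file_reports in sorted(by_file.items()):
        if path not in sources:
            continue  # skip reports whose file content is not available
        findings = [
            {"summary": r.title, "lines": sorted(r.lines)}
            for r in file_reports
        ]
        examples.append({"input": sources[path], "target": findings})
    return examples


# Made-up sample data, only to show the shape of the output.
reports = [
    Report("Unchecked memcpy length", "src/parse.cpp", [42]),
    Report("Use after free in cleanup", "src/parse.cpp", [88, 91]),
    Report("Integer overflow in size computation", "src/alloc.cpp", [17]),
]
sources = {"src/parse.cpp": "// file contents", "src/alloc.cpp": "// file contents"}
examples = build_examples(reports, sources)
print(len(examples))
```

Using the short title as the summary instead of the full 1k-character description keeps the target small, which also eases the context window problem.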

Just thinking out loud here. How would you do it? Am I missing something obvious?