• 1 Post
  • 13 Comments
Joined 1 year ago
Cake day: November 24th, 2023


    1. It’s not bad at all! I just wanted to see the full model. The approach applies to quantized models too; I simply wanted the most extreme example in terms of model and context size. It only gets better from there! Light quantization plus speculative decoding gets you close to real time.

    2. A quantized model would run significantly faster, although I haven’t measured it extensively yet. You avoid most of the data transfer, and the layers also take far less memory and run much faster themselves.

    3. The model is definitely not the best, but what was important for me was to test something close to GPT-3.5 in size, so I now have a blueprint for running newer open-source models of similar sizes.
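The speculative decoding mentioned in point 1 can be sketched as follows. This is a toy illustration, not the commenter’s actual setup: the “draft” and “target” models here are hypothetical deterministic functions over a tiny integer vocabulary, standing in for a small fast LLM and the large slow one.

```python
# Toy sketch of speculative decoding. A cheap draft model proposes k tokens;
# the expensive target model verifies them and we keep the longest agreed
# prefix plus the target's own token at the first disagreement, so every
# step emits at least one token while amortizing the target's cost.

def draft_next(ctx):
    # hypothetical cheap draft model: just increments the last token mod 10
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # hypothetical expensive target model: agrees with the draft except after 7
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) draft k tokens cheaply
    draft, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        draft.append(t)
        tmp.append(t)
    # 2) verify with the target model, accepting the agreed prefix
    accepted, verify_ctx = [], list(ctx)
    for t in draft:
        expected = target_next(verify_ctx)
        if t == expected:
            accepted.append(t)
            verify_ctx.append(t)
        else:
            accepted.append(expected)  # target's correction at first mismatch
            break
    return ctx + accepted

seq = [1]
for _ in range(3):
    seq = speculative_step(seq)
print(seq)
```

When draft and target mostly agree, each target verification pass yields several tokens instead of one, which is where the speed-up comes from.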

  • Thanks for sharing, that’s very useful! What GPUs and how many are you using, just to make sure I understand correctly?

    EDIT: What CPU are you using? Because 90 s/t is pretty impressive, to be honest.

    The layer method basically uses the time when the node is idle, so it works well with large context sizes or if you have many GPUs (so you can load a small number of layers on the GPU and reload them super fast).
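The idle-time trick above amounts to overlapping weight transfers with compute. Here is a minimal sketch of that idea under toy assumptions: `load_layer` and `run_layer` are hypothetical stand-ins (sleeps) for a PCIe transfer and a GPU forward pass, and a background thread prefetches the next layer’s weights while the current one runs.

```python
import threading
import time

NUM_LAYERS = 4

def load_layer(i):
    """Stand-in for copying one layer's weights from host RAM over PCIe."""
    time.sleep(0.01)
    return f"weights_{i}"

def run_layer(weights, x):
    """Stand-in for running one transformer layer on the GPU."""
    time.sleep(0.01)
    return x + 1

def prefetch(i, holder):
    holder["w"] = load_layer(i)

def forward(x):
    current = load_layer(0)
    for i in range(NUM_LAYERS):
        holder, t = {}, None
        if i + 1 < NUM_LAYERS:
            # overlap: start the next layer's transfer while this one computes
            t = threading.Thread(target=prefetch, args=(i + 1, holder))
            t.start()
        x = run_layer(current, x)
        if t is not None:
            t.join()
            current = holder["w"]
    return x

print(forward(0))
```

If transfer and compute take similar time, the transfer cost is almost fully hidden; a real implementation would use pinned memory and asynchronous copy streams rather than threads.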


  • The intended long-term use case is extracting data from documents. One document is typically around 1,500 tokens. Since I know the output should be contained in the original document, I restrict the output to predefined choices taken from the document, and a single pass gives me the choice with the highest probability.

    This way I don’t expose my data, and it is actually faster than the OpenAI API, because there I cannot restrict the output to just a few tokens and it goes on to write irrelevant stuff. Moreover, the data is very sensitive, and I obviously cannot just send it to an external service.

    With this fully local approach, at a one-time cost of less than 10k USD, I can process about 100k documents per month, which is good enough for now. And because it’s a one-time cost, it’s far cheaper than the OpenAI API in the long run; it pays for itself in just 2-3 months.
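    Restricting output to predefined choices boils down to scoring each candidate by its log-probability under the model and taking the argmax, rather than free-form generation. A minimal sketch, with a hypothetical `token_logprob` standing in for a real language-model forward pass:

    ```python
    import math

    # Toy sketch of choice-constrained extraction: instead of letting the
    # model generate freely, score each predefined candidate answer by the
    # sum of its per-token log-probabilities and return the best one.

    def token_logprob(context, token):
        # hypothetical stand-in for an LM call; favors tokens that actually
        # appear in the source document
        document = "invoice 2023 total 420"
        return math.log(0.9) if token in document.split() else math.log(0.01)

    def score(choice):
        tokens = choice.split()
        return sum(token_logprob(tokens[:i], t) for i, t in enumerate(tokens))

    choices = ["total 420", "total 9000"]
    best = max(choices, key=score)
    print(best)  # → "total 420"
    ```

    Because every candidate is scored in a single forward pass over a short span, this is both faster and more predictable than sampling an open-ended completion.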