Ever since the Medusa models were released, I’ve been wondering whether speculative sampling can run effectively on CPU only. Modern GPUs already provide fast t/s, so the potential speedup is most interesting on low-bandwidth GPUs, SoCs, and CPUs.
That in turn depends on batched decoding working correctly, so I ran tests with the largest available model running directly from storage via mmap (the model does not need to fit in RAM), and also with a 13B.
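For context, here is a minimal sketch of why speculative sampling leans on batched decoding: the draft tokens are verified by the target model in a single batched forward pass, so the cheaper that extra batch is, the bigger the win. This is my own greedy-acceptance illustration with hypothetical `draft_model` / `target_model` callables, not llama.cpp's actual API.

```python
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_model: Callable[[List[int]], int],                    # hypothetical: returns next token id
    target_model: Callable[[List[int], List[int]], List[int]],  # hypothetical: batched scoring
    k: int = 4,
) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1) Draft k tokens cheaply, one at a time, with the small model.
    ctx = list(prompt)
    draft = []
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: the target model scores all k drafted positions in ONE
    #    batched forward pass and returns its greedy pick at each position.
    target_picks = target_model(prompt, draft)

    # 3) Accept the longest agreeing prefix; at the first mismatch keep the
    #    target's token instead, so every round yields at least one new token.
    accepted = []
    for d, t in zip(draft, target_picks):
        accepted.append(d if d == t else t)
        if d != t:
            break
    return accepted

# Toy usage: a "draft" that always guesses 0 and a "target" that agrees twice.
toy_draft = lambda ctx: 0
toy_target = lambda prompt, draft: [0, 0, 7, 9]
print(speculative_step([1, 2, 3], toy_draft, toy_target, k=4))  # -> [0, 0, 7]
```

If verifying a small batch of positions costs barely more wall time than decoding a single token, each accepted draft token is nearly free, which is exactly what the tables below try to measure.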
Falcon 180B Q4_K_S (mmap inference)
./batched Falcon-180B-Q4_K_S.gguf "my best" 8
batch size | tg (aggregate t/s) | total |
---|---|---|
1 | 0.05 t/s | decoded 5 tokens in 110.76s |
2 | 0.09 t/s | decoded 10 tokens in 117.22s |
4 | 0.17 t/s | decoded 20 tokens in 114.95s |
8 | 0.31 t/s | decoded 40 tokens in 117.94s |
16 | 0.64 t/s | decoded 80 tokens in 124.36s |
32 | 0.99 t/s | decoded 160 tokens in 161.40s |
64 | 1.33 t/s | decoded 320 tokens in 240.06s |
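Reading the table: tg is aggregate throughput across all sequences in the batch (roughly tokens decoded divided by wall time), and the wall time barely moves between batch 1 and batch 8. A quick check of the arithmetic, for illustration:

```python
# Recompute aggregate t/s for the Q4_K_S runs: tokens decoded / wall time.
rows = [(1, 5, 110.76), (2, 10, 117.22), (4, 20, 114.95), (8, 40, 117.94),
        (16, 80, 124.36), (32, 160, 161.40), (64, 320, 240.06)]
for batch, tokens, seconds in rows:
    print(f"batch {batch:2d}: {tokens:3d} tokens / {seconds:6.2f}s = {tokens/seconds:.2f} t/s")
# Batch 8 decodes 8x the tokens of batch 1 in roughly the same ~111-118s,
# i.e. the extra sequences are almost free at small batch sizes.
```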
Falcon 180B f16 (mmap inference)
./batched ggml-model-f16.gguf "my best" 8
batch size | tg (aggregate t/s) | total |
---|---|---|
1 | 0.01 t/s | decoded 5 tokens in 457.86s |
2 | 0.02 t/s | decoded 10 tokens in 452.00s |
16 | 0.17 t/s | decoded 160 tokens in 474.16s |
13B Q4_K_M (standard inference)
./batched llama-2-13B.gguf "my best" 120
batch size | tg (aggregate t/s) |
---|---|
1 | 5.4 |
2 | 10.5 |
3 | 14.7 |
4 | 18.1 |
5 | 20.3 |
6 | 22.8 |
8 | 24.7 |
10 | 26.6 |
16 | 25.9 |
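Relative to the single-sequence baseline of 5.4 t/s, that works out to roughly the following scaling (my own quick ratio check on the numbers above):

```python
# Aggregate throughput of the 13B run relative to batch size 1 (5.4 t/s).
baseline = 5.4
for batch, tps in [(2, 10.5), (4, 18.1), (8, 24.7), (10, 26.6), (16, 25.9)]:
    print(f"batch {batch:2d}: {tps / baseline:.1f}x the single-stream throughput")
# Almost 2x at batch 2, ~3.4x at batch 4, peaking near 4.9x at batch 10.
```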
So these results show double, triple, and ultimately much higher aggregate t/s as the batch size grows. I also timed the runs in real life to be sure the reported numbers are accurate.
Since exl2 already provides verifiable speculative-decoding gains consistent with the literature (2-3x) on most 70B models, and batched CPU inference scales much like it would on a GPU, speculative CPU inference in llama.cpp should in principle be able to reach similar 2-3x speeds, even though the current experience is that it is slower.
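For reference on where 2-3x comes from: under the standard speculative-decoding analysis (Leviathan et al., 2023), the expected number of tokens produced per target-model pass is (1 - α^(γ+1)) / (1 - α), where α is the draft acceptance rate and γ the number of drafted tokens. A quick sketch plugging in plausible values (my own illustration, not llama.cpp code; the real speedup is lower once the draft model's own cost and the batched-verification overhead measured above are factored in):

```python
# Expected tokens per target-model pass (Leviathan et al. 2023):
# (1 - alpha**(gamma + 1)) / (1 - alpha), with alpha = draft acceptance rate
# and gamma = tokens drafted per pass. Draft-model cost is ignored here.

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    for gamma in (4, 8):
        print(f"alpha={alpha:.1f} gamma={gamma}: "
              f"{expected_tokens_per_pass(alpha, gamma):.2f} tokens/pass")
# Acceptance rates of 0.6-0.8 with 4-8 drafted tokens already land in the
# 2-4x range, consistent with the 2-3x reported for exl2 and in the literature.
```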