Ever since the Medusa models were released, I’ve been wondering whether speculative sampling can run effectively on CPU only. Modern GPUs already provide fast t/s, so the potential speedup is most interesting on low-bandwidth GPUs, SoCs, and CPUs.
That in turn depends on batched decoding working correctly, so I ran tests with the largest available model running directly from storage via mmap (the model does not need to fit in RAM), and also with a 13B.
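For context, here is a minimal sketch of why speculative sampling leans on batched decoding: the draft tokens are verified by the target model in a single batched forward pass, so the cheaper that extra batch is, the bigger the win. This is my own greedy-acceptance illustration with hypothetical `draft_model` / `target_model` callables, not llama.cpp's actual API.

```python
from typing import Callable, List

def speculative_step(
    prompt: List[int],
    draft_model: Callable[[List[int]], int],                    # hypothetical: returns next token id
    target_model: Callable[[List[int], List[int]], List[int]],  # hypothetical: batched scoring
    k: int = 4,
) -> List[int]:
    """One draft-then-verify round; returns the tokens accepted this round."""
    # 1) Draft k tokens cheaply, one at a time, with the small model.
    ctx = list(prompt)
    draft = []
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) Verify: the target model scores all k drafted positions in ONE
    #    batched forward pass and returns its greedy pick at each position.
    target_picks = target_model(prompt, draft)

    # 3) Accept the longest agreeing prefix; at the first mismatch keep the
    #    target's token instead, so every round yields at least one new token.
    accepted = []
    for d, t in zip(draft, target_picks):
        accepted.append(d if d == t else t)
        if d != t:
            break
    return accepted

# Toy usage: a "draft" that always guesses 0 and a "target" that agrees twice.
toy_draft = lambda ctx: 0
toy_target = lambda prompt, draft: [0, 0, 7, 9]
print(speculative_step([1, 2, 3], toy_draft, toy_target, k=4))  # -> [0, 0, 7]
```

If verifying a small batch of positions costs barely more wall time than decoding a single token, each accepted draft token is nearly free, which is exactly what the tables below try to measure.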
Falcon 180B Q4_K_S (mmap inference)
./batched Falcon-180B-Q4_K_S.gguf "my best" 8
batch size | tg (aggregate t/s) | total |
---|---|---|
1 | 0.05 t/s | decoded 5 tokens in 110.76s |
2 | 0.09 t/s | decoded 10 tokens in 117.22s |
4 | 0.17 t/s | decoded 20 tokens in 114.95s |
8 | 0.31 t/s | decoded 40 tokens in 117.94s |
16 | 0.64 t/s | decoded 80 tokens in 124.36s |
32 | 0.99 t/s | decoded 160 tokens in 161.40s |
64 | 1.33 t/s | decoded 320 tokens in 240.06s |
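Reading the table: tg is aggregate throughput across all sequences in the batch (roughly tokens decoded divided by wall time), and the wall time barely moves between batch 1 and batch 8. A quick check of the arithmetic, for illustration:

```python
# Recompute aggregate t/s for the Q4_K_S runs: tokens decoded / wall time.
rows = [(1, 5, 110.76), (2, 10, 117.22), (4, 20, 114.95), (8, 40, 117.94),
        (16, 80, 124.36), (32, 160, 161.40), (64, 320, 240.06)]
for batch, tokens, seconds in rows:
    print(f"batch {batch:2d}: {tokens:3d} tokens / {seconds:6.2f}s = {tokens/seconds:.2f} t/s")
# Batch 8 decodes 8x the tokens of batch 1 in roughly the same ~111-118s,
# i.e. the extra sequences are almost free at small batch sizes.
```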
Falcon 180B f16 (mmap inference)
./batched ggml-model-f16.gguf "my best" 8
batch size | tg (aggregate t/s) | total |
---|---|---|
1 | 0.01 t/s | decoded 5 tokens in 457.86s |
2 | 0.02 t/s | decoded 10 tokens in 452.00s |
16 | 0.17 t/s | decoded 160 tokens in 474.16s |
13B Q4_K_M (standard inference)
./batched llama-2-13B.gguf "my best" 120
batch size | tg (aggregate t/s) |
---|---|
1 | 5.4 |
2 | 10.5 |
3 | 14.7 |
4 | 18.1 |
5 | 20.3 |
6 | 22.8 |
8 | 24.7 |
10 | 26.6 |
16 | 25.9 |
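Relative to the single-sequence baseline of 5.4 t/s, that works out to roughly the following scaling (my own quick ratio check on the numbers above):

```python
# Aggregate throughput of the 13B run relative to batch size 1 (5.4 t/s).
baseline = 5.4
for batch, tps in [(2, 10.5), (4, 18.1), (8, 24.7), (10, 26.6), (16, 25.9)]:
    print(f"batch {batch:2d}: {tps / baseline:.1f}x the single-stream throughput")
# Almost 2x at batch 2, ~3.4x at batch 4, peaking near 4.9x at batch 10.
```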
So these results show double, triple, and ultimately much higher aggregate t/s as the batch size grows. I also timed the runs in real life to be sure the reported numbers are accurate.
Since exl2 already provides verifiable speculative-decoding gains consistent with the literature (2-3x) on most 70B models, and batched CPU inference scales much like it would on a GPU, speculative CPU inference in llama.cpp should in principle be able to reach similar 2-3x speeds, even though the current experience is that it is slower.
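For reference on where 2-3x comes from: under the standard speculative-decoding analysis (Leviathan et al., 2023), the expected number of tokens produced per target-model pass is (1 - α^(γ+1)) / (1 - α), where α is the draft acceptance rate and γ the number of drafted tokens. A quick sketch plugging in plausible values (my own illustration, not llama.cpp code; the real speedup is lower once the draft model's own cost and the batched-verification overhead measured above are factored in):

```python
# Expected tokens per target-model pass (Leviathan et al. 2023):
# (1 - alpha**(gamma + 1)) / (1 - alpha), with alpha = draft acceptance rate
# and gamma = tokens drafted per pass. Draft-model cost is ignored here.

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8):
    for gamma in (4, 8):
        print(f"alpha={alpha:.1f} gamma={gamma}: "
              f"{expected_tokens_per_pass(alpha, gamma):.2f} tokens/pass")
# Acceptance rates of 0.6-0.8 with 4-8 drafted tokens already land in the
# 2-4x range, consistent with the 2-3x reported for exl2 and in the literature.
```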