Has anyone tried combining a server with a moderately powerful GPU and a server with a lot of RAM to run inference? Especially with llama.cpp, where you can offload just some of the layers to the GPU?
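To be clear about what I mean by partial offload, here's a minimal sketch using the llama-cpp-python bindings (the model path and layer count are just placeholders, not a specific recommendation):

```python
# Minimal sketch of partial GPU offload via llama-cpp-python.
# model_path and n_gpu_layers are placeholders; adjust for your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,  # offload 20 of the model's layers to the GPU; the rest stay in system RAM
    n_ctx=4096,       # context window
)

out = llm("Explain partial GPU offload in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The question is whether that kind of split can work across two machines rather than within one box.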