Has anyone tried combining a server that has a moderately powerful GPU with a server that has a lot of RAM to run inference? Especially with llama.cpp, where you can offload just some of the layers to the GPU?
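For reference, the partial offload I mean looks roughly like this on a single machine with llama-cpp-python (the model path and layer count below are just placeholders, not a tested config):

```python
# Rough sketch of partial GPU offload via llama-cpp-python (single machine).
# Model path and n_gpu_layers are placeholders; tune them to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-70b.Q4_K_M.gguf",  # placeholder model file
    n_gpu_layers=20,  # offload only some layers to the GPU; the rest stay in system RAM
    n_ctx=4096,
)

out = llm("Why does partial GPU offload help with large models?", max_tokens=64)
print(out["choices"][0]["text"])
```

What I'm asking about is whether anyone has extended that idea across two boxes, with the GPU layers on one machine and the RAM-resident layers on another.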