Hi, when I run AI models I notice that the amount of RAM actually used is far less than claimed, yet performance differs greatly depending on how much RAM the machine has. My machines have a limited number of RAM slots, so based on this behaviour I'm wondering: are the models being cached into RAM rather than fully loaded?

For instance, to run Llama 70B I have to rent an expensive AWS EC2 instance, but the responses differ greatly: with the 13B model I don't get an answer, it just echoes my question back, whereas 70B answers it. I would still like to be able to run 70B, and I am using a similar chip architecture with avx_vnni. If there isn't enough RAM on one machine, would it be possible to create a RAM drive split across multiple machines connected with 10 Gb/s NICs? I already have SFP+ NICs and SFP+ ports in my switch.
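
My guess is that the runtime memory-maps the weights file and relies on the OS page cache, which would explain why reported RAM usage stays low while having more RAM still helps a lot. Here is a minimal Python sketch of that effect (the file path is just a placeholder, not my real setup): mapping a large file uses almost no resident memory until the pages are actually read.

```python
import mmap
import os
import resource

MODEL_PATH = "model.bin"  # placeholder: any large weights file

with open(MODEL_PATH, "rb") as f:
    size = os.fstat(f.fileno()).st_size

    # Map the whole file read-only; this reserves address space but pulls in
    # essentially no physical RAM yet.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # ru_maxrss is reported in KiB on Linux.
    print("mapped bytes:", size)
    print("max RSS before reading (KiB):",
          resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

    # Read one byte per page across the first ~1 GiB: the kernel faults those
    # pages into the page cache, so resident memory grows accordingly.
    touched = 0
    for offset in range(0, min(size, 1 << 30), mmap.PAGESIZE):
        touched ^= mm[offset]

    print("max RSS after reading (KiB):",
          resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    mm.close()
```

If that is what is happening, then my RAM-drive idea really comes down to whether pulling pages over a 10 Gb/s link would be fast enough to keep up with inference.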

Are there ways to speed up running larger models on machines with less memory, without quantising them down to lower accuracy?