multiverse_fan to LocalLLaMA@poweruser.forum • Question about GGUF, gpu offload and performance • 2 years ago

I have an older 6GB 1660 and get about 0.3 t/s on a q2 quant of Goliath 120B. I guess I'm just thinking that, comparatively, your setup with a 20B model should be faster than that, but I'm sure I'm missing something. With offloading, the CPU plays a role as well. How many cores ya got?

If I had the money, I'd go with the CPU.

Also, I'm not sure a 4090 could run 33B models at full precision. Wouldn't that require something like 70GB of VRAM?
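As a rough sanity check on those numbers, here's a minimal back-of-the-envelope sketch in plain Python. The bits-per-weight figures for the quant formats are approximations, and the 1.2x overhead factor (for KV cache, activations, and runtime buffers) is an assumption on my part, not a measured value:

```python
# Back-of-the-envelope memory estimate: parameters * bytes per parameter.
# Real usage is higher (KV cache, activations, runtime buffers), so the
# 1.2x overhead below is an assumed fudge factor, not a measured value.

BITS_PER_PARAM = {
    "fp32": 32.0,
    "fp16": 16.0,
    "q8_0": 8.5,  # GGUF quants store per-block scales, so a bit over N bits
    "q4_0": 4.5,
    "q2_k": 2.6,
}

def approx_gb(n_params_billion: float, fmt: str, overhead: float = 1.2) -> float:
    """Approximate GB needed to hold the weights in the given format."""
    weight_bytes = n_params_billion * 1e9 * BITS_PER_PARAM[fmt] / 8
    return weight_bytes * overhead / 1e9

for size_b, fmt in [(33, "fp16"), (120, "q2_k"), (20, "q4_0")]:
    print(f"{size_b}B @ {fmt}: ~{approx_gb(size_b, fmt):.0f} GB")

# 33B @ fp16 comes out around 79 GB (~66 GB for the weights alone), which is
# why full precision won't fit on a single 24 GB 4090. A q2 120B is ~47 GB,
# so with 6 GB of VRAM most of it sits in system RAM, hence the low t/s.
```

On the offload side, llama.cpp splits work by how many layers you put on the GPU (`-ngl` / `--gpu-layers`) and how many CPU threads handle the rest (`-t` / `--threads`), so core count matters for whatever doesn't fit in VRAM.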
multiverse_fan to LocalLLaMA@poweruser.forum • Goliath-120B - quants and future plans • 2 years ago

Goliath was created by merging layers of Xwin and Euryale. From their model card:
The layer ranges used are as follows:

- range 0, 16: Xwin
- range 8, 24: Euryale
- range 17, 32: Xwin
- range 25, 40: Euryale
- range 33, 48: Xwin
- range 41, 56: Euryale
- range 49, 64: Xwin
- range 57, 72: Euryale
- range 65, 80: Xwin
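If it helps to see the interleaving, here's a small sketch in plain Python that expands those ranges into a flat layer map. The ranges are the ones from the model card above; I'm assuming they're half-open, [start, end), which is how mergekit-style passthrough configs read them:

```python
# Expand the Goliath-120B passthrough ranges into a flat list showing
# which source model each output layer is copied from.
# Assumes half-open ranges [start, end), as in mergekit's layer_range.

RANGES = [
    (0, 16, "Xwin"), (8, 24, "Euryale"), (17, 32, "Xwin"),
    (25, 40, "Euryale"), (33, 48, "Xwin"), (41, 56, "Euryale"),
    (49, 64, "Xwin"), (57, 72, "Euryale"), (65, 80, "Xwin"),
]

layer_map = [
    (src, layer)
    for start, end, src in RANGES
    for layer in range(start, end)
]

print(len(layer_map))    # 137 output layers, vs 80 in each 70B donor
print(layer_map[:3])     # [('Xwin', 0), ('Xwin', 1), ('Xwin', 2)]
print(layer_map[16:18])  # seam: jumps back to Euryale layers 8 and 9
```

So nothing is removed; overlapping ranges from the two 70B donors stack up to 137 layers, and 137/80 of 70B is roughly 120B parameters.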
I'm not sure how the model could be reduced to 70B unless it's through removing layers. Is that what "shearing" is? I don't understand what's being pruned in that; is it layers?
multiverse_fan to LocalLLaMA@poweruser.forum • For roleplay purposes, Goliath-120b is absolutely thrilling me • 2 years ago

Cool, sounds like a good model to download and keep for the future, when I can get access to better hardware.
TheBloke/MonadGPT-GGUF