GitHub - S-LoRA/S-LoRA: S-LoRA: Serving Thousands of Concurrent LoRA Adapters

AutomataManifold · 1 year ago

I think it’s worth remembering that while the really big models take a lot of VRAM, they also quantize down to smaller sizes, so the numbers are slightly misleading.
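For a rough sense of how much quantization changes the picture, here is a back-of-the-envelope sketch. It only counts the weights; the KV cache, activations, and the per-group scale data that real quant formats store are ignored, so treat the results as lower bounds. The function name and the model sizes in the loop are just illustrative choices.

```python
def weight_memory_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits,
    with no overhead for quantization scales, KV cache, or activations.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # Illustrative model sizes (billions of parameters) and bit widths.
    for size in (13, 34, 70):
        row = ", ".join(
            f"{bits}-bit: {weight_memory_gib(size, bits):.1f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{size}B -> {row}")
```

Going from 16-bit to 4-bit cuts the weight footprint to a quarter, which is why a headline VRAM number quoted at full precision overstates what you need to run the model.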

AutomataManifold · 1 year ago

If you have a really old CPU, it will be a bottleneck, because there’s some CPU involvement at inference time. I had a 3090 on an old server CPU with lots of cores but a slow clock speed and it got about half the expected speed. (Newer inference engines like Exllama might have addressed this, but I haven’t tested.) But, I should stress, that’s a CPU from 8 years ago.

I don’t have benchmarks for current gen CPUs; I imagine that they’re similar to each other. I’d be more worried about physical space for the cards, power draw, PCI lanes, etc.

AutomataManifold · 1 year ago

I know there are several projects for finetuning llama for Chinese. I haven’t worked on them but it might be worth looking into what they did.

AutomataManifold · 1 year ago

It’s fairly easy to get it to talk to the continuation endpoint in the server for text-generation-webui or llama.cpp instead of OpenAI; actually the painful part was reformatting it to use an instruction format. Just plugging it into the chat endpoint might work better.
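As a minimal sketch of what pointing it at a local server looks like, this builds a request for the llama.cpp example server’s `/completion` endpoint. The host, port, and sampling values are assumptions (adjust them to your setup), and the helper names are made up for illustration; text-generation-webui’s API uses different paths and field names.

```python
import json
import urllib.request

# Assumed address of a locally running llama.cpp server.
DEFAULT_BASE_URL = "http://127.0.0.1:8080"


def build_completion_request(prompt: str, n_predict: int = 256,
                             base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for llama.cpp's /completion endpoint."""
    payload = {
        "prompt": prompt,        # raw text to continue
        "n_predict": n_predict,  # maximum number of tokens to generate
        "temperature": 0.8,      # illustrative sampling value
    }
    return urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))["content"]
```

Calling `complete("Once upon a time")` with a server running returns the continuation as a plain string, so swapping it in for an OpenAI completion call is mostly a matter of replacing that one function.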

Just prefixing the prompt with some random facts about a fictional world is enough to steer the generation in a way that makes the conversations mention enough stuff about your world to generate a few hundred thousand high-quality conversations with a 13B Llama model. They look like they’re pretty diverse, but obviously I haven’t had time to train anything on the generated data.
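Roughly, the prefixing step can be sketched like this. The facts, the header wording, and the function name are all made up for illustration; this is not the sodaverse code, just the general shape of the idea.

```python
import random

# Hypothetical facts about a fictional world; replace with your own setting.
WORLD_FACTS = [
    "The city of Varn floats above a salt desert.",
    "Dragons are licensed as cargo haulers by the Guild of Skies.",
    "Elves trade in memories instead of coins.",
    "Every winter the river Osk runs backwards for nine days.",
    "Iron is forbidden inside temple walls.",
]


def build_prompt(seed_text: str, facts, k: int = 3, rng=None) -> str:
    """Prefix the original generation prompt with k randomly chosen world facts."""
    rng = rng or random.Random()
    chosen = rng.sample(list(facts), k)
    header = "\n".join(f"- {fact}" for fact in chosen)
    return f"Facts about the world:\n{header}\n\n{seed_text}\n"


if __name__ == "__main__":
    print(build_prompt("Mira runs into an old friend at the market.", WORLD_FACTS))
```

Sampling a different handful of facts for each conversation is what keeps the outputs varied while still nudging them toward the setting.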

That’s probably enough for most applications. The next level is probably generating a world-specific symbolic knowledge distillation so it includes elves and dragons in the source. That looks like it requires more accuracy, but they got good enough results with GPT-3 so it’s probably feasible. A lot of applications will probably be fine with just generating custom Sodaverse data.

AutomataManifold · 1 year ago

They provide the source code for generating your own dialog datasets. Interesting.

https://github.com/skywalker023/sodaverse

AutomataManifold · 1 year ago

Early research suggested that there was an inflection point below 4 bits, where things got markedly worse. In my personal use, I find that accuracy definitely suffers below there, though maybe modern quants are a bit better at it.

34B Yi does seem like a sweet spot, though I’m starting to suspect that we need some fine-tunes that use longer stories as part of the training data, because it doesn’t seem to be able to maintain the quality for the entire length of the context. Still, being able to include callbacks to events from thousands of tokens earlier is impressively practical. I’ve been alternating between a fast 13B (for specific scenes), 34B Yi (for general writing), and 70B (for when you need it to be smart and varied). And, of course, just switching models can help with the repetition sometimes.

AutomataManifold · 1 year ago

It’s not clear if this is testing the chat model or the base model. Assuming it is the base model, it isn’t surprising: it’s just a text completion model with no extra frills. The point of the safety alignment training is that it’s part of the instruct dataset and training, not the base model.

This is what you want, even if you’re concerned about safety. You don’t want the safety to be baked into the raw completion model: if some future better way comes along to do safety training, you want to be able to use it without retraining the entire model from scratch. (And given the speed at which this stuff moves, that might be just a week from now.)

Of course, if you’re concerned about safety you shouldn’t be deploying the raw text completion model to end users. (For a whole host of reasons, not just safety.)

AutomataManifold · 1 year ago

I’ve been having trouble getting it to run with exllama2_HF in text-gen-webui. Did you run into any issues?

AutomataManifold · 1 year ago

So how do I use this on my own dataset?

AutomataManifold · 1 year ago

Just to note: don’t read too much into OpenAI’s prices. They’re deliberately losing money as a market-capturing strategy, so it’s not guaranteed that there’s a linear relationship between what they charge for a given service and what their actual costs are.

AutomataManifold · 1 year ago

That’s a good point about few-shot prompting: the big thing about GPT-3 and instruction training was that it allowed for zero-shot prompting (i.e., prompting with zero examples). But if we’re manually prompting a base model, there’s no reason not to provide those examples, and you get dramatically improved performance versus the same model with no examples.
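For a base model, the few-shot prompt is just the examples laid out in a consistent pattern with the final answer left open. A minimal sketch, where the "Input"/"Output" labels and the function name are arbitrary choices rather than any standard format:

```python
def few_shot_prompt(examples, query, input_label="Input", output_label="Output"):
    """Lay out (input, output) example pairs, then the query with its output left blank.

    A base model continues the pattern, so its completion is the answer to `query`.
    """
    blocks = [f"{input_label}: {x}\n{output_label}: {y}" for x, y in examples]
    blocks.append(f"{input_label}: {query}\n{output_label}:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    examples = [
        ("The food was cold and bland.", "negative"),
        ("Best concert I have been to in years.", "positive"),
    ]
    print(few_shot_prompt(examples, "The plot dragged but the ending was great."))
```

Passing an empty example list gives you the zero-shot version of the same prompt, which makes it easy to compare the two on the same model.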

AutomataManifold · 1 year ago

GitHub - S-LoRA/S-LoRA: S-LoRA: Serving Thousands of Concurrent LoRA Adapters

AutomataManifold · 2 years ago

Claude has two big things: very long context length and high understanding (or whatever we want to call it).

The context length is the hardest part at the moment, I think. Though understanding is hard to measure.

AutomataManifold · 2 years ago

We really could use some more LoRAs; there’s no obvious central repository right now.

I’m not sure about KoboldAI…I think koboldcpp has LoRA support, but I’m not sure about the KoboldAI interface itself.