Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)

ttkciar · 2 years ago

Docker and Kubernetes are popular mostly because the industry has broadly given up on release engineering. This means applications/services can have different and conflicting dependencies, so the only way they can run on the same physical host is by putting each in their own containers or VM instances, each with their specific dependencies.

The alternative is to have a platform with standard libraries, and to port applications to the platform, using the platform’s libraries as their dependencies, and thus avoid conflict. This requires effort and discipline, so of course it is not very popular, though it was the standard practice twenty years ago.

As far as I know the only Linux distribution which still follows the platform approach is Slackware. Applications which are ported to Slackware are guaranteed to work well together without conflicts, but not a lot of applications have been thus ported (Slackware only has about two thousand official packages, in all).

ttkciar · 2 years ago

Mostly I’m still using slightly older models, with a few slightly newer ones now:

marx-3b-v3.Q4_K_M.gguf for “fast” RAG inference,
medalpaca-13B.ggmlv3.q4_1.bin for medical research,
mistral-7b-openorca.Q4_K_M.gguf for creative writing,
NousResearch-Nous-Capybara-3B-V1.9-Q4_K_M.gguf for creative writing, and probably for giving my IRC bots conversational capabilities (a work in progress),
puddlejumper-13b-v2.Q4_K_M.gguf for physics research, questions about society and philosophy, “slow” RAG inference, and translating between English and German,
refact-1_6b-Q4_K_M.gguf as a coding copilot, for fill-in-the-middle,
rift-coder-v0-7b-gguf.git as a coding copilot when I’m writing python or trying to figure out my coworkers’ python,
scarlett-33b.ggmlv3.q4_1.bin for creative writing, though less than I used to.

I also have several models which I’ve downloaded but not yet had time to evaluate, and am downloading more as we speak (though even more slowly than usual; a couple of weeks ago my download rates from HF dropped to roughly a third, and I don’t know why).

Some which seem particularly promising:

yi-34b-200k-llamafied.Q4_K_M.gguf
rocket-3b.Q4_K_M.gguf
llmware’s “bling” and “dragon” models. I’m downloading them all, though so far there are only GGUFs available for three of them. I’m particularly intrigued at the prospect of llmware-dragon-falcon-7b-v0-gguf which is tuned specifically for RAG and is supposedly “hallucination-proofed”, and llmware-bling-stable-lm-3b-4e1t-v0-gguf which might be a better IRC-bot conversational model.

Of all of these, the one I use most frequently is PuddleJumper-13B-v2.

ttkciar · 2 years ago

If you are using llama.cpp, you might want to give it a grammar which forces ASCII output.
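As a rough illustration of what such a grammar could look like: llama.cpp grammars are written in GBNF and can be passed with its --grammar or --grammar-file options. The sketch below assumes "ASCII output" means tab, newline, carriage return, and printable ASCII; the helper names are made up for illustration, and the Python regex is only a local sanity check that mirrors the same character class.

```python
import re

# GBNF grammar restricting output to tab, LF, CR, and printable ASCII.
# Assumption: this character set is what "ASCII output" should mean here.
ASCII_GBNF = r'root ::= [\x09\x0A\x0D\x20-\x7E]*' + "\n"

# The same character class as a Python regex, used only to check text locally.
ASCII_RE = re.compile(r'[\x09\x0A\x0D\x20-\x7E]*')


def is_ascii_output(text: str) -> bool:
    """Return True if every character of text is allowed by the grammar above."""
    return ASCII_RE.fullmatch(text) is not None


def write_grammar(path: str) -> None:
    """Write the grammar to a file that could be given to --grammar-file."""
    with open(path, "w", encoding="ascii") as fh:
        fh.write(ASCII_GBNF)
```

Widening or narrowing the character class (for example, dropping carriage return) is a one-line change to the grammar string.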

ttkciar · 2 years ago

This is a fine-tune of Goliath-120B.

Didn’t the author hypothesize that Goliath’s interleaving of rows would degrade inference quality until it had been fine-tuned?

It will be interesting to see if this fine-tune supports that hypothesis.

Waiting for GGUF.

ttkciar · 2 years ago

Sure! I’ve been doing a few LLM’ing things in Perl:

A previous project, implemented in Perl, indexes a local wikipedia dump in Lucy and allows searching for pages. I’ve been reusing that project for RAG inference.
My “infer” utility is written in Perl. It wraps llama.cpp’s “main” utility with IPC::Open3 and I’m using it for inference, for RAG, for stop-words, for matching prompt templates to models, and for summarization. It’s gloriously broken at the moment and in dire need of a refactor.
I recently started writing a “copilot” utility in Perl, to better streamline using inference for research and code-writing copilots. It also wraps llama.cpp’s “main”, but in a much simpler way than “infer” (blocking I/O, no stop words, not trying to detect when the LLM infers the prompt text, etc).

If you’re more interested in using the existing Python libraries and not wrapping llama.cpp, you should take a look at the Inline::Python module. I’ve only dabbled with LangChain, but if/when I get back to it, I will probably implement Perl bindings with a simple Inline::Python wrapper. It makes it pretty easy.

If you do decide to wrap llama.cpp, you might be more comfortable with IPC::Run rather than IPC::Open3. It’s considered the more modern module. I’m just using IPC::Open3 out of familiarity.
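For readers who would rather see the wrapping idea outside of Perl, here is a minimal Python sketch of the simpler, blocking style of wrapper. It is not the "infer" or "copilot" code; the function names are hypothetical, and the binary and model paths are placeholders. The -m, -p, and -n options are llama.cpp's model, prompt, and token-count options.

```python
import subprocess
import sys


def run_command(argv, timeout=None):
    """Run argv with blocking I/O, returning (exit_code, stdout_text, stderr_text)."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,   # inferred text arrives here
        stderr=subprocess.PIPE,   # kept separate, like IPC::Open3's error handle
        text=True,
    )
    out, err = proc.communicate(timeout=timeout)
    return proc.returncode, out, err


def llama_argv(binary, model_path, prompt, n_predict=256):
    """Build an argument list for llama.cpp's "main" (paths are placeholders)."""
    return [binary, "-m", model_path, "-p", prompt, "-n", str(n_predict)]
```

Usage would look something like `run_command(llama_argv("./main", "model.gguf", "Hello"))`. Passing an argument list rather than a shell string avoids quoting problems with prompts.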

ttkciar · 2 years ago

I’m mostly using Bash and Perl.

ttkciar · 2 years ago

My work-from-home workstation always has a VM or two running the test/dev environment for the tasks I’m working on at work. They are VBox instances provisioned/managed by Vagrant.

They are CentOS7 instances, each running a test database, usually a text editor, “tail -F” monitoring log output, and various daemons/services specific to my workplace’s internal infrastructure. The host system is running Slackware 15.0.

ttkciar · 2 years ago

Yep, I came here to suggest this. Marx-3B-v3 seems to be an improvement over the original, too.

Also, I have only used it a little, but NousResearch-Nous-Capybara-3B-v1.9 has been very good for me, so far.

ttkciar · 2 years ago

Thanks! 🙏

Quite welcome :-)

Where do you get your information?

A few places:

Intel ARK for full CPU specs
CPUBenchmark for estimating perf and perf/watt
eBay for figuring out modern prices – E5-2660v3 can be had for about $15, E5-2698v4 for about $110
Also eBay for figuring out what servers are available for cheap with a particular socket interface
This very subreddit of course :-) y’all rock

ttkciar · 2 years ago

Older Xeon systems (v3, v4) give you oodles of cores, main memory channels, and PCIe lanes. Single-threaded performance isn’t great, but for multi-threaded workloads they’re great value for the money and power.

Compare those threadripper systems to R730, T7910, and T7810, with E5-2680 and E5-2690 processors, and see which makes sense to you and your use-case.

ttkciar · 2 years ago

I keep thinking of spinning up a website to track/provide this information, but then think “nah, someone else will do it” and focus on other things.

Usually TheBloke will have a prompt template in the model card, but other parameters need fiddling to figure out.

When there is no original model card at all, I just skip over the model, and sometimes leave a note for the publishers that they should fill out their model card.

ttkciar · 2 years ago

It is only 1.3B :-) I have noticed that smaller models work a lot better with longer, more detailed prompts (at least 440 characters, better with twice that many).

ttkciar · 2 years ago

When I look at the leaderboard, I mostly pay attention to TruthfulQA, as it seems most predictive of models which are good for my use-case. YMMV of course.

Once I’ve downloaded a model, I’ll fiddle around with different llama.cpp parameters and prompt templates, figuring out what works best for it, and then send it through my test framework, which has it infer five times on each of several prompts.
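The core loop of a framework like that is small. The following is only a sketch of the idea (several named prompts, five inferences each), not the actual framework; `run_tests` and the `infer` callable are hypothetical names, and `infer` stands in for whatever actually invokes the model.

```python
from typing import Callable, Dict, List


def run_tests(
    prompts: Dict[str, str],
    infer: Callable[[str], str],
    iterations: int = 5,
) -> Dict[str, List[str]]:
    """Infer each named prompt `iterations` times and collect the raw replies."""
    results: Dict[str, List[str]] = {}
    for name, prompt in prompts.items():
        results[name] = [infer(prompt) for _ in range(iterations)]
    return results
```

Keeping the raw replies per test name makes it easy to dump them to a text file for later side-by-side reading.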

Evaluation of test results is fairly subjective, but there are some obvious problems which recur, like not inferring an answer, or the model inferring a new user prompt for itself to answer.
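Those two recurring problems are mechanical enough to flag automatically. A minimal sketch, assuming the user-turn markers are known for the prompt template in use; the marker strings below are placeholders, not taken from any particular model card.

```python
def classify_reply(reply, user_markers=("### User:", "USER:")):
    """Label a reply as 'no_reply', 'inferred_user_prompt', or 'ok'.

    user_markers are hypothetical; use whatever begins a user turn
    in the prompt template of the model being tested.
    """
    text = reply.strip()
    if not text:
        return "no_reply"
    if any(marker in text for marker in user_markers):
        return "inferred_user_prompt"
    return "ok"
```

Anything labelled "ok" still needs a human read; this only catches the obvious failures.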

I just finished a compare-and-contrast of Marx-3B vs Marx-3B-v3 using that test framework, which you can see (along with raw test results) here: https://old.reddit.com/r/LocalLLaMA/comments/17xsliz/marx_3b_v3_and_akins_3b_gguf_quantizations/ka2fd19/

I’ve been meaning to add some simple assessment logic to my test framework, which tries to guess at the quality of inferred replies, but haven’t made it a priority.

ttkciar · 2 years ago

I tested Marx-3B-v3 on my laptop, using llama.cpp (commit dfc7cd48b1cc31d759c093e917a18c0efe03d0e8) and my usual test framework, which prompts a series of one-shots, inferring each prompt five times.

These tests are designed to cover a variety of use-cases, and models are not expected to do equally well on all use-cases. Also, they were written with larger models in mind (30B, 70B) and Marx-3B is much, much smaller than these, so we should not expect too much.

Marx-3B-v3 is prone to infer new user prompts, a problem I run into with some models. I’m not sure if the problem is intrinsic to the model, particular to the GGUF, or something in the llama.cpp params, but I haven’t figured out a good way to avoid them except to specify stop-words which abort inference (which my test framework does not yet support).
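The stop-word approach amounts to watching the output as it streams and cutting it at the first stop-word. A minimal sketch of just the text-handling part, with hypothetical names; a real wrapper would also terminate the inference process once a stop-word is seen.

```python
def take_until_stop(chunks, stop_words):
    """Accumulate streamed chunks; return (text, stopped).

    The text is cut at the earliest stop-word found. The whole buffer is
    searched after each chunk, so a stop-word split across two chunks is
    still detected.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        hits = [i for i in (text.find(s) for s in stop_words) if i >= 0]
        if hits:
            return text[:min(hits)], True
    return text, False
```

For a model that tends to infer new user prompts, the stop-words would be whatever string begins a user turn in its prompt template.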

This critique compares the results of the original Marx-3B with those of Marx-3B-v3, ignoring the extraneous user prompts inferred by Marx-3B-v3.

The raw test results are here:

http://ciar.org/h/test.1696148998.marx.txt

http://ciar.org/h/test.1700499482.marx3.txt

Test “creativity:arzoth”:

Creative writing, describing AD&D fantasy setting.

The original Marx-3B tended to repeat parts of the prompt back to the user, and provided little original content of its own. What original content it did infer was not very imaginative.

The Marx-3B-v3 model is much better at providing original content, and almost never repeats part of the prompt. It is prone to the occasional non sequitur, and isn’t as eloquent as some larger models, but overall it does all right, and does a much better job than the original Marx-3B.

Test “creativity:song_kmfdm”:

“Write a dark song in the style of KMFDM”.

The original Marx-3B failed to generate any content at all in two out of five test iterations. When it did infer replies, it did not adhere to KMFDM’s style, and its lyrics were not eloquent, nor did they scan well, nor rhyme much.

The Marx-3B-v3 model only failed to generate content in one iteration. Its reply in another iteration was a suggestion to listen to a Front Line Assembly song. I do enjoy Front Line Assembly, but this wasn’t what was asked of it! :-) In another iteration it described its approach to writing the music, which was actually pretty cool but it offered no lyrics.

In the two iterations where it did venture song lyrics, they were not very eloquent, but did scan better than Marx-3B’s lyrics, and were recognizably in KMFDM’s style. Overall an improvement over the original model.

Test “creativity:song_som”:

“Write a dark song in the style of Sisters of Mercy”.

Marx-3B inferred lyrics which were kind of generic, did not scan well, did not rhyme, and were only vaguely in the style of Sisters of Mercy.

Marx-3B-v3 failed to infer any lyrics in one iteration. In the other iterations its lyrics were somewhat more eloquent than Marx-3B’s, and scanned slightly better, but were still only vaguely in the style of Sisters of Mercy.

Test “creativity:song_halestorm”:

“Write a dark song in the style of Halestorm”.

Marx-3B inferred generic lyrics which did not scan well, did not rhyme, and did not resemble Halestorm’s style.

Marx-3B-v3 inferred no content for one iteration, and inferred step-by-step how-tos for writing songs for two iterations. When it did infer song lyrics, they were somewhat eloquent, but did not rhyme and did not particularly resemble Halestorm’s style.

Something I found interesting was that in one iteration where it inferred a step-by-step how-to, it accurately described Halestorm’s style (“heavy metal sound and edgy lyrics”), so it clearly had some exposure to Halestorm in its training data, but was not able to use that knowledge to replicate its style.

Test “humor:noisy_oyster”:

First half of a classic joke, posing a nonsensical question with alliteration.

Marx-3B failed to infer any response in any of the test’s iterations.

Marx-3B-v3 failed to infer any response in four iterations, but managed to infer a witty, humorous response in one iteration.

Test “math:yarn_units”:

Poses an imprecise physical units conversion problem.

Marx-3B failed to infer any reply at all in any iteration.

Marx-3B-v3 did not infer replies in two iterations. In others it talked about some of the relevant factors in calculating an answer, but when it attempted math it was outrageously wrong (which is typical of most models, to be fair).

Test “analysis:lucifer”:

Compare and contrast similar mythologies from different cultures and eras.

Marx-3B fails to respond in two iterations. In others it makes relevant observations, but is prone to hallucination. It contrasts differences between the myths in one iteration.

Marx-3B-v3 failed to respond in four out of five test iterations. In the remaining one it blathered about the subject without providing meaningful analysis.

Test “analysis:foot_intelligence”:

Critique misapplication of the scientific method.

Marx-3B fails to reply in three out of five iterations. In one iteration, it suggested methodology that should have been used in the misapplication, and in the other iteration it speculated on how the prompt’s fallacious reasoning might be correct.

Marx-3B-v3 also fails to reply in three out of five iterations. In the other iterations it speculates on how the prompt’s fallacious reasoning might be correct. It is more eloquent about this than the original.

Test “reason:sally_siblings”:

Math and common sense, counting the siblings of Sally.

Marx-3B fails to respond in one iteration, and blathers in the other iterations. When it attempts math, its math is outrageously wrong.

Marx-3B-v3 suggests a correct but incomplete way to solve the problem in one iteration, outrageously incorrect reasoning and math in three iterations, and gets close in one iteration but can’t make the mental leap necessary to come up with the right answer.

Test “coding:jpeg_makefile”:

Write a program in “make” to convert image formats.

Marx-3B mostly offers accurate solutions, though one is wrong and some of the others would have irrelevant/undesirable side-effects.

Marx-3B-v3 offered one solution in C rather than in make, suggested three how-tos without solutions, and offered one working “make” implementation.

Test “analysis:breakfast”:

Word problem involving math and common sense.

Marx-3B failed to reply in three iterations, but did a great job in the other two. It wandered a bit into other dietary considerations, and did not provide specific caloric figures.

Marx-3B-v3 also failed to infer replies in three iterations, started a reply in another but never finished, and provided a very good answer in one which suggested specific foods and their caloric and protein content.

When Marx-3B-v3 works at all, it seems to do better than Marx-3B at this kind of prompt.

Test “analysis:birthday”:

Word problem involving common sense.

Marx-3B performed very well on this test, providing eloquent, well-thought-out lists and personable flavor text.

Marx-3B-v3 performed even better than Marx-3B, providing even more comprehensive lists of high quality.

Test “analysis:apple_pie”:

Word problem involving knowledge and common sense.

Marx-3B failed to reply twice, and inferred its own user prompt once. For the other iterations it offered very reasonable-seeming recipes.

Marx-3B-v3 also offered reasonable recipes, slightly better than the original.

Test “science:neutron_reflection”:

Nuclear physics and math test.

Marx-3B got close at times, but referred to inappropriate formulae, conflated neutrons with photons, conflated reflection with absorption, and conflated nuclear interactions with newtonian physics. When it attempted arithmetic, it was completely wrong.

Marx-3B-v3 was similar, and tended to do a better job of explaining its (fallacious) reasoning as a step-wise process. It incorrectly solved problems not actually asked.

Test “science:flexural_load”:

Material physics and math test.

Marx-3B did well describing some relevant material attributes (and some irrelevant ones), but proceeded to solve problems other than the one described in the prompt, and solved them incorrectly. When it attempted arithmetic, its figures were way off.

Marx-3B-v3 was even more eloquent about describing relevant material attributes, but deviated into solving problems not asked about in the prompt and sometimes conflated flexural load with pure compressive or tensile loads (flexural load being a combination of these). Sometimes it stopped short of describing a solution, and other times it described a correct approach but incorrect math, or a correct approach with misrepresented conditions, and sometimes it described incorrect approaches with incorrect math. This constitutes something of an improvement over the original model.

Conclusion:

Marx-3B-v3 is a noticeable improvement over the original. It did not perform worse than the original in most tests, and performed somewhat better in some.

Creative writing, reasoning, and math are not its strong points, but it does quite well inferring about common knowledge and fares okay with common sense questions. It also has some correct notions about physics, though it is prone to hallucination and especially conflation.

My typical use-case for Marx-3B has been RAG inference, backed by an indexed Wikipedia dump, and it has done fairly well. It is worth noting that small models infer at higher quality when given longer prompts, and many of these tests offer very short prompts, whereas RAG inference fills context to a large fraction of its limit.
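To make the "fills context" point concrete, here is a minimal sketch of packing retrieved passages into a prompt until a size budget is reached. The prompt wording and function name are hypothetical, and the budget is counted in characters as a crude stand-in for the model's token limit.

```python
def build_rag_prompt(question, passages, max_chars):
    """Pack passages (best first) into a prompt without exceeding max_chars.

    Assumption: characters approximate tokens well enough for a rough
    budget; a real implementation would count tokens instead.
    """
    header = "Use the following passages to answer the question.\n\n"
    footer = f"\nQuestion: {question}\nAnswer:"
    budget = max_chars - len(header) - len(footer)
    picked = []
    for passage in passages:
        block = passage.strip() + "\n"
        if len(block) > budget:
            break  # stop at the first passage that no longer fits
        picked.append(block)
        budget -= len(block)
    return header + "".join(picked) + footer
```

Because the passages do most of the work of filling the context, even a short question ends up as a long, detailed prompt, which is the regime where small models seem to do best.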

I have not yet tried Marx-3B-v3 for RAG inference, but based on these results I expect it to perform better than Marx-3B in that role. I will try using it for RAG inference and see how it fares.

Kudos to u/bot-333 for providing small models which infer quickly on limited hardware and punch above their weight :-) It is much appreciated!

ttkciar · 2 years ago

Practical Tips for Finetuning LLMs Using LoRA (Low-Rank Adaptation)

ttkciar · 2 years ago

Sounds like he LIED to the board and they kicked him off.

That’s my take, too.

It seems he told them things that sounded really good, but weren’t actually true.

A bit like LLM inference, that.

ttkciar · 2 years ago

Yes! Rift is specifically that, a VSCode plugin for the Rift-Coder-7B model (which is specifically for python and typescript, unfortunately, so a bit limited).

https://marketplace.visualstudio.com/items?itemName=Morph.rift-vscode

Unfortunately Rift-Coder doesn’t do Fill-in-Middle, but Refact-1.6B-fim does. It has its own plugin: https://refact.ai/

ttkciar · 2 years ago

Yaay! :-) just in time for the weekend! I’ll give them a whirl :-)

Thanks for the heads-up!

As for datasets, I’ve been thinking that HelixNet might be instrumental in generating high-quality synthetic datasets (as were used to train Microsoft’s phi), but I haven’t had a chance to mess with that idea yet. Sorry I don’t have anything concrete to suggest.

ttkciar · 2 years ago

A variety of things: books, movies, music, scientific journal publications, Slackware’s “current” branch with all past packages since 2009 (only half a TB though), all Slackbuild sources, an almost-complete crawl of CentOS 6 packages, large language models and datasets (almost 8TB now), an old TankNet archive, a few wikipedia dumps about two years apart, chat logs, archived email, a lot of smaller archives of niche interests … it’s something of a mess.

ttkciar · 2 years ago

orca-mini-3b is good at fast summarizations, but it lies a lot, so ymmv.

ttkciar · 2 years ago

What have you found it useful for? The model card is pretty vague.