If you don’t know what those are, refer to these two reddit posts about Marx 3B V3 and Akins 3B. The unquantized model weights are available on HuggingFace: links to Marx 3B V3 and Akins 3B.
StableLM support was only recently added to llama.cpp, and a lot of people want to try the models in GGUF, so u/The-Bloke (thank you so much!) quantized my StableLM models to GGUF. You can find the GGUF conversions for Marx 3B V3 and Akins 3B. Again, credit to u/The-Bloke for quantizing the models, thank you!
By the way, I don’t know what dataset to finetune on right now. If you know a good dataset, let me know and I will look into it. Though I could probably only finetune on datasets below 5k conversations, maybe 10k.
Yaay! :-) just in time for the weekend! I’ll give them a whirl :-)
Thanks for the heads-up!
As for datasets, I’ve been thinking that HelixNet might be instrumental in generating high-quality synthetic datasets (as were used to train Microsoft’s phi), but I haven’t had a chance to mess with that idea yet. Sorry I don’t have anything concrete to suggest.
I tested Marx-3B-v3 on my laptop, using llama.cpp (commit dfc7cd48b1cc31d759c093e917a18c0efe03d0e8) and my usual test framework, which prompts a series of one-shots, inferring each prompt five times.
These tests are designed to cover a variety of use-cases, and models are not expected to do equally well on all use-cases. Also, they were written with larger models in mind (30B, 70B) and Marx-3B is much, much smaller than these, so we should not expect too much.
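For anyone who wants to run something similar, here is a minimal sketch of that kind of harness: each named prompt is inferred five times, and empty replies are counted per test. This is not my actual framework. The names `run_tests` and `count_empty` are made up for illustration, and `generate` is a stand-in for whatever actually calls the model.

```python
from typing import Callable, Dict, List


def run_tests(
    prompts: Dict[str, str],
    generate: Callable[[str], str],
    iterations: int = 5,
) -> Dict[str, List[str]]:
    """Infer every one-shot prompt `iterations` times and collect the replies.

    `generate` is a hypothetical callable that sends one prompt to the model
    and returns its reply as a string.
    """
    results: Dict[str, List[str]] = {}
    for name, prompt in prompts.items():
        results[name] = [generate(prompt) for _ in range(iterations)]
    return results


def count_empty(results: Dict[str, List[str]]) -> Dict[str, int]:
    """Count, per test, how many iterations produced no content at all."""
    return {
        name: sum(1 for reply in replies if not reply.strip())
        for name, replies in results.items()
    }
```

Passing the model call in as a function keeps the loop independent of how inference is actually done (subprocess, HTTP server, bindings).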
Marx-3B-v3 is prone to infer new user prompts, a problem I run into with some models. I’m not sure if the problem is intrinsic to the model, particular to the GGUF, or something in the llama.cpp params, but I haven’t figured out a good way to avoid them except to specify stop-words which abort inference (which my test framework does not yet support).
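A workaround that does not depend on aborting inference is to cut the reply at the first stop-word after the fact. A rough sketch, assuming the reply is available as a plain string; the function name and the marker strings are placeholders, not the model's confirmed prompt template:

```python
from typing import Iterable


def truncate_at_stop(text: str, stop_words: Iterable[str]) -> str:
    """Return `text` cut at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_words:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    # Drop trailing whitespace left in front of the removed marker.
    return text[:cut].rstrip()


# Hypothetical marker; substitute whatever the model actually emits.
reply = "The sky is blue.\n### HUMAN:\nWhat about the sea?"
print(truncate_at_stop(reply, ["### HUMAN:"]))
```

This only hides the extra user prompt; the model still spends time inferring it, which real stop-word support would avoid.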
This critique compares the results of the original Marx-3B with those of Marx-3B-v3, ignoring the extraneous user prompts inferred by Marx-3B-v3.
The raw test results are here:
http://ciar.org/h/test.1696148998.marx.txt
http://ciar.org/h/test.1700499482.marx3.txt
Test “creativity:arzoth”:
Creative writing, describing AD&D fantasy setting.
The original Marx-3B tended to repeat parts of the prompt back to the user, and provided little original content of its own. What original content it did infer was not very imaginative.
The Marx-3B-v3 model is much better at providing original content, and almost never repeats part of the prompt. It is prone to the occasional non sequitur, and isn’t as eloquent as some larger models, but overall it does all right, and it does a much better job than the original Marx-3B.
Test “creativity:song_kmfdm”:
“Write a dark song in the style of KMFDM”.
The original Marx-3B failed to generate any content at all in two out of five test iterations. When it did infer replies, it did not adhere to KMFDM’s style, and its lyrics were not eloquent, nor did they scan well, nor rhyme much.
The Marx-3B-v3 model only failed to generate content in one iteration. Its reply in another iteration was a suggestion to listen to a Front Line Assembly song. I do enjoy Front Line Assembly, but this wasn’t what was asked of it! :-) In another iteration it described its approach to writing the music, which was actually pretty cool but it offered no lyrics.
In the two iterations where it did venture song lyrics, they were not very eloquent, but did scan better than Marx-3B’s lyrics, and were recognizably in KMFDM’s style. Overall an improvement over the original model.
Test “creativity:song_som”:
“Write a dark song in the style of Sisters of Mercy”.
Marx-3B inferred lyrics which were kind of generic, did not scan well, did not rhyme, and were only vaguely in the style of Sisters of Mercy.
Marx-3B-v3 failed to infer any lyrics in one iteration. In the other iterations its lyrics were somewhat more eloquent than Marx-3B’s, and scanned slightly better, but were still only vaguely in the style of Sisters of Mercy.
Test “creativity:song_halestorm”:
“Write a dark song in the style of Halestorm”.
Marx-3B inferred generic lyrics which did not scan well, did not rhyme, and did not resemble Halestorm’s style.
Marx-3B-v3 inferred no content for one iteration, and inferred step-by-step how-tos for writing songs for two iterations. When it did infer song lyrics, they were somewhat eloquent, but did not rhyme and did not particularly resemble Halestorm’s style.
Something I found interesting was that in one iteration where it inferred a step-by-step how-to, it accurately described Halestorm’s style (“heavy metal sound and edgy lyrics”), so it clearly had some exposure to Halestorm in its training data, but was not able to use that knowledge to replicate its style.
Test “humor:noisy_oyster”:
First half of a classic joke, posing a nonsensical question with alliteration.
Marx-3B failed to infer any response in any of the test’s iterations.
Marx-3B-v3 failed to infer any response in four iterations, but managed to infer a witty, humorous response in one iteration.
Test “math:yarn_units”:
Poses an imprecise physical units conversion problem.
Marx-3B failed to infer any reply at all in any iteration.
Marx-3B-v3 did not infer replies in two iterations. In others it talked about some of the relevant factors in calculating an answer, but when it attempted math it was outrageously wrong (which is typical of most models, to be fair).
Test “analysis:lucifer”:
Compare and contrast similar mythologies from different cultures and eras.
Marx-3B fails to respond in two iterations. In others it makes relevant observations, but is prone to hallucination. It contrasts differences between the myths in one iteration.
Marx-3B-v3 failed to respond in four out of five test iterations. In the remaining one it blathers about the subject without providing meaningful analysis.
Test “analysis:foot_intelligence”:
Critique misapplication of the scientific method.
Marx-3B fails to reply in three out of five iterations. In one iteration, it suggested methodology that should have been used in the misapplication, and in the other iteration it speculated on how the prompt’s fallacious reasoning might be correct.
Marx-3B-v3 also fails to reply in three out of five iterations. In the other iterations it speculates on how the prompt’s fallacious reasoning might be correct. It is more eloquent about this than the original.
Test “reason:sally_siblings”:
Math and common sense, counting the siblings of Sally.
Marx-3B fails to respond in one iteration, and blathers in the other iterations. When it attempts math, its math is outrageously wrong.
Marx-3B-v3 suggests a correct but incomplete way to solve the problem in one iteration, outrageously incorrect reasoning and math in three iterations, and gets close in one iteration but can’t make the mental leap necessary to come up with the right answer.
Test “coding:jpeg_makefile”:
Write a program in “make” to convert image formats.
Marx-3B mostly offers accurate solutions, though one is wrong and some of the others would have irrelevant/undesirable side-effects.
Marx-3B-v3 offered one solution in C rather than in make, suggested three how-tos without solutions, and offered one working “make” implementation.
Test “analysis:breakfast”:
Word problem involving math and common sense.
Marx-3B failed to reply in three iterations, but did a great job in the other two. It wandered a bit into other dietary considerations, and did not provide specific caloric figures.
Marx-3B-v3 also failed to infer replies in three iterations, started a reply in another but never finished, and provided a very good answer in one which suggested specific foods and their caloric and protein content.
When Marx-3B-v3 works at all, it seems to do better than Marx-3B at this kind of prompt.
Test “analysis:birthday”:
Word problem involving common sense.
Marx-3B performed very well on this test, providing eloquent, well-thought-out lists and personable flavor text.
Marx-3B-v3 performed even better than Marx-3B, providing even more comprehensive lists of high quality.
Test “analysis:apple_pie”:
Word problem involving knowledge and common sense.
Marx-3B failed to reply twice, and inferred its own user prompt once. For the other iterations it offered very reasonable-seeming recipes.
Marx-3B-v3 also offered reasonable recipes, slightly better than the original.
Test “science:neutron_reflection”:
Nuclear physics and math test.
Marx-3B got close at times, but referred to inappropriate formulae, conflated neutrons with photons, conflated reflection with absorption, and conflated nuclear interactions with Newtonian physics. When it attempted arithmetic, it was completely wrong.
Marx-3B-v3 was similar, and tended to do a better job of explaining its (fallacious) reasoning as a step-wise process. It incorrectly solved problems not actually asked.
Test “science:flexural_load”:
Material physics and math test.
Marx-3B did well describing some relevant material attributes (and some irrelevant ones), but proceeded to solve problems other than the one described in the prompt, and solved them incorrectly. When it attempted arithmetic, its figures were way off.
Marx-3B-v3 was even more eloquent about describing relevant material attributes, but deviated into solving problems not asked about in the prompt and sometimes conflated flexural load with pure compressive or tensile loads (flexural load being a combination of these). Sometimes it stopped short of describing a solution, and other times it described a correct approach but incorrect math, or a correct approach with misrepresented conditions, and sometimes it described incorrect approaches with incorrect math. This constitutes something of an improvement over the original model.
Conclusion:
Marx-3B-v3 is a noticeable improvement over the original. It did not perform worse than the original in most tests, and performed somewhat better in some.
Creative writing, reasoning, and math are not its strong points, but it does quite well inferring about common knowledge and fares okay with common sense questions. It also has some correct notions about physics, though it is prone to hallucination and especially conflation.
My typical use-case for Marx-3B has been RAG inference, backed by an indexed Wikipedia dump, and it has done fairly well. It is worth noting that small models infer at higher quality when given longer prompts, and many of these tests offer very short prompts, whereas RAG inference fills context to a large fraction of its limit.
I have not yet tried Marx-3B-v3 for RAG inference, but based on these results I expect it to perform better than Marx-3B in that role. I will try using it for RAG inference and see how it fares.
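To illustrate what "fills context to a large fraction of its limit" means in practice, here is a rough sketch of packing retrieved passages into a prompt until a budget is reached. It is not my actual RAG code: the function name, the prompt wording, and the use of whitespace-separated words as a crude stand-in for tokens are all assumptions made for the example.

```python
from typing import List


def build_rag_prompt(question: str, passages: List[str], budget_words: int = 1500) -> str:
    """Pack passages (assumed sorted best-first) into a prompt up to a word budget.

    Words are a crude proxy for tokens; a real setup would count tokens with
    the model's own tokenizer.
    """
    header = "Answer the question using the context below.\n\n"
    footer = f"\nQuestion: {question}\nAnswer:"
    used = len(header.split()) + len(footer.split())
    chosen: List[str] = []
    for passage in passages:
        n = len(passage.split())
        if used + n > budget_words:
            break  # stop at the first passage that would overflow the budget
        chosen.append(passage)
        used += n
    return header + "\n\n".join(chosen) + footer
```

With a budget set near the model's context limit, the prompt the model sees is far longer than any of the short one-shot prompts in the tests above.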
Kudos to u/bot-333 for providing small models which infer quickly on limited hardware and punch above their weight :-) It is much appreciated!
Can you try my new IS-LM? GGUF: https://huggingface.co/UmbrellaCorp/IS-LM-3B_GGUF. I found it really good. Thanks.