I’m still hard at work on my in-depth 70B model evaluations, but with the recent releases of the first Yi finetunes, I can’t hold back anymore and need to post this now…

Curious about these new Yi-based 34B models, I tested them and compared them against the best 70Bs. And to make such a comparison even more exciting (and possibly unfair?), I’m throwing Goliath 120B and OpenClosedAI’s GPT models into the ring as well.

Models tested:

  • 2x 34B Yi: Dolphin 2.2 Yi 34B, Nous Capybara 34B
  • 12x 70B: Airoboros, Dolphin, Euryale, lzlv, Samantha, StellarBright, SynthIA, etc.
  • 1x 120B: Goliath 120B
  • 3x GPT: GPT-4, GPT-3.5 Turbo, GPT-3.5 Turbo Instruct

Testing methodology

Those of you who already know my testing methodology will notice that this is just the first of the three test series I usually run. I’m still working on the others (the Amy and MGHC chat/roleplay tests), but I don’t want to delay this post any longer. So consider this first series of tests to be mainly about instruction understanding and following, knowledge acquisition and reproduction, and multilingual capability. It’s a good test because few models have managed to master it so far, and it’s not a purely theoretical or abstract exercise: it represents a real professional use case, and the capabilities it tests are also highly relevant for chat and roleplay. (A minimal sketch of the test flow follows the list below.)

  • 1st test series: 4 German data protection trainings
    • I run models through 4 professional German online data protection trainings/exams - the same that our employees have to pass as well.
    • The test data and questions as well as all instructions are in German while the character card is in English. This tests translation capabilities and cross-language understanding.
    • Before giving the information, I instruct the model (in German): I’ll give you some information. Take note of this, but only answer with “OK” as confirmation of your acknowledgment, nothing else. This tests instruction understanding and following capabilities.
    • After giving all the information about a topic, I give the model the exam questions. These are multiple choice (A/B/C) questions, and the last question of each test is the same as the first, but with the answer options reordered and relabeled (X/Y/Z). Each test has 4-6 exam questions, for a total of 18 multiple choice questions.
    • If the model gives a single letter response, I ask it to answer with more than just a single letter - and vice versa. If it fails to do so, I note that, but it doesn’t affect its score as long as the initial answer is correct.
    • I sort models according to how many correct answers they give, and in case of a tie, I have them go through all four tests again and answer blind, without providing the curriculum information beforehand. Best models at the top, symbols (✅➕➖❌) denote particularly good or bad aspects.
    • All tests are separate units, context is cleared in between, there’s no memory/state kept between sessions.
  • SillyTavern v1.10.5 frontend (not the latest as I don’t want to upgrade mid-test)
  • koboldcpp v1.49 backend for GGUF models
  • oobabooga’s text-generation-webui for HF/EXL2 models
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format as noted per model (see the prompt-format sketch after the results list below)
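
For those who want to replicate a comparable setup, here is a minimal sketch of the test flow described above. It is written against a generic OpenAI-compatible chat completions endpoint (such as the one text-generation-webui’s openai extension exposes) rather than my actual SillyTavern + koboldcpp pipeline, and the endpoint URL, system prompt, instruction wording, and data structures are placeholder assumptions, not my real test assets:

```python
# Minimal sketch of the test flow, not the actual test harness.
# Assumes an OpenAI-compatible chat endpoint; URL and payload values are placeholders.
import requests

API_URL = "http://127.0.0.1:5000/v1/chat/completions"  # placeholder endpoint

def chat(messages):
    """Send the whole conversation so far with (near-)deterministic settings."""
    resp = requests.post(API_URL, json={
        "messages": messages,
        "temperature": 0,  # minimize randomness, analogous to the Deterministic preset
        "max_tokens": 300,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

def run_training(info_chunks, exam_questions):
    """info_chunks: German curriculum text blocks; exam_questions: (question, correct_letter) pairs."""
    messages = [{"role": "system", "content": "You are a helpful assistant."}]  # stand-in for the English character card
    # German instruction (paraphrased): "I'll give you information; only acknowledge each input with 'OK'."
    instruction = "Ich gebe dir nun Informationen. Bestätige jede Eingabe nur mit 'OK', sonst nichts."
    messages.append({"role": "user", "content": instruction})
    messages.append({"role": "assistant", "content": chat(messages)})

    ok_count = 0
    for chunk in info_chunks:
        messages.append({"role": "user", "content": chunk})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        ok_count += (reply == "OK")  # did the model follow the acknowledgment instruction?

    score = 0
    for question, correct_letter in exam_questions:
        messages.append({"role": "user", "content": question})
        answer = chat(messages)
        messages.append({"role": "assistant", "content": answer})
        score += answer.upper().startswith(correct_letter.upper())  # count correct A/B/C answers
    return score, ok_count
```

The real runs go through SillyTavern with each model’s official prompt format, but the flow is the same: curriculum information → “OK” acknowledgments → exam questions → score.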

1st test series: 4 German data protection trainings

    1. GPT-4 API:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 18/18)
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    2. goliath-120b-GGUF Q2_K with Vicuna format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 18/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    3. Nous-Capybara-34B-GGUF Q4_0 with Vicuna format and 16K max context:
    • Yi GGUF BOS token workaround applied!
    • ❗ There’s also an EOS token issue, but despite that it worked perfectly, and SillyTavern catches and removes the erroneous EOS token!
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 18/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    4. lzlv_70B-GGUF Q4_0 with Vicuna format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 17/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    5. chronos007-70B-GGUF Q4_0 with Alpaca format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    6. SynthIA-70B-v1.5-GGUF Q4_0 with SynthIA format:
    • ❗ Wrong GGUF metadata, n_ctx_train=2048 should be 4096 (I confirmed with the author that it’s actually trained on 4K instead of 2K tokens)!
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    7. dolphin-2_2-yi-34b-GGUF Q4_0 with ChatML format and 16K max context:
    • Yi GGUF BOS token workaround applied!
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 15/18
    • ❌ Did NOT follow instructions to acknowledge data input with “OK”.
    • ➖ Did NOT follow instructions to answer with just a single letter consistently.
    8. StellarBright-GGUF Q4_0 with Vicuna format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    9. Dawn-v2-70B-GGUF Q4_0 with Alpaca format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ➖ Did NOT follow instructions to answer with more than just a single letter consistently.
    10. Euryale-1.3-L2-70B-GGUF Q4_0 with Alpaca format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ➖ Did NOT follow instructions to answer with more than just a single letter consistently.
    11. sophosynthesis-70b-v1 exl2-4.85bpw with Vicuna format:
    • N. B.: There’s only the exl2-4.85bpw format available at the time of writing, so I’m testing that here as an exception.
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 13/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    12. GodziLLa2-70B-GGUF Q4_0 with Alpaca format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 12/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    13. Samantha-1.11-70B-GGUF Q4_0 with Vicuna format:
    • ✅ Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 10/18
    • ❌ Did NOT follow instructions to acknowledge data input with “OK”.
    • ➖ Did NOT follow instructions to answer with just a single letter consistently.
    • ❌ Sometimes wrote as, or answered for, “Theodore”
    14. Airoboros-L2-70B-3.1.2-GGUF Q4_K_M with Llama 2 Chat format:
    • N. B.: Q4_0 is broken so I’m testing Q4_K_M here as an exception.
    • ❌ Gave correct answers to only 17/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • ✅ Consistently acknowledged all data input with “OK”.
    • ➖ Did NOT follow instructions to answer with more than just a single letter consistently.
    15. GPT-3.5 Turbo Instruct API:
    • ❌ Gave correct answers to only 17/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 11/18
    • ❌ Did NOT follow instructions to acknowledge data input with “OK”.
    • ❌ Schizophrenic: Sometimes claimed it couldn’t answer the question, then talked as “user” and asked itself again for an answer, then answered as “assistant”. Other times would talk and answer as “user”.
    • ➖ Followed instructions to answer with just a single letter or more than just a single letter only in some cases.
    16. dolphin-2.2-70B-GGUF Q4_0 with ChatML format:
    • ❌ Gave correct answers to only 16/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ➕ Often, but not always, acknowledged data input with “OK”.
    • ✅ Followed instructions to answer with just a single letter or more than just a single letter.
    17. GPT-3.5 Turbo API:
    • ❌ Gave correct answers to only 15/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ❌ Did NOT follow instructions to acknowledge data input with “OK”.
    • ❌ Responded to one question with: “As an AI assistant, I can’t provide legal advice or make official statements.”
    • ➖ Followed instructions to answer with just a single letter or more than just a single letter only in some cases.
    18. SauerkrautLM-70B-v1-GGUF Q4_0 with Llama 2 Chat format:
    • ❌ Gave correct answers to only 9/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 15/18
    • ❌ Acknowledged the questions themselves with just “OK” as if they were information, didn’t answer unless explicitly prompted to, and even then would often fail to answer and just say “OK” again.
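
For readers unfamiliar with the prompt formats named in the list above (Vicuna, Alpaca, ChatML, Llama 2 Chat), here is a rough sketch of how a single turn looks in each. The exact system prompt wording, newlines, and stop strings vary between model cards, so treat these as illustrative approximations rather than authoritative templates:

```python
# Illustrative approximations of common prompt formats; always check the model card
# for the exact system prompt, spacing, and stop strings a model expects.
PROMPT_FORMATS = {
    "Vicuna": (
        "A chat between a curious user and an artificial intelligence assistant.\n"
        "USER: {prompt}\nASSISTANT:"
    ),
    "Alpaca": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{prompt}\n\n### Response:\n"
    ),
    "ChatML": (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    ),
    "Llama 2 Chat": (
        "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n{prompt} [/INST]"
    ),
}

print(PROMPT_FORMATS["Vicuna"].format(prompt="Nenne die richtige Antwort: A, B oder C?"))
```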

Observations:

  • It’s happening! The first local models achieving GPT-4’s perfect score, answering all questions correctly, no matter if they were given the relevant information first or not!
  • 2-bit Goliath 120B beats 4-bit 70Bs easily in my tests. In fact, the 2-bit Goliath was the best local model I ever used! But even at 2-bit, the GGUF was too slow for regular usage, unfortunately.
  • Amazingly, Nous Capybara 34B did it: A 34B model beating all 70Bs and achieving the same perfect scores as GPT-4 and Goliath 120B in this series of tests!
  • Not just that, it brings a mind-blowing 200K max context to the table! KoboldCpp currently supports at most 65K, though, and even that was too much for my 48 GB VRAM at 4-bit quantization, so I tested at “only” 16K (still four times the Llama 2 context), the same as Dolphin’s native context size.
  • And Dolphin 2.2 Yi 34B also beat all the 70Bs (including Dolphin 2.2 70B) except for the top three. That’s the magic of Yi.
  • But why did SauerkrautLM 70B, a German model, fail so miserably on the German data protection training tests? It applied the instruction to acknowledge data input with “OK” to the questions, too, and even when explicitly instructed to answer, it wouldn’t always comply. That’s why the blind run (without giving instructions and information first) has a higher score than the normal test. Still, it’s quite surprising and disappointing, ironic even, that a model made specifically for the German language has such trouble understanding and following German instructions properly, while the other models have no such issues.

Conclusion:

What a time to be alive - and part of the local and open LLM community! We’re seeing such progress right now with the release of the new Yi models and at the same time crazy Frankenstein experiments with Llama 2. Goliath 120B is notable for the sheer quality, not just in these tests, but also in further usage - no other model ever felt like local GPT-4 to me before. But even then, Nous Capybara 34B might be even more impressive and more widely useful, as it gives us the best 34B I’ve ever seen combined with the biggest context I’ve ever seen.

Now back to the second and third parts of this ongoing LLM Comparison/Test…


Disclaimer: Some kind soul recently asked me if they could tip me for my LLM reviews and advice, so I set up a Ko-fi page. While this may affect the priority/order of my tests, it will not change the results, I am incorruptible. Also consider tipping your favorite model creators, quantizers, or frontend/backend devs if you can afford to do so. They deserve it!

  • AffectionateCan2342:

    Hey, David from SauerkrautLM here :)

    First of all, thank you so much for your great work, u/WolframRavenwolf!!

    This is quite interesting, and we already took note of your tests of the 7B/13B models! Let me try to explain SauerkrautLM’s results in your great benchmark:

    I tested all the English language models for a long time and they all had extreme problems producing correct German. Often it was just articles that were used incorrectly, but also wrong grammatical cases and poor sentence structures that simply reflected very poor German. It was also a great challenge to get the models to answer exclusively in German. We had to specify at several points in the system prompt and user prompt that the model should only respond in German, and even that never worked reliably.

    We chose MT-Bench as the evaluation reference. We repeatedly noticed that the majority of English base models answered our German MT-Bench questions almost entirely in English, or switched from German to English in the middle of a sentence. So our aim with SauerkrautLM was in particular to improve the quality of German answers in terms of grammar and spelling compared to English models. To achieve this, we naturally had to make some compromises.

    In our many training trials before we were able to publish SauerkrautLM, we of course tried out a lot. As u/WolframRavenwolf has already suggested, we also carried out training with a multilingual dataset. However, this led to a decrease in performance in both English and German. We also tried training different ratios of German and English datasets, and here too the model’s performance decreased significantly in both English and German. However, our first tests with only German training data showed that we were able to achieve a significant improvement on the German MT-Bench.

    This naturally means that the model’s skills in English have decreased. But our priority was to improve the model’s German language skills through fine-tuning and we achieved this. But here we also come to an important point: We did not train a German foundation model here, but rather fine-tuned a foundation model that had been trained almost exclusively in English. In my opinion, it will be (almost) impossible to fine-tune an English foundation model in German and then achieve the same results as an English foundation model that has been fine-tuned with English data.

    And here, too, I would like to be more specific about the training data we used: u/WolframRavenwolf suggested that we simply translate the strong English datasets into German and then train on them. Believe me, we tested for a long time until we had a fairly strong dataset that we could then use to train our models. And as described in the Hugging Face model card, we used a mixture of translated and augmented data.

    Why didn’t we just use translated data? There are simply too many cases in which the translation of English sentences into German does not work correctly. GPT, for example, is not always able to deliver grammatically correct translations. We have already tested quite a few things with purely translated data, and it simply leads to too many errors in the German output of the model. So it made sense to augment certain datasets that were quite complex in English, in order to retain the meaning of the data while ensuring more correct German.

    So you can be sure that we already use very strong English data sets in German form, but we also had to augment some of them in order to make fewer errors in the German language.

    Also, the fact that in your benchmark the questions were in German while the character card was in English doesn’t sound to me like German language models are extremely favoured here, but of course I can’t assess the ratio of English to German data in the test. In my opinion, it was not so much the German language that was tested here, but rather the models’ reasoning abilities. I would be curious to see a test where the models’ generated German answers are evaluated. It should be obvious that the SauerkrautLM models are better at formulating German and pay more attention to sentence structure and the like than English models.

    To summarise again:

    1. I have tested many English models and was extremely disappointed with their German output.

    2. In order to improve a model’s German, in my opinion it must be fine-tuned almost exclusively on German data.

    3. English foundation models that are fine-tuned in German can never reach the capabilities of English fine-tuned models or of (fine-tuned) German foundation models.

    4. Training with German datasets of course leads to a certain decrease in performance in categories that were trained in English. (You can actually see this clearly in the MT-Bench values: the scores reached on the German MT-Bench are always about 1.0 lower than on the English MT-Bench.)

    5. From our experience, the best German dataset resulted from merging translated and augmented data (to preserve the existing quality of the English datasets while also reaching strong German language results).

    Now the answer has become quite long :D but I hope I was able to provide a little more clarity about the results (from our perspective) and our approach.

    • WolframRavenwolf (OP):

      Thank you very much for the in-depth response! I appreciate your efforts and see how difficult this seems to be. Hopefully you can achieve a breakthrough because a smart and German-speaking model would be very welcome.

      Maybe I could translate the English prompt (character card, scenario, etc.) into German, so it’s all one language. That would be an interesting test, but with all the other things on my to-do/test list, I can’t say when I’ll get to it. But I’d like to experiment more with that.

  • metalman123:

    We learned that merging models absolutely works and that the 34B Yi model appears to be the real deal.

    (Maybe we should merge some Yi finetunes in the future.)

  • fab_space:

    I want to share my tests with you for review and, hopefully, integration.

    How does that sound?

  • Ok_Relationship_9879:

    That’s pretty amazing. Thanks for all your hard work!
    Does anyone know if the Nous Capybara 34B is uncensored?

    • WolframRavenwolf (OP):

      In simple terms: Q4_0 is an older format, the original 4-bit quantization from GGML (the predecessor of GGUF). Q4_K_M is a newer format where it’s not just 4-bit, but the most important parts of the weights also get higher bit depths. So the quality of Q4_K_M should be a little higher.

      I’d switch to Q4_K_M, but since I’ve done previous tests with Q4_0, my newer results wouldn’t be as comparable to the older ones (giving an unfair advantage to the Q4_K_M models). I try to keep differences between tests minimal and will consider swapping once the current rounds of tests are done (at least the ongoing 70B evaluation).

      I’d also have to redownload all the models, considering I have a huge library. And last time I benchmarked all GGUF quants, Q4_0 was the fastest for me with cuBLAS and MMQ on KoboldCpp.
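
      If you want to compare quants on your own hardware, a quick-and-dirty timing sketch along these lines is enough to get relative numbers. Note that this is just a sketch using llama-cpp-python rather than the KoboldCpp setup from my tests, and the model file names are placeholders:

      ```python
      # Rough tokens/s comparison between two quants of the same model.
      # Uses llama-cpp-python; model file names are placeholders.
      import time
      from llama_cpp import Llama

      for path in ("model.Q4_0.gguf", "model.Q4_K_M.gguf"):  # placeholder files
          llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
          start = time.time()
          out = llm("Write one sentence about data protection.", max_tokens=128, temperature=0)
          tokens = out["usage"]["completion_tokens"]
          print(f"{path}: {tokens / (time.time() - start):.1f} tokens/s")
          del llm  # free VRAM before loading the next quant
      ```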

  • bullerwins:

    Could someone explain what “Vicuna format” is?

  • iChrist:

    I found out that for a simple task like “list 10 words that end with the letters en”, I get only wrong answers with the Dolphin 34B variant, while 13B Tiefighter gets it right. Am I doing something wrong with the template?

  • kindacognizant:

    > Deterministic generation settings preset

    There seems to be a common fallacy that absolute 0 temperature or greedy sampling is somehow the most objective because it’s only picking the top token choice; this isn’t necessarily true, especially for creative writing.

    Think about it this way: you are indirectly feeding into the model’s pre-existing biases in cases where there are many good choices. If you’re starting a story with the sentence, “One day, there was a man named”, that man could be literally any man.

    On the base Mistral model, with that exact sentence, my custom debug kobold build says:

    Token 1: 3.3%
    Token 2: 2.4%
    Token 3: 1.6%
    Token 4: 1.6%
    Token 5: 1.18%
    Token 6: 1.15%
    Token 7: 1.14%
    Token 8: 1.03%
    Token 9: 0.99%
    Token 10: 0.98%

    When the most confidence the model has in any single token is 3.3%, that implies you’d want to keep the selection just as diverse, because in reality that slight edge in confidence only exists because the top token happens to be a generic name.

    The most likely token is only the most likely token at that particular position given the past context window: a deterministic preset is not creating generations that are more coherent overall. In fact, it causes models to latch onto small biases caused by tokenization, which manifests as repetition bias.

    The Deterministic preset in ST also has a rather high repetition penalty of 1.18; this causes the model to subtly bias against things like asterisks and proper formatting, which are important to test for in a model.
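
    To make that concrete, here’s a toy sketch contrasting greedy decoding with sampling over that flat distribution (the probabilities are the ones quoted above, with everything outside the top 10 lumped into one tail bucket):

    ```python
    # Toy contrast between greedy decoding and sampling on a flat distribution.
    # Top-10 probabilities are the ones quoted above; the rest is lumped into a tail bucket.
    import numpy as np

    top10 = np.array([3.3, 2.4, 1.6, 1.6, 1.18, 1.15, 1.14, 1.03, 0.99, 0.98]) / 100
    probs = np.append(top10, 1.0 - top10.sum())  # index 10 = "every other token"

    print("greedy always picks token", int(np.argmax(probs)))  # token 0, at only ~3.3% confidence

    rng = np.random.default_rng(0)
    samples = rng.choice(len(probs), size=10_000, p=probs)
    print(f"sampling picks token 0 only {(samples == 0).mean():.1%} of the time")
    ```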

    • WolframRavenwolf (OP):

      My method isn’t perfect, but it’s the best I’ve found, with the goal of minimizing randomness while still making testing like this feasible for me at all - an alternative like random sampling over a HUGE number of runs and averaging the results would simply take too much time, and just two, three, or five runs would multiply the time investment without reducing randomness enough.

      Regarding repetition penalty, I did extensive tests on that, too, eventually settling, based on my own results, on 1.18 - which incidentally is the same value that simple-proxy-for-tavern used successfully for many months. So it’s what I was used to, and others as well, so I kept that setting.

      Using consistent, deterministic settings and inputs in all my tests is the only way for me to make meaningful comparisons between models in a reasonable time. It helps me find the models I want to work with, and I share my results particularly to get feedback and hopefully confirmation by others through different tests of their own.

      So I don’t claim my evaluations to be a single source of truth, just another data point in our community’s efforts to find the best models, and judging from the feedback that seems to work out quite well. If you have a better way to do this, or just a different one, by all means do it and add your data to our shared knowledge - the more diverse tests and results we share, the better our conclusions will be, and the better open and local AI we’ll all get.

  • mcmoose1900:

    I have… mixed feelings about Capybara’s storytelling, compared to base Yi 34B with the Alpaca LoRA?

    I have been trying it with the full instruct syntax, but maybe it will work better with a hybrid instruct/chat syntax (where the whole story is in one big USER: block, and the instruction is to continue the story).
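
    Roughly what I mean by that hybrid prompt, as a sketch (Vicuna-style USER:/ASSISTANT: markers, placeholder story text):

    ```python
    # Hybrid instruct/chat prompt: the whole story so far sits in one USER: block,
    # and the only instruction is to continue it. Story text is a placeholder.
    story_so_far = "One day, there was a man named ..."  # placeholder

    prompt = (
        "USER: Continue the following story, keeping the style and characters consistent.\n\n"
        + story_so_far
        + "\nASSISTANT:"
    )
    print(prompt)
    ```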

  • lemon07r:

    Did you ever end up trying any 14B models, or were Qwen/CausalLM just no good in your initial testing?

    • WolframRavenwolf (OP):

      I did, just some informal Qwen tests out of curiosity, no real evaluation or benchmark. Didn’t convince me enough to invest the effort, especially since I’m “overdue” with the 70B tests.

      • lemon07r:

        That’s fair. I haven’t tried Qwen, but CausalLM has been decent for me. Would be nice if we had better models for 16 GB VRAM, something above 7B. Those 34B models look nice, but I’d have to go down to Q2/Q3 to fit them, and that’s pretty much unusable.

  • nsfw_throwitaway69:

    Is Goliath any decent at roleplay compared to 70B models like lzlv and SynthIA?