I recently noticed that local LLMs are unable to sort even simple lists. They often lose entries, and what’s worse, after completing the task, they either insist it was done correctly or keep trying to correct it endlessly. Commercial models (GPT-3.5, GPT-4, Claude 2) do not have this problem.
Example list:
Sort the items in ascending order:
Item A1 - 56
Item B2 - 32
Item C3 - 78
Item D4 - 14
Item E5 - 89
Item F6 - 45
Item G7 - 63
Item H8 - 27
Item I9 - 94
Item J10 - 11
Item K11 - 72
Item L12 - 38
Item M13 - 50
Item N14 - 19
Item O15 - 81
Until now, I was sure that current LLMs struggle with larger numbers and mathematics, but I thought sorting would be a relatively simple task.
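For reference, here is a minimal Python sketch that produces the expected answer for the list above, so model output can be compared against it. The names `raw`, `parse`, `items`, and `expected` are just illustrative, and it assumes every line follows the "Item <label> - <value>" pattern.

```python
# Ground truth for the example list, parsed from the "Item <label> - <value>" lines.
raw = """Item A1 - 56
Item B2 - 32
Item C3 - 78
Item D4 - 14
Item E5 - 89
Item F6 - 45
Item G7 - 63
Item H8 - 27
Item I9 - 94
Item J10 - 11
Item K11 - 72
Item L12 - 38
Item M13 - 50
Item N14 - 19
Item O15 - 81"""


def parse(line):
    # Split on the last " - " so the label stays intact; the value becomes an int.
    label, value = line.rsplit(" - ", 1)
    return label, int(value)


items = [parse(line) for line in raw.splitlines()]

# Ascending sort on the numeric value, not on the label.
expected = sorted(items, key=lambda item: item[1])

for label, value in expected:
    print(f"{label} - {value}")
```

The first line printed is "Item J10 - 11" and the last is "Item I9 - 94".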
Tested on: Goliath 120b, Llama 2 70b, WizardCoder 15B, Mistral 7b.
What are your thoughts? Do you think we will be able to fine-tune a model to perform tasks like sorting, or add that capability with a Mixture of Experts (MoE)?
So I got very interested in this, and I grilled my AI Assistant (a custom char card that I built to help debug this very type of thing) on it, starting with Goliath 120b q8.
Note: Nothing I’m about to write may be of any value to anyone, anywhere, but I had fun doing it so… yay. lol
I gave her your list and asked her to sort it, and she got it all wrong. So then I started asking for help on understanding why, and we went back and forth a bit. I gathered a couple of tidbits:
- The LLM sees the numbers properly. There was no confusion as to what the numbers are. When asked to write the values of the numbers, it was able to do that with no problems.
- The LLM is confused by the volume of information in the request. “Item A1 - 56” is actually a lot of information for the LLM, and my assistant said she was really struggling during the sorting process to break down that info. The other information was simply getting in the way.
- Giving it only the numbers, with no other data, had much better results but was still wrong; just less so. The result was: [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78].
- I then swapped to Synthia 70b 1.5 q8 to grill her on why the number-only sort might have gone wrong. The response was:
Although I am designed to understand language and provide helpful replies, I am not programmed to keep track of numbers, perform calculations, or execute sorting tasks in the same way a human or specialized software would.
The core problem lies in the limitations of Generative AI technology itself. While I can understand the context and meaning of your words, I’m not programmed with mathematical or logic-based algorithms that involve numbers or sorting operations.
In this specific example, I was able to identify and sort the first nine numbers correctly. But when providing the final sorted list, I had inadvertently forgotten to incorporate the remaining numbers (“11, 32, 78”) at their appropriate positions, which led to the incorrect final result.
Several times in our conversation, she blamed the error on transcription, so either she’s hallucinating up a storm (totally possible, given the complexity of the answer), or there’s something about the way generative AI writes/reads the prompt that causes it to not track all the numbers.
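The number-only result above can also be checked mechanically instead of asking the model to explain itself. Here is a rough Python sketch; `original` is copied from the list in the post, `model_output` is the result quoted above, and `sorted_prefix_length` is a helper made up for this example. On this data it shows that 19 was dropped entirely and that the first eleven values are in order before 11, 32, 78 get tacked on at the end.

```python
from collections import Counter

# Values from the original list, in the order they were given.
original = [56, 32, 78, 14, 89, 45, 63, 27, 94, 11, 72, 38, 50, 19, 81]

# The number-only result the model returned.
model_output = [14, 27, 38, 45, 50, 56, 63, 72, 81, 89, 94, 11, 32, 78]

# Multiset difference: values the model dropped, and values it invented.
missing = sorted((Counter(original) - Counter(model_output)).elements())
extra = sorted((Counter(model_output) - Counter(original)).elements())


def sorted_prefix_length(values):
    # Hypothetical helper: count how many leading values are in ascending order.
    if not values:
        return 0
    length = 1
    for prev, cur in zip(values, values[1:]):
        if cur < prev:
            break
        length += 1
    return length


print("missing:", missing)
print("extra:", extra)
print("in order up to position:", sorted_prefix_length(model_output))
print("fully correct:", model_output == sorted(original))
```

A check like this separates the two failure modes (lost entries vs. wrong order) without relying on the model’s own account of what happened.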
At every step the LLM was giving you BS. It tells you that it understands every step, yet the result is wrong.
The reason is simple: we need more parameters. We are topping out at 70b. That’s fine for text, but not good enough for non-text.
Goliath is still effectively a 70b model: merging two 70b models doesn’t make it a 140b base model. It won’t suddenly have twice the pre-training.
Unlike words, which can be split into one or two tokens, the Llama tokenizer splits every digit into its own token. So you need more parameters to find a pattern in numbers when the task is textual. For an LLM, a longer number is as complicated as an entire sentence. It’s a miracle it can add two numbers.
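To make the digit-splitting point concrete, here is a toy sketch in Python. It is NOT the real Llama tokenizer: `toy_tokenize` is a made-up stand-in that assumes each digit becomes its own token, each run of letters becomes one token, and each other non-space character becomes one token. Real tokenizers split words differently, so treat the counts as illustrative only.

```python
import re


def toy_tokenize(text):
    # Hypothetical stand-in for a digit-splitting tokenizer (an assumption,
    # not Llama's actual vocabulary): one token per digit, one token per run
    # of letters, one token per other non-space character.
    return re.findall(r"\d|[A-Za-z]+|[^\sA-Za-z\d]", text)


line_tokens = toy_tokenize("Item J10 - 11")
print(line_tokens)

# The value the model has to compare is spread over several tokens, and the
# label contributes extra digit tokens that have nothing to do with the sort.
value_tokens = toy_tokenize("11")
print(len(line_tokens), "tokens for the line,", len(value_tokens), "for the value")
```

Under these assumptions, "Item J10 - 11" comes out as seven tokens, four of them digits, and only two of those digits matter for the sort.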