Hello!
By popular demand, I am planning a fine-tune of https://huggingface.co/dreamgen/opus-v0-7b on top of Yi-34B, and I wonder whether to use the 200K variant as the base.
The regular Yi-34B seems slightly better than Yi-34B-200K on standard benchmarks, but I wonder how the two “feel” in practice, and whether the 200K model’s weaker short-context performance is worth it, given that the regular version can reportedly be used up to 32K tokens.
Did anyone try an analysis of these two models at various sequence lengths (<4K, <8K, <16K, etc.)?
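To be concrete, here is the kind of comparison I have in mind: perplexity on the same long documents, truncated to each length bucket. A rough sketch rather than a proper harness; the eval texts are a placeholder, and the loading options may need adjusting for your setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts: list[str] = []  # fill with long documents from whatever eval corpus you trust

def ppl_at_length(model, tokenizer, texts, max_len):
    """Perplexity from the mean token loss, each text truncated to max_len tokens."""
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt", truncation=True,
                        max_length=max_len).input_ids.to(model.device)
        if ids.shape[1] < 2:
            continue
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return float(torch.tensor(losses).mean().exp())

for name in ("01-ai/Yi-34B", "01-ai/Yi-34B-200K"):
    tok = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
    for max_len in (4096, 8192, 16384):
        print(name, max_len, ppl_at_length(model, tok, texts, max_len))
```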
Random update on this: I did some more experimenting on the start of a story (with the LimaRP and Petrol LoRAs), and the 4K model seems… fine? So does the 200K.
I don’t know how to stretch out the base model, though. Their page claims it supports 32K, but the config has a 4K context and no RoPE scaling section, just a high rope theta.
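If anyone wants to try stretching it anyway, this is roughly what I would attempt first: bump the window and add RoPE scaling in the config. Untested for Yi specifically; the "dynamic" type and the factor of 8 (32K / 4K) are guesses, and whether rope_scaling is honored at all depends on whether the checkpoint runs on the Llama code path or its own remote modeling code:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

name = "01-ai/Yi-34B"
config = AutoConfig.from_pretrained(name, trust_remote_code=True)

# Shipped config: max_position_embeddings=4096, a large rope_theta, no rope_scaling.
# Guess: widen the window and add dynamic NTK-aware scaling on top.
config.max_position_embeddings = 32768
config.rope_scaling = {"type": "dynamic", "factor": 8.0}

model = AutoModelForCausalLM.from_pretrained(
    name, config=config, torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True)
```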
The one difference I did notice is that the 200K model really likes to summarize and reference previous parts of the story. Maybe it was trained on retrieval or summarization examples.