Hello!
By popular demand, I am planning a fine-tune of https://huggingface.co/dreamgen/opus-v0-7b on top of Yi-34B, and I wonder whether to use the 200K variant as the base.
The regular Yi-34B seems slightly better than Yi-34B-200K on standard benchmarks, but I wonder how the 200K model “feels” in practice and whether the loss of short-context performance is worth it, given that the regular version can already be used up to 32K tokens.
Has anyone tried comparing these two models across various sequence lengths (<4K, <8K, <16K, etc.)?
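For what it's worth, here is a minimal sketch of how such a comparison could be run with transformers, in case anyone wants to try. The evaluation file is a placeholder, and truncating a single long text is cruder than a proper sliding-window perplexity, but it should show the trend:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODELS = ["01-ai/Yi-34B", "01-ai/Yi-34B-200K"]
LENGTHS = [4096, 8192, 16384, 32768]

def perplexity_at_length(model, tokenizer, text, length):
    # Truncate the evaluation text to the target context length.
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :length].to(model.device)
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Placeholder: any text longer than the largest length being tested.
long_text = open("long_story.txt").read()

for name in MODELS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    for n in LENGTHS:
        ppl = perplexity_at_length(model, tokenizer, long_text, n)
        print(f"{name} @ {n} tokens: ppl = {ppl:.2f}")
```

A full 34B model at 32K context needs serious VRAM, so the longer lengths may require quantization or multiple GPUs.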
I am running my story on the 200K model, and it feels the same to me as the regular 4K one (which I tried in the same setting before 200K was released).
And honestly… even if it were much worse (and I don't think it is worse at all), the mega context is such a boon for storytelling.
What I did not try was the 4K model stretched out with RoPE alpha scaling or anything like that, but the 200K model does not need any stretching up to at least 42K.
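For anyone who does want to experiment with stretching: "alpha" is the NTK-aware RoPE knob exposed by exllama and text-generation-webui; the closest equivalent in plain transformers that I know of is the rope_scaling option. A minimal sketch, assuming the Yi checkpoint loads through the standard Llama code path (the factor is an illustrative value, not a tuned one):

```python
from transformers import AutoModelForCausalLM

# "dynamic" is transformers' NTK-aware RoPE scaling, roughly what
# exllama/text-generation-webui expose as the alpha parameter.
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",
    rope_scaling={"type": "dynamic", "factor": 2.0},  # factor=2.0 is illustrative
    torch_dtype="auto",
    device_map="auto",
)
```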