So Mistral-7B is a pretty impressive 7B-parameter model … but why is it so capable? Do we have any insights into its dataset? Was it trained far beyond the compute-optimal (Chinchilla) point? Any attempts at open reproductions, or merges to scale up the parameter count?
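For scale, a quick back-of-envelope, assuming the ~20 tokens-per-parameter Chinchilla heuristic. Mistral has not published its training token count, so the budgets below are purely hypothetical:

```python
# Rough check of what "trained far beyond the compute-optimal point" would mean,
# assuming the Chinchilla heuristic of ~20 training tokens per parameter.
# Mistral-7B's real token count is undisclosed; the budgets below are hypothetical.
params = 7.3e9                    # Mistral-7B has roughly 7.3B parameters
chinchilla_tokens = 20 * params   # ~146B tokens would be "compute-optimal"

for trained_tokens in (2e12, 4e12, 8e12):  # hypothetical training budgets
    ratio = trained_tokens / chinchilla_tokens
    print(f"{trained_tokens / 1e12:.0f}T tokens -> {ratio:.0f}x the Chinchilla-optimal budget")
```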

  • PookaMacPhellimenB · 10 months ago

    Lack of censorship is a key factor as it maximises the predictive abilities of the model.

  • meetraisB · 10 months ago

    I second this. Mistral-7B gave me good results. After fine-tuning, its results are even better.
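A minimal sketch of the kind of fine-tune described above, using Hugging Face transformers and peft (LoRA). The target modules and hyperparameters are illustrative guesses, not the commenter's actual recipe:

```python
# Minimal LoRA fine-tuning setup for Mistral-7B with Hugging Face transformers + peft.
# Hyperparameters and target modules are illustrative, not a known-good recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the 7B weights are trained

# From here, any standard causal-LM training loop (transformers.Trainer, trl's
# SFTTrainer, axolotl, etc.) over an instruction dataset gives the kind of uplift
# described above.
```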

    • kaszebeB · 10 months ago

      Mistral-7B gave me good results

      Can you expand upon that? Do you mean in terms of its ability to write at a college level without major grammatical errors?

  • CharuruB · 10 months ago

    The results are okay, but I’m hard-pressed to call it “very capable”. My perspective is that other, bigger models are making mistakes they shouldn’t be making because they were “trained wrong”.

    • Monkey_1505B · 10 months ago

      Knowledge is a strange goal for any model when we have the internet, IMO. Just connect your model to a web search.
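A sketch of what "connect your model to a web search" looks like in practice: retrieve a few snippets, put them in the prompt, and let the model answer from that context instead of from memorised knowledge. `web_search` is a placeholder for whatever search API you actually use, and the prompt template is illustrative:

```python
# Retrieval-augmented prompting sketch: fetch web snippets and prepend them to the
# prompt so the model answers from context rather than memorised knowledge.
# `web_search` is a placeholder, not a real API.
from typing import List

def web_search(query: str, k: int = 3) -> List[str]:
    """Placeholder: call your search backend of choice and return the top-k snippets."""
    raise NotImplementedError("plug in a real search API here")

def build_prompt(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer the question using only the web results below.\n\n"
        f"Web results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The resulting prompt is then sent to the local Mistral instance as usual
# (llama.cpp server, text-generation-webui, or transformers' generate()).
```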

  • DorialexandreB · 10 months ago

    My current hunch is that they use a lot of not-easily-accessible online resources (including a specific archive owned by someone named Anna).

  • cleverestxB · 10 months ago

    Why can't we get a 20-34B version of this very capable Mistral?
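For context on how the community usually gets bigger models out of a 7B base: depth up-scaling, i.e. duplicating a span of decoder layers and then continuing training (this is roughly what frankenmerge tools automate). A rough sketch against the Hugging Face Mistral implementation; the layer ranges are arbitrary, and the result is not useful without further training:

```python
# Depth up-scaling ("frankenmerge") sketch: duplicate a span of Mistral-7B's decoder
# layers to build a larger model. Layer ranges are arbitrary choices, and the result
# needs continued training before it is actually useful.
import copy
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
layers = model.model.layers  # 32 decoder blocks in the 7B base

# Keep blocks 0-23, then repeat blocks 8-31 (one possible recipe among many).
stacked = [layers[i] for i in range(24)] + [copy.deepcopy(layers[i]) for i in range(8, 32)]
model.model.layers = torch.nn.ModuleList(stacked)
model.config.num_hidden_layers = len(stacked)

# Re-index attention layers so KV-cache bookkeeping still lines up
# (attribute present in recent transformers versions).
for i, layer in enumerate(model.model.layers):
    if hasattr(layer.self_attn, "layer_idx"):
        layer.self_attn.layer_idx = i

model.save_pretrained("mistral-depth-upscaled")  # 48 layers, roughly 10-11B params
```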

  • Monkey_1505B · 10 months ago

    Having used it a lot, I can say for sure that without much prompting it readily produces junk web text, URLs, etc., so it is not a fully filtered or fully synthetic dataset.

    My guess would be that it’s just ‘a bit better filtered than LLaMA-2’: a slightly better-quality set, trained on for slightly longer.

    My intuition, based on this, is that at any given parameter size, EVERYTHING open source could be optimized considerably more.
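If the "better filtered" guess is right, the filtering in question is probably the usual heuristic web-text cleaning (C4/Gopher-style rules). A sketch of that kind of filter; the thresholds are illustrative, and this is not Mistral's actual, undisclosed pipeline:

```python
# Heuristic web-text filter sketch (C4/Gopher-style rules). Thresholds are
# illustrative; Mistral's actual pipeline is undisclosed.
import re

URL_RE = re.compile(r"https?://\S+")

def looks_like_junk(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                                # too short to carry useful signal
        return True
    if len(URL_RE.findall(doc)) / len(words) > 0.05:   # URL-heavy boilerplate
        return True
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in doc) / max(len(doc), 1)
    if alpha_ratio < 0.8:                              # mostly symbols, markup, or numbers
        return True
    return False

docs = ["Click here http://spam.example to WIN!!!", "A long, coherent paragraph. " * 40]
clean = [d for d in docs if not looks_like_junk(d)]
```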

  • FPhamB · 10 months ago

    It’s simply the time bonus of coming after all the big models:

    - better filtering: kill outright junk

    - you can use the already-available big models (OpenAI and LLaMA) for data curation and filtering (see the sketch below)

    - you can use available synthetic data
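A sketch of the second point, using a stronger model to score and filter candidate training samples. `ask_strong_model` is a placeholder for whichever API you call, and the prompt and threshold are illustrative:

```python
# LLM-assisted data filtering sketch: ask a stronger model to rate candidate training
# samples and keep only the best. `ask_strong_model` is a placeholder, not a real API.
from typing import List

def ask_strong_model(prompt: str) -> str:
    """Placeholder: call OpenAI, a hosted Llama, or any other strong model here."""
    raise NotImplementedError("plug in a real model API here")

def score_sample(sample: str) -> int:
    reply = ask_strong_model(
        "Rate the following text from 1 (junk) to 10 (high-quality training data). "
        "Reply with a single integer.\n\n" + sample
    )
    digits = "".join(ch for ch in reply if ch.isdigit())
    return int(digits) if digits else 0

def filter_dataset(samples: List[str], threshold: int = 7) -> List[str]:
    return [s for s in samples if score_sample(s) >= threshold]
```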