Skywork-13B: a new foundation model trained on 3.2 trillion tokens

There are two noteworthy things covered here:

Skywork-13B, a new bilingual foundation model for English and Chinese. They also announce Skywork-13B-Chat enhanced specially for creative writing, Skywork-13B-Math for math, Skywork-13B-MM for multimodal capability, and a segment of their SkyPile Corpus comprising 150 billion tokens of Chinese web text.
Research into pretraining on in-domain data. Specifically, they show that some recent foundation models may be overfitted to benchmark data and may have been exposed to test data during training.

GitHub and models: https://github.com/SkyworkAI/Skywork/blob/main/README_EN.md

Tech report: https://arxiv.org/abs/2310.19341

Abstract

In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLMs of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state of the art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Training loss and validation loss:

Trajectory of important monitoring metrics during Stage-1 pre-training. Stage-1 pre-training consists of two sequential training sessions, represented by different colors in the loss curves (red for session 0 ∼ 2T and blue for session 2 ∼ 3T).

Benchmark evaluation:

https://preview.redd.it/tqvuls0cmixb1.png?width=786&format=png&auto=webp&s=2c339537baaecc8cc8fa3fdd71f44df732cd8674

Pre-training on in-domain data: a common practice?

Important points at a glance:

We evaluate an LLM’s language modeling loss on three datasets drawn from the same distribution: 1) The official GSM8K training set, 2) The official GSM8K test set, 3) A set composed of GSM8K-like samples generated by GPT-4. The corresponding losses are denoted as L_train, L_test, and L_ref, respectively. Theoretically, if a language model has not been exposed to any of the three datasets during pre-training, the three losses L_train, L_test, and L_ref should be approximately equivalent. However, if the model has been pre-trained on the training set or if the test data has been inadvertently exposed during the pre-training process, we would anticipate a notable discrepancy between L_train, L_test, and L_ref.

Models such as ChatGLM3-6B, Baichuan2-13B, Qwen-7B/14B, and Aquila2-34B display markedly lower loss on the training split than on the test split. Consequently, we postulate that these models may have been considerably pre-trained on GSM8K training split or similar data.

We believe that there is a valid risk in the practice of targeted pre-training, in that it compromises fairness in benchmarking. While through pre-training on in-domain data a model may excel at specific tasks, it remains uncertain how well it would perform on unseen tasks. Its capabilities may be overestimated based on the benchmark alone, which can lead to unfair comparisons between models and mislead users or stakeholders about the true capabilities of the model.
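
For anyone who wants to try the loss comparison described above, here is a minimal sketch in Python. It is not the authors' code. It assumes you have already computed per-sample language modeling losses for the GSM8K training split, the test split, and a GPT-4-generated reference set with whatever framework you use. The names `check_leakage` and `LeakageReport` and the `threshold` value of 0.1 are hypothetical choices for illustration; the report excerpt above only says to look for a "notable discrepancy" and does not give a cutoff here.

```python
from dataclasses import dataclass


@dataclass
class LeakageReport:
    test_minus_ref: float    # test loss minus reference loss
    test_minus_train: float  # test loss minus training loss
    verdict: str


def mean_loss(per_sample_losses):
    """Average a list of per-sample language modeling losses."""
    if not per_sample_losses:
        raise ValueError("need at least one loss value")
    return sum(per_sample_losses) / len(per_sample_losses)


def check_leakage(train_losses, test_losses, ref_losses, threshold=0.1):
    """Compare average losses on the three sets.

    threshold is an arbitrary illustrative cutoff (an assumption),
    not a value taken from the Skywork report.
    """
    l_train = mean_loss(train_losses)
    l_test = mean_loss(test_losses)
    l_ref = mean_loss(ref_losses)

    test_minus_ref = l_test - l_ref
    test_minus_train = l_test - l_train

    if test_minus_ref < -threshold:
        # Test loss well below loss on fresh look-alike data
        verdict = "possible test leakage"
    elif test_minus_train > threshold:
        # Training split loss well below test split loss
        verdict = "possible training-split exposure"
    else:
        verdict = "no notable discrepancy"

    return LeakageReport(test_minus_ref, test_minus_train, verdict)


# Made-up numbers purely for illustration, not measurements of any model
report = check_leakage(
    train_losses=[0.52, 0.48, 0.50],
    test_losses=[0.98, 1.02, 1.00],
    ref_losses=[1.01, 0.99, 1.00],
)
print(report.verdict)
```

The design choice is simply to compare set-level averages: a model that has seen none of the three sets should score about the same on all of them, so only the gaps between the averages matter, not their absolute values.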

Regular vs irregular results:

https://preview.redd.it/dll4shngmixb1.png?width=775&format=png&auto=webp&s=0438bab27bf25edcacdbb879279e0959c04b277c

To put this into perspective, QwenLM reports GSM8K 8-shot scores of 16.7 for Llama 2 7B, 29.6 for Llama 2 13B, and 42.2 for Code Llama 34B. In the same chart, Qwen-7B has a score of 51.7, Baichuan-13B comes in at 52.7, and Qwen-14B tops it off with a whopping 61.3.

It reminds me of the paper that came out last week from researchers at Google DeepMind and Princeton. They assessed models using a new evaluation and discerned a wide discrepancy:

A variant of the contamination issue is “cramming for the leaderboard.” It is possible to deliberately train a model on data similar to those used in the leaderboard evaluations. Such datasets are easy to generate from a small number of examples using existing strong models. If “cramming” happens during pre-training, it becomes hard to detect.

Several open models show signs of being over-trained for leaderboards at the expense of general-purpose language capabilities (“cramming”).

As the saying goes, pretraining on the test set is all you need.