Hi, everyone. Xwin-Math aims to improve the mathematical reasoning capabilities of LLMs. Today we are releasing the first version: a series of Llama 2 SFT models that use chain-of-thought (CoT) prompting.

GitHub link: Xwin-LM/Xwin-Math at main · Xwin-LM/Xwin-LM (github.com)

Model link: Xwin-LM (Xwin-LM) (huggingface.co)

Gradio Demo: Gradio (70B model)

Math capabilities on the GSM8K and MATH benchmarks

The Xwin-Math-70B-V1.0 model achieves 31.8 pass@1 on the MATH benchmark and 87.0 pass@1 on the GSM8K benchmark, placing it first among all open-source CoT models.

The Xwin-Math-7B-V1.0 and Xwin-Math-13B-V1.0 models achieve 66.6 and 76.2 pass@1 on the GSM8K benchmark, ranking first among all LLaMA-2-based open-source models at the 7B and 13B scales, respectively.
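For readers unfamiliar with the metric: pass@1 is the probability that a single sampled solution is correct. When several samples per problem are drawn, it is usually computed with the standard unbiased pass@k estimator. This is only an illustrative sketch of that formula, not our evaluation code; the function name `pass_at_k` is ours:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    completions is correct, given n sampled completions of which c are
    correct. pass@1 is the k=1 case."""
    if n - c < k:
        # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 3 correct out of 10 samples per problem
print(pass_at_k(10, 3, 1))  # 0.3
```

With k=1 the formula reduces to c/n, i.e. the fraction of correct samples, which is why greedy decoding with one sample per problem gives pass@1 directly as accuracy.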

We also evaluate Xwin-Math on other benchmarks such as SVAMP and MAWPS. Xwin-Math-70B-V1.0 approaches or surpasses the performance of GPT-3.5-Turbo (8-shot) on most benchmarks.

In addition, the release includes an evaluation toolkit that more reliably converts LaTeX formulas into SymPy objects, enabling more accurate assessment of mathematical ability. We found that, due to such evaluation limitations, GPT-4's results had previously been underestimated.
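To illustrate the idea behind such a toolkit (this is a minimal sketch, not the toolkit's actual code), the core step is to compare two answer strings symbolically rather than textually, so that equivalent forms like `1/2` and `0.5` match. The helper name `answers_equal` is ours; a real pipeline would first parse LaTeX (e.g. with SymPy's `parse_latex`) before this comparison:

```python
from sympy import simplify, sympify

def answers_equal(pred: str, gold: str) -> bool:
    """Judge two answer strings equivalent if their symbolic
    difference simplifies to zero; fall back to exact string
    comparison when parsing fails."""
    try:
        return simplify(sympify(pred) - sympify(gold)) == 0
    except Exception:
        return pred.strip() == gold.strip()

print(answers_equal("1/2", "0.5"))          # True
print(answers_equal("sqrt(8)", "2*sqrt(2)"))  # True
print(answers_equal("3", "4"))              # False
```

A purely string-based comparison would mark `sqrt(8)` and `2*sqrt(2)` as different, which is one way a model's (or GPT-4's) score can be underestimated.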

More information can be found in our GitHub repo. We run SFT on Llama 2 with a standard setup, using GPT-4 to augment the MATH and GSM8K training sets to approximately 100K examples in total. Our paper is still in progress, so more training details and further results will be released soon.

Any suggestions or comments are greatly welcome! Thanks! =)