Large language models (LLMs) are often pre-trained on massive datasets that contain a mixture of text and code. While code is essential in training models designed for programming tasks, it has become increasingly common to include it in the pre-training data of models that are not explicitly intended for code generation.
In a new paper, researchers at Cohere have systematically investigated the impact of code data in LLM pre-training on general performance beyond coding tasks.
“While there has been consensus anecdotally among practitioners that code data plays a vital role in LLMs’ performance, there has been only limited work analyzing the precise impact of code on non-code tasks,” the researchers write.
Their findings show that code plays a crucial role in improving the performance of LLMs on a wide range of tasks. The way they reached those results is also important and can have implications for training LLMs for real-world applications.
Investigating the impact of code
To understand the impact of code on general LLM performance, the researchers conducted a series of experiments. They considered different factors, including the amount of code in the training data, where code is added during the training process, the quality of the code and the scale of the models.
The researchers used a two-phase training process. First, they performed “continued pre-training” where they took pre-trained models and continued to train them on new datasets with different ratios of text and code for a fixed number of tokens. Then they used a “cooldown” phase, where they gave higher weights to higher-quality datasets during the final stages of training.
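To make the cooldown idea concrete, here is a minimal Python sketch of how a data mixture might be reweighted so that higher-quality sources are sampled more often late in training. The source names, the base weights and the boost multiplier are hypothetical values chosen for illustration; the article does not report the actual datasets or ratios the researchers used.

```python
import random


def normalize(weights):
    """Scale a dict of non-negative weights so the values sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def cooldown_weights(base_weights, quality_boost):
    """Multiply selected sources by a boost factor, then renormalize.

    Sources missing from quality_boost keep a multiplier of 1.0.
    """
    boosted = {
        name: w * quality_boost.get(name, 1.0)
        for name, w in base_weights.items()
    }
    return normalize(boosted)


def sample_sources(weights, n, seed=0):
    """Draw n source names in proportion to their mixture weights."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n)


# Hypothetical mixture for the continued pre-training phase.
continued = normalize({"web_text": 0.6, "curated_text": 0.2, "code": 0.2})

# Hypothetical cooldown: the curated source is sampled twice as often.
cooldown = cooldown_weights(continued, {"curated_text": 2.0})

samples = sample_sources(cooldown, 1000)
```

In this sketch the cooldown mixture still contains every source; only the relative sampling probabilities change, which is one simple way to read "giving higher weights to higher-quality datasets."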
The baseline model was trained on text only. The researchers also tested models that were first pre-trained on a balanced dataset of code and text and then further trained on text data during the continued pre-training phase. A third set of models was pre-trained on code-only data and then further trained on text.
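The three setups can be summarized as a small table of recipes. The Python sketch below is only an illustration of that comparison: the recipe names are made up, and the 0.5 code fraction for the "balanced" variant is an assumption, since the article does not give the exact ratio.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    name: str                  # hypothetical label for the variant
    init_code_fraction: float  # share of code in the initial pre-training data
    continued_data: str        # data used in the continued pre-training phase


RECIPES = [
    Recipe("text_only_baseline", 0.0, "text"),
    # 0.5 is an assumed reading of "balanced"; the exact ratio is not stated.
    Recipe("balanced_then_text", 0.5, "text"),
    Recipe("code_then_text", 1.0, "text"),
]


def describe(recipe):
    """Return a one-line summary of a recipe."""
    pct = int(round(recipe.init_code_fraction * 100))
    return (
        f"{recipe.name}: initialized with {pct}% code, "
        f"then continued on {recipe.continued_data}"
    )


for recipe in RECIPES:
    print(describe(recipe))
```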
The researchers evaluated the performance of the models at different scales, from 470 million to 2.8 billion parameters. They used a variety of benchmarks that measured the models’ abilities on world knowledge, natural language reasoning and code performance.
The benefits of code for non-coding tasks
The experiments revealed that code consistently improved the performance of LLMs on non-code-related tasks.
On natural language reasoning tasks, models trained on code consistently outperformed text-only models. Interestingly, the researchers found that pre-training the model with 100% code data led to the best performance on these benchmarks.
“This shows that initialization from a pre-trained model with a mix of code has a strong positive effect on NL reasoning tasks,” the researchers write.
For world knowledge tasks, a balanced mixture of code and text in the pre-training data resulted in the best performance. The researchers suggest that “performance on world knowledge tasks appears to depend on a more balanced data mixture for initialization and a larger proportion of text in the continual pre-training stage.”
On generative tasks, both the code-only and the balanced models outperformed the text-only model, which confirms that code data in the pre-training mix “not only improves reasoning but also helps the model produce better quality generations.”
The researchers also observed that the performance gains from adding code to pre-training data increased with model size. The improvements were most noticeable in world knowledge and code performance, followed by modest gains in natural language reasoning.
“These results show that the trade-off between natural language tasks and code generation increases with the model size,” the researchers write.
It is worth noting that LLMs often exhibit emergent behavior at very large scales, and the trends observed in the study might change at tens or hundreds of billions of parameters. Due to cost limitations, the researchers were not able to run their experiments at very large scales. However, they are optimistic that their findings will hold true for larger models.
“Given that our findings hold from 470M to 2.8B, we believe they should hold true for larger model sizes and token budgets,” they write.
The researchers also found that adding high-quality synthetic code to the pre-training data significantly boosted performance. This is particularly useful because it doesn’t rely on human-generated code, which is limited in quantity.
“Our synthetic code data was created using problem statements which were used to create Python solutions which were formally verified,” Viraat Aryabumi, Research Scholar at Cohere For AI and lead author of the paper, told VentureBeat. “This is a huge direction of future potential – and the main criteria practitioners should keep in mind if they want to harness synthetic code data is to use a performant teacher model to generate the code data.”
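The pipeline Aryabumi describes (problem statement, generated Python solution, verification) can be approximated with a simple filter. The sketch below is not the researchers' method: it swaps formal verification for plain test-case execution, and the candidate records, field names and example problems are all hypothetical. In practice the solutions would come from a teacher model and should be run in a sandbox.

```python
def passes_checks(solution_src, func_name, cases):
    """Run a candidate solution against (args, expected) pairs.

    This is test execution, a much weaker stand-in for formal verification.
    WARNING: exec() runs arbitrary code; sandbox untrusted solutions.
    """
    namespace = {}
    try:
        exec(solution_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions and runtime errors all count as failures.
        return False


def filter_verified(candidates):
    """Keep only candidates whose solution passes every check."""
    return [
        c for c in candidates
        if passes_checks(c["solution"], c["entry_point"], c["cases"])
    ]


# Hypothetical candidates, standing in for teacher-model output.
candidates = [
    {
        "problem": "Return the sum of two integers.",
        "entry_point": "add",
        "solution": "def add(a, b):\n    return a + b\n",
        "cases": [((1, 2), 3), ((-1, 1), 0)],
    },
    {
        "problem": "Return the larger of two integers.",
        "entry_point": "largest",
        "solution": "def largest(a, b):\n    return a\n",  # buggy on purpose
        "cases": [((1, 2), 2), ((3, 1), 3)],
    },
]

verified = filter_verified(candidates)
```

Only the first candidate survives the filter here, because the second fails one of its test cases. The point is that only checked solutions make it into the training data, whatever the checking mechanism.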
They also discovered that adding code-adjacent data, such as GitHub pull requests and commits, could improve the models’ abilities on reasoning tasks.
Incorporating code into the cooldown phase of training led to further improvements in the LLM’s performance on various non-code-related tasks. This finding can be relevant to enterprises, which are more likely to fine-tune models with their data rather than train their own models from scratch.
“The cooldown phase is probably closest to fine-tuning in terms of cost, data quality, and resources needed. It provides large gains, and so regardless of training stage we would recommend including code in the training mix,” Aryabumi said. “We expect including high-quality code (such as those from internal code bases, and code-adjacent data) can provide an improvement during cooldown.”
Given that Cohere is focused on providing LLMs for enterprise applications, it will be interesting to see how these findings affect its future model and product rollouts. For example, the company might provide a wider range of models pre-trained on different mixtures of code and text, each geared for different types of tasks. Enterprises can then fine-tune those models on their proprietary data to get the best performance for their specific type of application.
“We expect that the findings of our paper are really relevant to developers and will drive the release of more performant models,” Aryabumi said. “What is surprising about what we find is that code drives performance gains outside of code-tasks, and it is already informing how we think about training state-of-art models we serve.”