Fine-tune a Mistral-7b model with Direct Preference Optimization

Pre-trained Large Language Models (LLMs) are trained only for next-token prediction, so on their own they tend to continue a prompt rather than answer it. This is why these base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, or otherwise harmful. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.
RLHF presents the LLM with several candidate answers, which are ranked according to a desired behavior (helpfulness, low toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.
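To make the ranking idea concrete, here is a minimal sketch of how a ranked list of candidate answers could be turned into preference pairs. The helper name `ranked_to_pairs`, the sample texts, and the field names `prompt`, `chosen`, and `rejected` are assumptions for illustration (the field names follow a common convention for preference datasets); this is not the exact format or code used later in the article.

```python
from itertools import combinations


def ranked_to_pairs(prompt, ranked_answers):
    """Turn answers ordered best-to-worst into preference records.

    Hypothetical helper: every answer is paired with each answer ranked
    below it, giving one (chosen, rejected) record per pair.
    """
    pairs = []
    # combinations() keeps the input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Made-up example: three candidate answers, ranked from best to worst.
example = ranked_to_pairs(
    "How do I reset my password?",
    [
        "Open Settings, choose Security, then select Reset password.",
        "Try the settings page.",
        "Figure it out yourself.",
    ],
)

for record in example:
    print(record["chosen"], ">", record["rejected"])
```

Three ranked answers yield three pairs. Each record tells the training procedure which of two answers to the same prompt is preferred, which is the only signal a preference-based method needs.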
In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). For this purpose, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We'll see that it significantly improves the performance of the base model on the Open LLM Leaderboard.
As per usual, the code is available on GitHub and Google Colab.
Update: Jessie Davids, a reader who used this article and code, managed to create the best-performing model in the ~7B parameter category of the Open LLM Leaderboard. Congrats to him!