Fine-tune a Mistral-7b model with Direct Preference Optimization

Author: Murphy | Views: 21370 | Time: 2025-03-22 23:31:38

Pre-trained Large Language Models (LLMs) are trained only for next-token prediction, so out of the box they tend to continue a prompt rather than answer the question it asks. This is why base models are then fine-tuned on pairs of instructions and answers to act as helpful assistants. However, this process can still be flawed: fine-tuned LLMs can be biased, toxic, harmful, etc. This is where Reinforcement Learning from Human Feedback (RLHF) comes into play.

In RLHF, the LLM is given several candidate answers to the same prompt, ranked according to a desired behavior (helpfulness, toxicity, etc.). The model learns to output the best answer among these candidates, hence mimicking the behavior we want to instill. Often seen as a way to censor models, this process has recently become popular for improving performance, as shown in neural-chat-7b-v3-1.
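To make the idea of ranked candidates concrete, here is a minimal sketch of how a ranking could be turned into preference pairs. The function name `build_preference_pairs` and the `prompt` / `chosen` / `rejected` field names are assumptions chosen for illustration, and the example answers are made up.

```python
from itertools import combinations


def build_preference_pairs(prompt, ranked_answers):
    """Turn answers ranked best-to-worst into (chosen, rejected) pairs.

    `ranked_answers` is assumed to be sorted from most to least preferred.
    """
    pairs = []
    # combinations() keeps input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_answers, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical example: three candidate answers, already ranked by a human.
pairs = build_preference_pairs(
    "How do I reverse a list in Python?",
    [
        "Use my_list[::-1] or my_list.reverse().",  # most helpful
        "You can write a loop.",                    # vague
        "I don't know.",                            # unhelpful
    ],
)
for pair in pairs:
    print(pair["chosen"], ">", pair["rejected"])
```

Each pair tells the model which of two answers is preferred for the same prompt, which is the signal that preference-based fine-tuning methods consume.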

In this article, we will create NeuralHermes-2.5 by fine-tuning OpenHermes-2.5 using an RLHF-like technique: Direct Preference Optimization (DPO). For this purpose, we will introduce a preference dataset, describe how the DPO algorithm works, and apply it to our model. We'll see that it significantly improves the performance of the base model on the Open LLM Leaderboard.
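As a rough preview of what DPO optimizes, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It assumes we already have the summed log-probabilities of the chosen and rejected answers under the model being trained and under a frozen reference copy; the function name, the argument names, the example numbers, and the default `beta=0.1` are illustrative assumptions, not values taken from this article.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of answers.

    Each argument is the total log-probability of an answer given the prompt.
    `beta` (assumed value) controls how far the policy may drift from the
    reference model.
    """
    # How much more (or less) likely each answer became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical log-probabilities: the policy equals the reference, so the
# margin is zero and the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# The policy now favors the chosen answer more than the reference does,
# so the loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

The loss falls when the trained model raises the likelihood of the chosen answer, relative to the reference, by more than it raises the likelihood of the rejected one. This is only a single-pair illustration; an actual training run would compute it over batches of token-level log-probabilities.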

As per usual, the code is available on GitHub and Google Colab.

Update: Jessie Davids, a reader who used this article and code, managed to create the best-performing model in the ~7B-parameter category of the Open LLM Leaderboard. Congrats to him!

Tags: Artificial Intelligence, Data Science, Editors Pick, Large Language Models, Programming
