What is fine-tuning in LLMs and why is it necessary?
Due to the unsupervised nature of training LMs, it is difficult to achieve precise control of their behavior. Pre-trained models are only trained on a next-token prediction task. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF).
For example, we may want our AI coding assistant to understand common programming mistakes in order to correct them; nevertheless, when generating code, we would like to bias the model toward the (potentially rare) high-quality coding ability present in its training data, or toward behavior demonstrated in additional preference training data.
At a high level, existing methods (RLHF/RLAIF) instill the desired behaviors into a language model using curated sets of human preferences representing the types of behaviors that humans find safe and helpful.
RLHF consists of three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization.
SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model $\pi_{SFT}$.
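As a concrete illustration, a minimal SFT loop in PyTorch with Hugging Face `transformers` might look as follows; the base model name, the demonstration strings, and the learning rate are placeholders rather than a specific recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model and demonstration data; in practice this would be a
# pre-trained LM and curated high-quality examples for the downstream task.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = ["Question: ...\nAnswer: ..."]  # hypothetical SFT examples

model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # SFT is still next-token prediction, just restricted to curated data:
    # the labels are the input ids and the model returns the cross-entropy loss.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The resulting model plays the role of pi_SFT in the later phases.
```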
Reward Modelling Phase: Starting from the policy $\pi_{SFT}$ (which will later be fine-tuned with RL to perform the task well), we collect a preference dataset: human labelers are asked to pick which of several candidate responses $y_i$ is the best response to a given input $x \in X$. Here the labeler chooses between four options $(y_0, y_1, y_2, y_3)$; let $b \in \{0, 1, 2, 3\}$ be the option they select. Having collected a dataset $S$ of $(x, y_0, y_1, y_2, y_3, b)$ tuples, we fit a reward model $r: X \times Y \to \mathbb{R}$ using the loss:
$$ \text{loss}(r) = -\mathbb{E}_{(x,\{y_i\}_i,b) \sim S} \left[ \log \frac{\exp\big(r(x,y_b)\big)}{\sum_i \exp\big(r(x,y_i)\big)} \right] \hspace{4em}(1)
$$
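In code, the loss in $(1)$ is a standard cross-entropy over the candidates' reward scores. A minimal sketch in PyTorch (assuming the reward model has already produced a scalar score $r_\phi(x, y_i)$ for each candidate; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_loss(rewards: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Loss (1): negative log-softmax probability of the chosen response.

    rewards: (batch, K) scores r(x, y_i) for the K candidate responses (K = 4 above).
    b:       (batch,) index of the response the labeler picked.
    """
    # cross_entropy takes a log-softmax over the K candidates and selects index b,
    # i.e. -log( exp(r(x, y_b)) / sum_i exp(r(x, y_i)) ), averaged over the batch.
    return F.cross_entropy(rewards, b)

# Toy usage: random scores for a batch of 8 comparisons with 4 candidates each.
loss = reward_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```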
There are a number of approaches to modelling preferences, the Bradley-Terry (BT) model being a popular choice; the loss in $(1)$ is its natural generalization from two candidates to a softmax over several. The preferences are assumed to be generated by some latent reward model $r(x, y)$, to which we do not have access. In the context of LMs, a network $r_\phi(x, y)$ is trained to approximate the underlying latent reward model $r(x, y)$.
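When only two candidate responses are compared, $(1)$ reduces to the familiar pairwise BT form:

$$ p(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big), $$

so fitting $r_\phi$ amounts to minimizing $-\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$, where $y_w$ is the preferred and $y_l$ the dispreferred response.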
RL Fine-Tuning Phase: During the RL phase, we use the learned reward function to provide feedback to the language model. In particular, the following optimization problem is solved:
$$ \max_{\pi_\theta} \, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x,y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x)\big] \hspace{4em}(2)
$$
where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{ref} = \pi_{SFT}$. The standard approach is to construct the reward function $r(x, y) = r_\phi(x, y) - \beta \big(\log \pi_\theta(y \mid x) - \log \pi_{ref}(y \mid x)\big)$ and maximize it using Proximal Policy Optimization (PPO).
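A sketch of this shaped reward for a batch of sampled responses (tensor names and shapes are assumptions; the PPO update itself is omitted):

```python
import torch

def shaped_reward(r_phi: torch.Tensor,
                  logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor,
                  beta: float) -> torch.Tensor:
    """r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    r_phi:       (batch,) reward-model scores for the sampled responses y.
    logp_policy: (batch,) log pi_theta(y | x), summed over response tokens.
    logp_ref:    (batch,) log pi_ref(y | x), summed over response tokens.
    beta:        coefficient of the KL penalty in (2).
    """
    return r_phi - beta * (logp_policy - logp_ref)
```

The KL term keeps $\pi_\theta$ from drifting too far from $\pi_{ref}$, which preserves generation diversity and keeps the policy in the region where the reward model remains accurate.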
PPO, however, has many hyperparameters to tune and is computationally expensive, since the training loop has to sample completions from the LM and score them with the reward model.