What is fine-tuning in LLMs and why is it necessary?
Due to the unsupervised nature of training LMs, it is difficult to achieve precise control of their behavior. Pre-trained models are only trained on a next-token prediction task. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF).
For example, we may want our AI coding assistant to understand common programming mistakes in order to correct them; nevertheless, when generating code, we would like to bias the model toward the (potentially rare) high-quality coding ability present in its training data, or toward behavior demonstrated in additional preference training data.
At a high level, existing methods (RLHF/RLAIF) instill the desired behaviors into a language model using curated sets of human preferences representing the types of behaviors that humans find safe and helpful.
RLHF consists of three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization.
SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model $\pi_{SFT}$.
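As a concrete illustration, a minimal SFT loop in PyTorch with Hugging Face `transformers` might look as follows; the base model name, the demonstration strings, and the learning rate are placeholders rather than a specific recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model and demonstration data; in practice this would be a
# pre-trained LM and curated high-quality examples for the downstream task.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

demonstrations = ["Question: ...\nAnswer: ..."]  # hypothetical SFT examples

model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # SFT is still next-token prediction, just restricted to curated data:
    # the labels are the input ids and the model returns the cross-entropy loss.
    out = model(**batch, labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The resulting model plays the role of pi_SFT in the later phases.
```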
Reward Modelling Phase: Starting from the policy $\pi_{SFT}$ (which will later be fine-tuned with RL to perform the task well), we collect a preference dataset: human labelers are asked to pick which of several candidate responses $y_i$ is the best response to a given input $x \in X$. Here the labeler chooses between four options $(y_0, y_1, y_2, y_3)$; let $b \in \{0, 1, 2, 3\}$ be the option they select. Having collected a dataset $S$ of $(x, y_0, y_1, y_2, y_3, b)$ tuples, we fit a reward model $r: X \times Y \to \mathbb{R}$ using the loss:
$$ \text{loss}(r) = -\mathbb{E}_{(x,\{y_i\}_i,b) \sim S} \left[ \log \frac{\exp\big(r(x,y_b)\big)}{\sum_i \exp\big(r(x,y_i)\big)} \right] \hspace{4em}(1)
$$
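In code, the loss in $(1)$ is a standard cross-entropy over the candidates' reward scores. A minimal sketch in PyTorch (assuming the reward model has already produced a scalar score $r_\phi(x, y_i)$ for each candidate; the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def reward_loss(rewards: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Loss (1): negative log-softmax probability of the chosen response.

    rewards: (batch, K) scores r(x, y_i) for the K candidate responses (K = 4 above).
    b:       (batch,) index of the response the labeler picked.
    """
    # cross_entropy takes a log-softmax over the K candidates and selects index b,
    # i.e. -log( exp(r(x, y_b)) / sum_i exp(r(x, y_i)) ), averaged over the batch.
    return F.cross_entropy(rewards, b)

# Toy usage: random scores for a batch of 8 comparisons with 4 candidates each.
loss = reward_loss(torch.randn(8, 4), torch.randint(0, 4, (8,)))
```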
There are a number of approaches to modelling preferences, the Bradley-Terry (BT) model being a popular choice; the loss in $(1)$ is its natural generalization from two candidates to a softmax over several. The preferences are assumed to be generated by some latent reward model $r(x, y)$, to which we do not have access. In the context of LMs, a network $r_\phi(x, y)$ is trained to approximate the underlying latent reward model $r(x, y)$.
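When only two candidate responses are compared, $(1)$ reduces to the familiar pairwise BT form:

$$ p(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)} = \sigma\big(r(x, y_1) - r(x, y_2)\big), $$

so fitting $r_\phi$ amounts to minimizing $-\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$, where $y_w$ is the preferred and $y_l$ the dispreferred response.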
RL Fine-Tuning Phase: During the RL phase, we use the learned reward function to provide feedback to the language model. In particular, the following optimization problem is solved:
$$ \max_{\pi_\theta} \, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x,y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x)\big] \hspace{4em}(2)
$$
where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{ref} = \pi_{SFT}$. The standard approach is to construct the reward function $r(x, y) = r_\phi(x, y) - \beta \big(\log \pi_\theta(y \mid x) - \log \pi_{ref}(y \mid x)\big)$ and maximize it using Proximal Policy Optimization (PPO).
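A sketch of this shaped reward for a batch of sampled responses (tensor names and shapes are assumptions; the PPO update itself is omitted):

```python
import torch

def shaped_reward(r_phi: torch.Tensor,
                  logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor,
                  beta: float) -> torch.Tensor:
    """r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    r_phi:       (batch,) reward-model scores for the sampled responses y.
    logp_policy: (batch,) log pi_theta(y | x), summed over response tokens.
    logp_ref:    (batch,) log pi_ref(y | x), summed over response tokens.
    beta:        coefficient of the KL penalty in (2).
    """
    return r_phi - beta * (logp_policy - logp_ref)
```

The KL term keeps $\pi_\theta$ from drifting too far from $\pi_{ref}$, which preserves generation diversity and keeps the policy in the region where the reward model remains accurate.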
PPO, however, has many hyperparameters to tune and is computationally expensive, since the training loop has to sample completions from the LM and score them with the reward model.