Yesterday we explained how to replicate Alpaca, a fine-tuned version of LLaMA that can follow instructions.
It was trained on a set of instructions and answers to turn it into an assistant. It turns out you can swap that dataset for the script of a TV show, and the model will speak in the voice of that show's characters.
We were surprised by how easy this was. With a small amount of data (~60K lines of dialog) and 90 minutes of fine-tuning, you can make LLaMA output text in the voice of the characters in the dataset.
It makes a remarkably good Homer Simpson bot. Occasionally, it’s funny:
Marge Simpson: Where are the children?
Homer Simpson: I'll tell you one thing, they're not at Moe's.
Marge Simpson: How was your day, Homie?
Homer Simpson: Oh, it was just lovely, Marge. First off, I got to sleep in till five A.M. Then I had a leisurely breakfast of malt liquor and pickled beets.
Marge Simpson: Doesn't that give you a stomach ache?
Homer Simpson: Nah, I've got a cast-iron stomach. You should see me eat a ten pound ham.
Marge Simpson: What adventures did you get up to today, Homie?
Homer Simpson: I had a bit of a scare when I thought my pants were stuck in the toaster, but apart from that it's been a quiet, uneventful day.
Here’s how we made it.
First, grab simpsons_script_lines.csv from The Simpsons dataset on Kaggle. This has the scripts for all Simpsons episodes up through season 27.
We only pulled out the data for seasons 1-12, because they’re the good ones. The resulting dataset has 61k lines of dialog and 1.1M tokens.
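If you want to reproduce that filtering step, here's a minimal sketch using pandas. It assumes the column names from the Kaggle files (script lines carry an episode_id and a speaking_line flag, and simpsons_episodes.csv maps episode ids to seasons); adjust if your copy of the dataset differs.

```python
import pandas as pd

# The Kaggle dataset ships the dialog and the episode metadata in separate files.
lines = pd.read_csv("simpsons_script_lines.csv", low_memory=False)
episodes = pd.read_csv("simpsons_episodes.csv")

# Keep only episodes from seasons 1-12.
early = episodes.loc[episodes["season"] <= 12, "id"]
lines = lines[lines["episode_id"].isin(early)]

# Keep only spoken dialog; the file also contains stage directions.
lines = lines[lines["speaking_line"].astype(str).str.lower() == "true"]

print(f"{len(lines):,} lines of dialog")
```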
We want to train LLaMA to reproduce the voice of the characters. Alpaca's recipe was designed to produce a helpful instruction-following assistant, and this task is a bit different.
To accomplish this, we generated a training example for every line of dialog in every scene, consisting of the previous lines in the scene, the name of the character who speaks next, and that character's line. Here's an example of the shape of each record (the dialog shown is illustrative, not a real row from the dataset):
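```json
{
  "context": "Marge Simpson: Homer, have you seen the kids?\nHomer Simpson: They're not at Moe's, I can tell you that much.",
  "character": "Marge Simpson",
  "line": "Why would they be at Moe's?"
}
```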
Here’s a notebook which shows the parsing process.
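In outline, the parsing looks something like this (continuing from the filtered lines DataFrame above; scene boundaries are approximated as a change of location within an episode, and the field names are the illustrative ones from the example record):

```python
# Approximate scene boundaries: a new scene starts whenever the location changes.
new_scene = lines["location_id"] != lines["location_id"].shift()
scene_id = new_scene.cumsum()

examples = []
for _, scene in lines.groupby([lines["episode_id"], scene_id]):
    dialog = list(zip(scene["raw_character_text"], scene["spoken_words"]))
    # For every line after the first, the preceding lines become the context.
    for i in range(1, len(dialog)):
        examples.append({
            "context": "\n".join(f"{who}: {said}" for who, said in dialog[:i]),
            "character": dialog[i][0],
            "line": dialog[i][1],
        })
```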
Once we had parsed these lines, we modified the training prompt and script from the existing Alpaca codebase so that the model is prompted to complete the next line in the context of the scene.
The prompt looks something like this (a sketch; the exact template lives in the training script):
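```
Below is a scene from a TV show, followed by the name of the character who
speaks next. Write that character's next line of dialog.

### Scene:
{context}

### Next character:
{character}

### Response:
```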
From here on, it’s the same process as training Alpaca, which we covered in our blog post yesterday.
We made a few modifications to the training script, so after you've checked out the repository, you'll need to switch to the homerbot branch:
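```shell
git checkout homerbot
```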
Once you have a trained model, you can generate scripts with cog predict commands like so:
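Something like this works, assuming the model exposes a single prompt input (check predict.py in the repo for the exact input names):

```shell
cog predict -i prompt="Marge Simpson: How was your day, Homie?
Homer Simpson:"
```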
Vanilla LLaMA and GPT-4 struggle to produce output in the voice of The Simpsons. Fine-tuning LLaMA to speak as a particular character was surprisingly fast and easy, and it was only possible because the model is open source.
We're going to be posting more guides to tinkering with open-source language models. Follow us on Twitter to keep up.
We can't wait to see what you build, too. Join us in #llama on Discord to share what you’ve built.