Fine-tuning open source large language models (LLMs)

Most language models are trained on huge datasets, which makes them very general-purpose.

For example, Llama 1 was trained on a 1.4 trillion token dataset, including Wikipedia, millions of web pages via CommonCrawl, open source GitHub repos, and Stack Exchange posts [1]. Giant training datasets like this are a big part of why language models can do everything from writing code to answering historical questions to creating poems.

But generalized language models aren’t good at everything, and that’s where fine-tuning comes in. Fine-tuning is the process of taking a pre-trained language model and training it on a smaller, more specific dataset. This allows you to customize the model to get better at a particular task.

For example, we might want to:

  • Teach the model information that wasn’t in its training data: The original ChatGPT was famously trained on data only up to 2021. What if you want your model to know about the FTX crash? Or Taylor Swift and Travis Kelce?!
  • Generate text in a particular style: Train your model to speak like you and your friends, or your favorite author.
  • Train a smaller model to perform as well as (or better than) a larger model on a particular task: Mistral 7B can be fine-tuned to match GPT-4 on a narrow task, but with much lower latency and cost.

Some of these tasks can be accomplished by adjusting your prompt, but prompting is always constrained by the context window. Fine-tuning improves the model itself, so it isn’t subject to that limit.

The first step of fine-tuning is to collect your data. Training data is usually a JSONL text file, where each line is a JSON object with either prompt and completion keys or a single text key.

If you’re building an instruction-tuned model, like a chatbot that answers questions, structure your data as an object with a prompt key and a completion key on each line:

{"prompt": "...", "completion": "..."}
{"prompt": "Why don't scientists trust atoms?", "completion": "Because they make up everything!"}
{"prompt": "Why did the scarecrow win an award?", "completion": "Because he was outstanding in his field!"}
{"prompt": "What do you call fake spaghetti?", "completion": "An impasta!"}

If you’re building an autocompleting model for tasks like completing a user’s writing, code completion, finishing lists, or few-shotting specific tasks like classification, or if you want more control over the format of your training data, structure each line as a single JSON object with a text key and a string value:

{"text": "..."}
{"text": "..."}
{"text": "..."}

For more on structuring your dataset for fine-tuning, check out our fine-tuning docs.

You can fine-tune open source models on lots of hosting providers, or on your own machine. We’re biased, so we’d recommend Replicate 😉, but services like Google Colab and Brev.dev are also good options.

If you want to fine-tune a closed model like GPT-3.5, you’ll need to use OpenAI’s API.
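
For example, with OpenAI’s Python SDK the flow is: upload a JSONL file, then start a fine-tuning job against it. A minimal sketch; note that OpenAI expects its own chat-style JSONL format rather than the prompt/completion format above, so this only shows the shape of the calls:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file, then kick off the fine-tuning job.
file = client.files.create(file=open("chat_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=file.id, model="gpt-3.5-turbo")
print(job.id, job.status)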

We have a fine-tuning guide that walks you through the whole process on Replicate. If you want to fine-tune on Colab, this notebook is a good starting point.
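
If you’d rather start a training from code than from the web UI, the Replicate Python client can kick one off. A minimal sketch; the version ID, destination, and data URL are placeholders (copy the real version ID from the model’s “Train” tab, and set REPLICATE_API_TOKEN in your environment):

import replicate

training = replicate.trainings.create(
    version="meta/llama-2-7b:<version-id>",  # placeholder version ID
    input={
        "train_data": "https://example.com/train_data.jsonl",  # your JSONL file
        "num_train_epochs": 3,
    },
    destination="your-username/your-model",  # model that receives the weights
)

print(training.status)  # e.g. "starting"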

There are lots of parameters you can adjust when you fine-tune. On Replicate, every model that can be fine-tuned has a “Train” tab that lets you adjust these parameters. We try to set reasonable defaults for you.

For example, here are the parameters for Llama 2 7B:

  • train_data (required): URL to a file of training data where each row is a JSON record in the format {"text": ...} or {"prompt": ..., "completion": ...}. Must be JSONL.
  • num_train_epochs (optional, default=3): Number of epochs (iterations over the entire training dataset) to train for.
  • train_batch_size (optional, default=4): Global batch size. This specifies the batch size that will be used to calculate gradients. Optimal batch size is data dependent; larger sizes train faster but may cause OOMs. 8 often works well for this configuration of llama-2-7B.
  • micro_batch_size (optional, default=4): Micro batch size. This specifies the on-device batch size; if this is less than train_batch_size, gradient accumulation will be activated (see the sketch after this list).
  • num_validation_samples (optional, default=50): Number of samples to use for validation. If run_validation is True and validation_data is not specified, this number of samples will be selected from the tail of the training data. If validation_data is specified, this number of samples will be selected from the head of the validation data, up to the size of the validation data.
  • validation_data (optional): URL to a file of eval data where each row is a JSON record in the format {"text": ...} or {"prompt": ..., "completion": ...}. Must be JSONL.
  • validation_batch_size (optional, default=1): Batch size for evaluation. For small validation sets, you should use the default batch size of 1.
  • run_validation (optional, default=True): Whether to run validation during training.
  • validation_prompt (optional, default=None): If provided, this prompt will be used to generate a model response during each validation step. Must be a string formatted prompt. Note: this is not implemented for QLoRA training.
  • learning_rate (optional, default=1e-4): Learning rate!
  • pack_sequences (optional, default=False): If ‘True’, sequences will be packed into single sequences of up to chunk_size tokens. This improves computational efficiency.
  • wrap_packed_sequences (optional, default=False): If ‘pack_sequences’ is ‘True’, this will wrap packed sequences across examples, ensuring a constant sequence length but breaking prompt formatting.
  • chunk_size (optional, default=2048): If ‘pack_sequences’ is ‘True’, this will chunk sequences into chunks of this size.
  • peft_method (optional, default=’lora’): Training method to use. Currently, only ‘lora’ and ‘qlora’ are supported.
  • seed (optional, default=42): Random seed to use for training.
  • lora_rank (optional, default=8): Rank of the LoRA matrices.
  • lora_alpha (optional, default=16): Alpha parameter for scaling LoRA weights; weights are scaled by alpha/rank.
  • lora_dropout (optional, default=0.05): Dropout for LoRA training.
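
To make a couple of these parameters concrete: train_batch_size and micro_batch_size together determine how many gradient accumulation steps happen per optimizer step, and lora_alpha and lora_rank set the LoRA scaling factor. Here’s a minimal sketch of the arithmetic, along with how the same values would look in the Hugging Face peft library (the target_modules choice is our assumption, not something this trainer exposes):

from peft import LoraConfig

# Gradient accumulation: with train_batch_size=8 and micro_batch_size=4,
# gradients accumulate over 8 // 4 = 2 micro batches per optimizer step.
train_batch_size, micro_batch_size = 8, 4
accumulation_steps = train_batch_size // micro_batch_size  # 2

# LoRA scaling: weight updates are scaled by alpha / rank = 16 / 8 = 2.0.
config = LoraConfig(
    r=8,                # lora_rank
    lora_alpha=16,      # lora_alpha
    lora_dropout=0.05,  # lora_dropout
    target_modules=["q_proj", "v_proj"],  # assumed; varies by model
)
print(config.lora_alpha / config.r)  # 2.0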

Want to try fine-tuning a model? Check out our fine-tuning guide to get started.

  1. Touvron et al., 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971