Use cases for open source language models

The release of open source models has led to an explosion of different models, each with their own strengths and weaknesses. This page is a guide to the different use cases for language models, and how to choose the right model for your needs.

Currently, the most powerful models have not been released as open source. They are trained on massive datasets, often including proprietary or licensed data alongside large web crawls, and are then further fine-tuned for specific behaviors such as following instructions.

Models like OpenAI’s GPT-4, Anthropic’s Claude or Google’s Bard are powerful generalists. They have broad factual knowledge and reasoning skills, and can perform a wide variety of tasks. They are accessed via an API or a hosted interface rather than downloaded, so using them means going through (and usually paying) the provider.

For complicated tasks or specific domains, closed models may be the only option. But open source is catching up quickly, and for many tasks, open source models are already competitive.

There are three main types of language models:

Base models, also known as foundation models, are LLMs that have been trained on raw text. They operate autoregressively: the model takes a prompt and predicts the next token, appends that token to the input, predicts the following token, and so on.

Base models are the most flexible type of LLM and can be prompted or fine-tuned for a wide variety of tasks. But they can be tricky to prompt, because they weren’t trained to follow instructions; they simply continue the text they are given. You have to phrase your prompt so that the most plausible continuation is the output you actually want.

Instruction-tuned models are base models that have been fine-tuned on a dataset of instruction-answer pairs. This process teaches the model to follow instructions, and makes it easier to prompt. Instruction-tuned models are often fine-tuned on a specific task, such as summarization, translation or question answering.
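The practical difference shows up in how you phrase prompts. A base model needs text it can plausibly continue, while an instruction-tuned model can be asked directly. A sketch of the two styles for a summarization task (the templates are generic illustrations; each real model defines its own expected format):

```python
text = "Open source models are catching up to closed ones."

# Base model: frame the task so the desired output is the natural continuation.
base_prompt = (
    f"Text: {text}\n"
    "One-sentence summary:"
)

# Instruction-tuned model: simply state the instruction.
instruct_prompt = (
    f"Summarize the following text in one sentence.\n\n{text}"
)
```

With the base prompt, the model completes the line after “One-sentence summary:”; with the instruction prompt, the fine-tuning has taught the model to treat the first line as a command.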

Some models have been fine-tuned using reinforcement learning (RL). There are various methods for this, with acronyms like RLHF, RLAIF and PPO; DPO is a closely related preference-tuning method that reaches a similar goal without an explicit RL loop. RL tuning is most often used to align very large models with human preferences, but it can also target a specific task.

RL tuning is a powerful technique, but how much it improves on plain instruction tuning is still debated. It is also difficult to reproduce and compute-intensive. For these reasons, RL-tuned models are still relatively rare in the open source world.
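To make the preference-tuning idea concrete, the DPO objective mentioned above can be written down directly: for each preference pair, it rewards the policy for preferring the chosen response over the rejected one, relative to a frozen reference model. A minimal sketch on toy log-probabilities (pure Python, no real model; in practice these numbers come from summing token log-probs over each response):

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Inputs are total log-probabilities of the chosen/rejected responses
    under the policy being trained and under the frozen reference model.
    """
    # How much more the policy prefers "chosen" over "rejected",
    # measured relative to the reference model's preference.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # Negative log-sigmoid: shrinks as the margin grows.
    return -math.log(1 / (1 + math.exp(-beta * margin)))
```

When the policy agrees exactly with the reference (margin 0), the loss is log 2; as the policy learns to prefer the chosen response more strongly than the reference does, the loss falls toward zero.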

Different models are good at different things. Here are some guidelines for choosing the right model for your needs.

Language models can accomplish a variety of tasks, including:

  • Classification: classify text into categories
  • Conversion: convert text from one format to another
  • Completion: complete a text prompt
  • Chat: have a conversation with a model
  • RAG: retrieval-augmented generation, i.e. answer using retrieved documents
  • Code gen: generate code
  • Grammars: generate text with a specific structure
  • Tool use: use a model to select and call other models or functions
  • Agents: let a model plan and execute actions
  • Multimodal: understand images, video, audio, etc.

These are not mutually exclusive. For example, a model can be used for both classification and completion. But some models are better suited to certain tasks than others.

In most cases, you want to use the smallest model that can do the job. Smaller models are cheaper, faster and more efficient. But smaller models are also less powerful and less capable of generalization, and may not be able to do what you need.

The tasks above are listed in order of complexity (roughly), from easiest to hardest. The more complex the task, the larger the model you will need. It’s worth trying several models to see which one works best for your needs.
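One way to operationalize “smallest model that can do the job” is an escalation loop: try the cheapest model first and move up the ladder only when its output fails a quality check. A sketch of the pattern (the model names, the stubbed inference call and the check are all placeholders, not recommendations):

```python
# Candidate models ordered from smallest/cheapest to largest (placeholder names).
MODELS = ["small-model", "medium-model", "large-model"]

def run_model(model: str, prompt: str) -> str:
    """Placeholder for a real inference call to the named model."""
    # Pretend only the larger models produce an acceptable answer here.
    return "good answer" if model != "small-model" else "???"

def is_acceptable(output: str) -> bool:
    """Task-specific quality check; here, just 'looks like an answer'."""
    return output != "???"

def answer(prompt: str) -> tuple[str, str]:
    output = ""
    for model in MODELS:                 # escalate through the ladder
        output = run_model(model, prompt)
        if is_acceptable(output):
            return model, output
    return MODELS[-1], output            # fall back to the largest model
```

The quality check is the hard part in practice, but even a crude one (length bounds, required format, a regex) lets you route most traffic to a small model and reserve the large one for the cases that need it.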