Over the last few months, Retrieval Augmented Generation (RAG) has emerged as a popular technique for getting the most out of Large Language Models (LLMs) like Llama-2-70b-chat.
In this post, we’ll explore the creation of an example RAG “app” which helps you generate click-worthy titles for Hacker News submissions. All you need to do is provide a working title, idea, or phrase, and even the most boring of words will be transformed into a title destined for the front page of Hacker News.
Admittedly, this is a basic toy idea. It’s not revolutionary, and it may not land your post on the front page of Hacker News. That’s okay, because that’s not the point: the point is to provide you with a practical hands-on feel for how RAG works, and give you the understanding you need to use this technique in your own projects and systems.
Retrieval augmented generation is a technique of enriching your language model outputs by retrieving contextual information from an external data source, and including that information as part of your language model prompts. The idea is that when you augment a language model prompt with meaningful external data, the language model is able to respond with deeper understanding and relevance.
This sort of pattern effectively extends the functional context length of a given language model, because instead of being limited to 4,096 tokens (5-ish pages of text), you can query an entire 1,000 page book for meaningful passages, and pull in just the handful of sentences needed to create high-quality responses.
RAG is very flexible in that you can use the same data sources across multiple language models or upgrade to the latest language models without needing to re-train or fine-tune any specific models.
Other techniques such as fine-tuning can provide great results for specific use-cases like setting the style/tone of a language model, or teaching a language model how to perform a task or skill that is difficult or too large to explain in every prompt. But if your goal is to integrate an LLM with data, RAG is the way.
You may have heard that you need to use embeddings and a vector store to build a RAG application, but that is not entirely true. Those components can be helpful, and we will cover them in this post, but they are not strictly required.
At its most basic level, RAG is really just pasting additional context into the text of your language model prompt. It can be as simple as retrieving the current weather for San Francisco from a realtime weather API, and then passing the JSON response forward in the text of the prompt. Nothing fancy.
In the example prompt below, we instruct the mistral-7b-instruct model to read the input JSON data and respond with a weather report, as if it were a human meteorologist.
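A minimal sketch of that kind of prompt, using the Replicate Python client, might look like this. The weather JSON and the model slug are illustrative assumptions, not real API output:

```python
import json
import replicate

# Hypothetical weather data -- in a real app this would come from a weather API.
weather = {
    "location": "San Francisco, CA",
    "temperature_f": 62,
    "conditions": "fog clearing by noon",
    "wind_mph": 12,
}

prompt = f"""You are a friendly meteorologist. Read the following JSON weather data
and respond with a short spoken-style weather report:

{json.dumps(weather, indent=2)}
"""

# Model slug is an assumption -- check Replicate for the current mistral-7b-instruct version.
output = replicate.run(
    "mistralai/mistral-7b-instruct-v0.1",
    input={"prompt": prompt},
)
print("".join(output))
```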
The model reads the JSON and responds with a short, natural-sounding weather report.
The ability to pass arbitrary textual information into an LLM prompt is part of what makes these models so interesting and exciting. You could include a small CSV table, passages from a book, code, quotes from inspiring philosophers, or even application state. Anything, really, as long as it is smaller than the language model’s maximum context length, which is 4,096 tokens for Llama 2.
Now that we're on the same page and understand that RAG will solve all of your problems 🤭, let's explore how to actually use it with a hands-on example.
As mentioned above, we’re going to build an example that converts titles, ideas, or phrases into Hacker News titles. To make this example a reality, we’ll tap into the wisdom of the crowd by indexing the titles of top Hacker News stories into a vector store, and making them queryable via embedding similarity search.
The idea is that inspiration from titles with a proven track record will help to create a better title suggestion.
The general flow goes something like this:

1. Index the titles of popular Hacker News stories into a vector store, using embeddings.
2. When you provide a working title, generate an embedding for it and query the vector store for the most similar popular titles.
3. Build a prompt that includes those related titles, and ask a language model to suggest new, click-worthy titles.
The first thing we need to do is create a dataset of Hacker News titles. Creating this sort of dataset from scratch is kind of annoying and surprisingly complicated, so I’ve done the heavy lifting of scraping the Hacker News API and distilling the results down to a dataset of 13,509 top stories, each of which received 100+ upvotes between January 2023 and early October 2023.
The dataset consists of `id`, `score`, `title`, `url`, and `time` fields.
You can download the data from the git repo for this blog post, here: 13509-hn-topstories-2023.jsonl
I won’t bore you with all of the details, but I have included the two scripts I hacked together to build this nice JSONL file. The first script scrapes the Hacker News API by story ID and saves all of the stories from a specific ID range to a SQLite database. The second script queries the SQLite database for stories with 100 or more upvotes and writes them to a `.jsonl` file.
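To give you a feel for that second step, here is a minimal sketch of what the filtering script might look like. The database path, table name, and column names are assumptions for illustration; the actual scripts live in the repo.

```python
import json
import sqlite3

# Assumed filenames, table name, and columns -- adjust to match the scraping script.
DB_PATH = "hn.db"
OUT_PATH = "13509-hn-topstories-2023.jsonl"

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(
    "SELECT id, score, title, url, time FROM stories WHERE score >= 100 ORDER BY id"
)

# Write one JSON object per line (JSONL).
with open(OUT_PATH, "w") as f:
    for story_id, score, title, url, time in rows:
        record = {"id": story_id, "score": score, "title": title, "url": url, "time": time}
        f.write(json.dumps(record) + "\n")

conn.close()
```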
For our purposes, we really only care about the `title` field, since that’s what we’ll be generating embeddings from, and all of the entries are “popular” because each has 100 or more upvotes.
Great, now that we have a JSONL dataset, we need to generate embeddings for each of the 13,509 titles and then load everything into a vector store.
But let’s break that down, because the whole idea behind embeddings and vector stores can be confusing.
Embedding vectors are numerical representations of content, formatted as a large list of floating point numbers. I like to think of embeddings as some sort of mind-boggling coordinate system. Sort of like how GPS maps have 2 coordinates known as latitude and longitude. Embeddings are kind of the same thing, except instead of 2 dimensions, there are 1024 dimensions 🤯.
When embedding vectors begin to cluster together, the documents they represent tend to be similar in semantic meaning, syntax, or style.
Don’t worry if this doesn’t make sense yet; we’ll help you visualize things later on in the post.
There are many embedding models to choose from. For this post, we will generate embeddings using the bge-large-en-v1.5 model on Replicate. This model was released in September 2023, and at the time of this writing it’s the top-performing embedding model on Hugging Face’s embedding model leaderboard.
We’ve put together a collection of embedding models for you to try on Replicate. These include other text embedding models such as all-mpnet-base-v2, image embedding models such as clip-features, and even multi-modal models like imagebind, which can generate embeddings for text, images, and audio in the same space.
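To make this concrete, here is a minimal sketch of generating a single embedding with the Replicate Python client. The model slug and input format are assumptions; check the model’s API page on Replicate for the exact name, version, and input schema.

```python
import json
import replicate

# Model slug and input key are assumptions -- consult the model's API page on Replicate.
output = replicate.run(
    "nateraw/bge-large-en-v1.5",
    input={"texts": json.dumps(["Show HN: Generate click-worthy Hacker News titles with RAG"])},
)

embedding = output[0]   # one embedding vector per input string
print(len(embedding))   # bge-large-en-v1.5 produces 1,024-dimensional vectors
```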
Vector stores are a special type of database where we can store and query embeddings, and their associated documents. Vector stores really shine when you are searching for things that are “semantically similar”. For example, if you wanted to search for documents similar to “cat with a hat,” you might find results related to cats, hats, or other animals wearing hats.
For this post, we will be using ChromaDB, an open-source vector store database. Our usage of Chroma is very basic, and you could easily swap in a variety of other vector store databases instead.
Some popular alternatives are Pinecone, Weaviate, and pgvector.
Before running any code, you’ll need to install a few Python dependencies.
In the following script, we perform a handful of operations:

- Initialize a persistent ChromaDB client that stores its data in the `./chromadb` directory. This is where the database files will live.
- Load the JSONL entries into a list of dictionaries.
- Generate embeddings for the titles in batches of 250, and insert them, along with their metadata, into a ChromaDB collection.

The script uses `tqdm` to display a progress bar, along with a time estimate. It takes ~5.5s to generate embeddings for and insert all of the metadata for 250 titles, so the whole process should take about 5 minutes.
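Here’s a rough sketch of what `index_hn_titles.py` might look like. The collection name, embedding model slug, and input format are assumptions; the script in the repo is the canonical version.

```python
import json
import chromadb
import replicate
from tqdm import tqdm

# Persistent ChromaDB client -- data is written to ./chromadb on disk.
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="hn-titles")  # collection name is an assumption

# Load the JSONL dataset into a list of dictionaries.
with open("13509-hn-topstories-2023.jsonl") as f:
    stories = [json.loads(line) for line in f]

BATCH_SIZE = 250
for start in tqdm(range(0, len(stories), BATCH_SIZE)):
    batch = stories[start : start + BATCH_SIZE]
    titles = [story["title"] for story in batch]

    # Embedding model slug and input format are assumptions -- check the Replicate model page.
    embeddings = replicate.run(
        "nateraw/bge-large-en-v1.5",
        input={"texts": json.dumps(titles)},
    )

    # Store embeddings alongside the title text and some useful metadata.
    collection.add(
        ids=[str(story["id"]) for story in batch],
        embeddings=embeddings,
        documents=titles,
        metadatas=[
            {"score": story["score"], "url": story.get("url") or "", "time": story["time"]}
            for story in batch
        ],
    )
```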
When you run the script with `python index_hn_titles.py`, you’ll find that the ChromaDB data is persisted to the `./chromadb` directory.
Awesome. Now that we have a populated vector store database, how can we verify that everything worked as expected? There are two ways I like to test out indexed embeddings.
First, and easiest, I like to actually just make a query with an example string, and see if the results look correct. Second, I like to visualize the embeddings in 2D or 3D space, to see if they are clustering together in a meaningful way.
In this script, we perform the following operations:

- Open the persistent ChromaDB collection we just populated.
- Generate an embedding for a test query string, using the same embedding model we used at indexing time.
- Query the collection for the most similar titles and print them.
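Here is a minimal sketch of that query step, under the same assumptions as the indexing sketch above (collection name, model slug, and input format):

```python
import json
import chromadb
import replicate

client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="hn-titles")

query = "how to create a sqlite extension"

# Embed the query with the same model used for indexing (slug/input format are assumptions).
query_embedding = replicate.run(
    "nateraw/bge-large-en-v1.5",
    input={"texts": json.dumps([query])},
)[0]

# Find the 10 most similar titles in the collection.
results = collection.query(query_embeddings=[query_embedding], n_results=10)
for title in results["documents"][0]:
    print(title)
```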
This code prints the following titles, which all seem roughly similar to the test query string "how to create a sqlite extension":
As a sidebar, I like to build intuition about complex subjects by thinking about them visually.
The easiest way to visualize what is happening inside of embedding space is to plot embedding vectors as if they are points on a 2D or 3D chart. To do this, we need to first project our embedding vectors from 1,024 dimensions down to 3 dimensions. We can use dimensionality reduction techniques like t-SNE (t-distributed stochastic neighbor embedding) or UMAP (uniform manifold approximation and projection) to perform this reduction down to 3D space, which we can then easily display on screen.
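If you want to run the reduction yourself in Python, a minimal sketch using scikit-learn’s t-SNE implementation might look like this (the collection name is the same assumption as in the indexing sketch above):

```python
import numpy as np
import chromadb
from sklearn.manifold import TSNE

client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="hn-titles")

# Pull every stored embedding out of the collection.
records = collection.get(include=["embeddings"])
embeddings = np.array(records["embeddings"])

# Project from 1,024 dimensions down to 3 so the points can be plotted.
projected = TSNE(n_components=3, init="random", random_state=42).fit_transform(embeddings)
print(projected.shape)  # (13509, 3)
```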
My favorite way to display this sort of data is to use the Tensorflow Projector tool to visualize and interact with embedding vectors in 3D space.
Even at first glance at the video below, you can see that the points are clumping together. When clustering like this occurs, it means that the clustered points are semantically similar to each other.
If there was no visual clumping, and it was just a massive mess of dots with no discernible pattern, it would suggest that something is not quite right with your embeddings.
The video below contains a visualization of the t-SNE dimension reduction process evolving into clusters over the course of a few hundred iterations. Towards the end of the video, you can see that when I click on a point that is about sqlite, all of the nearest neighbors are also about sqlite.
This always blows my mind, and feels like magic.
To use Tensorflow Projector, you need to convert your metadata and embeddings to a specific TSV format. I won’t cover how to do this step by step, but I’ve included a script in the git repo which translates an entire ChromaDB collection into two files (`embeddings.tsv` and `metadata.tsv`), which can be loaded into Tensorflow Projector and visualized in an interactive way.
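If you’re curious, the core of that conversion might look something like the sketch below. The metadata columns and collection name are assumptions; the script in the repo is the canonical version.

```python
import chromadb

client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="hn-titles")

records = collection.get(include=["embeddings", "documents", "metadatas"])

# embeddings.tsv: one row of tab-separated floats per embedding vector.
with open("embeddings.tsv", "w") as f:
    for vector in records["embeddings"]:
        f.write("\t".join(str(x) for x in vector) + "\n")

# metadata.tsv: one row per vector, in the same order, with a header row.
with open("metadata.tsv", "w") as f:
    f.write("title\tscore\n")
    for title, meta in zip(records["documents"], records["metadatas"]):
        clean_title = title.replace("\t", " ")
        f.write(clean_title + "\t" + str(meta["score"]) + "\n")
```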
Now we’ll bring it all together by passing in a working title, querying ChromaDB for 10 related popular titles, then prompting mistral-7b-instruct to suggest new titles that have been inspired by the 10 related popular titles.
I named this script `hnify.py`, so I’m able to run it with the following command: `python hnify.py "teaching Elixir to my toddler"`. The script will print out 5 suggested titles, followed by the entire text prompt that was sent to mistral-7b-instruct.
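Here is a rough sketch of what the core of `hnify.py` might look like. The prompt template is an illustration rather than the exact prompt from the repo, and the collection name and model slugs are the same assumptions as before.

```python
import json
import sys

import chromadb
import replicate

working_title = sys.argv[1]

client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="hn-titles")

# Embed the working title (embedding model slug and input format are assumptions).
query_embedding = replicate.run(
    "nateraw/bge-large-en-v1.5",
    input={"texts": json.dumps([working_title])},
)[0]

# Retrieve the 10 most similar popular titles from the vector store.
results = collection.query(query_embeddings=[query_embedding], n_results=10)
related = "\n".join(f"- {title}" for title in results["documents"][0])

# Build the RAG prompt (illustrative template, not the exact one from the repo).
prompt = f"""You write great Hacker News titles.

Here are 10 popular Hacker News titles that are related to my idea:
{related}

My working title is: "{working_title}"

Suggest 5 new click-worthy titles for my post, inspired by the popular titles above."""

# Model slug is an assumption -- check Replicate for the current mistral-7b-instruct version.
output = replicate.run("mistralai/mistral-7b-instruct-v0.1", input={"prompt": prompt})

print("".join(output))
print("\n--- full prompt ---\n")
print(prompt)
```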
When experimenting with new RAG prompts like this, it is always beneficial to inspect the fully populated prompt template. Reading the entire prompt will help you understand what the vector store is returning, and give you a feel for how your language model of choice is responding to the retrieval-sourced data.
Replicate makes iteratively tweaking prompts very easy. For example, after I ran this script I was able to find the exact prediction where the language model prompt was run in the Replicate dashboard. Here’s my link: https://replicate.com/p/aosuuqlb43zvxbsptfylyewknq
On that link you can click the “Replicate” button, and quickly test prompt changes in the text field on the page. This workflow makes it very fast to iterate on language model prompts while using realistic vector store context.
Here is what my entire prompt looks like:
Finally, the script prints out the following hackernews-ified titles:
These are pretty good! I’ve found that in general it tends to suggest “Unleashing”, “Revolutionizing”, and other marketing buzzwords too frequently. We’ll leave this as an exercise for the reader: what changes would you make to the prompt to make it less marketing-y?
That’s a wrap! It’s now time to post to Hacker News and see if our new retrieval augmented generation overlords have helped me get internet famous.