Collecting images enables us to customize powerful machine learning models in new and exciting ways. For example, some of the text-to-image models on Replicate can be steered using an existing image. This is great when we want to push a model toward a particular scene or aesthetic, but it requires that we have example images of our own.
I'm Clay, a member of LAION and of the team at Replicate. In this post, I'm going to show you how to use a pip package called clip-retrieval to collect hundreds of images (and captions) from the LAION-5B dataset. We'll look at how to collect images that either match a text description or have a similar style to some existing images.
clip-retrieval was developed by a fellow member of LAION, Romain Beaumont. It works by embedding the billions of images and captions in the LAION dataset with CLIP. Using the magic of k-NN and autofaiss, we can create an in-memory index over these embeddings with fairly fast retrieval times. If you're interested in how this works on a technical level, I recommend reading Romain's article "Semantic search with embeddings: index anything".
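Under the hood, the idea is roughly: embed everything with CLIP, build a faiss index over the embeddings, and answer queries with nearest-neighbor search. Here's a toy sketch of that idea, using random vectors in place of real CLIP embeddings and a plain flat index in place of the tuned index that autofaiss builds at LAION scale (an illustration, not the production pipeline):

import faiss
import numpy as np

dim = 768  # ViT-L/14 CLIP embeddings have 768 dimensions

# Stand-ins for CLIP image embeddings, normalized so inner product == cosine similarity
corpus = np.random.rand(10_000, dim).astype("float32")
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)  # exact inner-product search; autofaiss picks approximate indexes at scale
index.add(corpus)

# A stand-in for a CLIP text (or image) query embedding
query = np.random.rand(1, dim).astype("float32")
query /= np.linalg.norm(query, axis=1, keepdims=True)

scores, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids, scores)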
Let's get started by installing clip-retrieval:
pip install clip-retrieval
clip-retrieval lets us query pre-built CLIP faiss indexes using the clip_retrieval.clip_client.ClipClient class. By default, queries are sent to the free, hosted knn index over LAION-5B built by LAION-AI. We can set a custom num_images to return. Let's use 400 for now.
from clip_retrieval.clip_client import ClipClient, Modality

laion5b_search_client = ClipClient(
    url="https://knn5.laion.ai/knn-service",  # url may change, check github.com/rom1504/clip-retrieval
    indice_name="laion5B",
    num_images=400,
)
After getting set up, we can query the backend:
results = laion5b_search_client.query(text="fresh avocado, digital art")
The response will be a JSON array of results, each containing a caption, url, id, and similarity.
[
  {
    "caption": "Авокадо",
    "url": "https://t1.ftcdn.net/jpg/00/79/43/44/240_F_79434473_qNSi5WUEi8y3oFrwPjupQvxbUIzXY7mE.jpg",
    "id": 4540616960,
    "similarity": 0.5977489948272705
  }
  // ...more results
]
the second result for "fresh avocado, digital art"
Because the API de-duplicates results, we won't get exactly 400 back.

print(len(results))
321

But, hey - 321 isn't half bad!
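If you want the images themselves on disk rather than just URLs and captions, a simple loop over the results does the job. This is a minimal sketch; save_results and the laion_images directory are just illustrative names of my own, and some error handling is unavoidable because LAION points at the open web and some URLs go stale:

import os
import urllib.request

def save_results(results, out_dir="laion_images"):
    """Download each result's image and write its caption alongside it."""
    os.makedirs(out_dir, exist_ok=True)
    for i, result in enumerate(results):
        try:
            urllib.request.urlretrieve(result["url"], os.path.join(out_dir, f"{i:04d}.jpg"))
            with open(os.path.join(out_dir, f"{i:04d}.txt"), "w") as caption_file:
                caption_file.write(result["caption"])
        except Exception as error:
            print(f"Skipping {result['url']}: {error}")  # dead links are common in web-scraped data

save_results(results)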
I love to use this as a way of finding good init images for various text-to-image models. Init images guide a text-to-image model to produce different variations of your image, with some influence from the specified prompt. In some cases, using an init image can even make the model run faster (I also mentioned init images in my previous blog post).
We can use Replicate to explore the effect of init images easily. First, we get set up on Replicate:
pip install replicate
Grab your API token from your Replicate account page, then set it as an environment variable.
export REPLICATE_API_TOKEN=...
Now, we can run text-to-image models remotely! I use "afiaka87/glid-3-xl", a photorealistic image model that takes a prompt and an init_image argument. The init_image argument conveniently accepts URLs, so there's no need to download clip-retrieval results in advance. Let's use the first result from our search as an init image:
import replicate

model = replicate.models.get("afiaka87/glid-3-xl")
version = model.versions.get("d74db2a276065cf0d42fe9e2917219112ddf8c698f5d9acbe1cc353b58097dab")
text2image_generations = list(
    version.predict(
        prompt="fresh avocado, digital art",
        guidance_scale=10.0,
        batch_size=3,
        init_image="https://t1.ftcdn.net/jpg/00/79/43/44/240_F_79434473_qNSi5WUEi8y3oFrwPjupQvxbUIzXY7mE.jpg",
        steps=100,
        init_skip_fraction=0.5,
        seed=0,
    )
)[-1]  # grab the final generation - we don't need intermediate outputs
print(text2image_generations)
Another cool thing we can do with clip-retrieval is take an existing image and try to find images similar to it. For this, we will need CLIP. Let's load CLIP, along with some helper methods for converting torch tensors to the numpy arrays that clip-retrieval expects.
You can find similar examples and usage in the official clip_retrieval.clip_client notebook.
import clip
import torch

model, preprocess = clip.load("ViT-L/14", device="cpu", jit=True)

import io
import urllib.request

import numpy as np
from PIL import Image


def download_image(url):
    """Fetch an image over HTTP and return it as an in-memory byte stream."""
    urllib_request = urllib.request.Request(
        url,
        data=None,
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream


def get_image_emb(image_url):
    """Embed an image with CLIP and return a normalized float32 numpy vector."""
    with torch.no_grad():
        image = Image.open(download_image(image_url))
        image_emb = model.encode_image(preprocess(image).unsqueeze(0).to("cpu"))
        image_emb /= image_emb.norm(dim=-1, keepdim=True)
        image_emb = image_emb.cpu().detach().numpy().astype("float32")[0]
        return image_emb
With clip-retrieval, instead of using text as input and converting it into a text embedding, we can use an image as input and convert it into an image embedding.
Let's take this image of a model wearing a blue dress and find some similar images.
The input image is a photo of a woman wearing a blue dress.
blue_dress_image_emb = get_image_emb("https://rukminim1.flixcart.com/image/612/612/kv8fbm80/dress/b/5/n/xs-b165-royal-blue-babiva-fashion-original-imag86psku5pbx2g.jpeg?q=70")
blue_dress_results = laion5b_search_client.query(embedding_input=blue_dress_image_emb.tolist())
blue_dress_results
Again, the response will be a JSON array of results, each containing a caption, url, id, and similarity.
[
  {
    "caption": "8c7889e0b92b Cinderella Divine 1295 Long Chiffon Grecian Royal Blue Dress Mid Length Sleeves V Neck ...",
    "id": 2463946620,
    "similarity": 0.9428964853286743,
    "url": "https://cdn.shopify.com/s/files/1/1417/0920/products/1295cd-royal-blue_cfcbd4bc-ed74-47c0-8659-c1b8691990df.jpg?v=1527650905"
  },
  {
    "caption": "Classy V-Neck A-Line Floor Length Zipper-Up Mother Of the Bride Dress",
    "id": 717054383,
    "similarity": 0.9329575896263123,
    "url": "http://images.ericdress.com/Upload/Image/2014/44/270-360/0e842524-2bf0-44ef-be20-e9a6478db283.jpg"
  }
  // ...
]
The first result is "Cinderella Divine 1295 Long Chiffon Grecian Royal Blue Dress Mid Length Sleeves V Neck."
Querying LAION-5B with text and with images are just two of the things you can do with clip-retrieval.

One of the key benefits of deep learning is that, given enough data, we can scale or finetune models to improve general and/or task-specific ("downstream") performance. With clip-retrieval, finetuning models on data that you curate yourself is now within reach. We're writing a future blog post to show you how to finetune models of your own and run them on Replicate. Stay tuned!
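In the meantime, as a tiny taste of what that curation could look like, here's a sketch that writes the retrieved url/caption pairs to a JSONL file you could later feed to a finetuning pipeline. The filename and schema are placeholders of my own choosing, not a format any particular trainer requires:

import json

def write_jsonl(results, path="curated_dataset.jsonl"):
    """Persist retrieved url/caption/similarity records, one JSON object per line."""
    with open(path, "w") as f:
        for result in results:
            f.write(json.dumps({
                "url": result["url"],
                "caption": result["caption"],
                "similarity": result["similarity"],
            }) + "\n")

write_jsonl(results + blue_dress_results)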
There are absolutely other use cases, too. If you have any other cool ideas, reach out on Replicate's Discord. We'd love to hear them!