Automating image collection

Posted by @afiaka87

Collecting images enables us to customize powerful machine learning models in new and exciting ways. For example, some of the text-to-image models on Replicate can be steered using an existing image. This capability is great when we want to steer vision models toward a particular scene or aesthetic, but it requires that we have example images of our own.

I’m Clay, a member of LAION and of the team at Replicate. In this post, I’m going to show you how to use a pip package called clip-retrieval to collect hundreds of images (and captions) from the LAION-5B dataset. We’ll look at how to collect images that either match a text description or have a similar style to some existing images.

clip-retrieval was developed by a fellow member of LAION, Romain Beaumont. It works by embedding the billions of images and captions in the LAION dataset with CLIP. Using the magic of k-NN and autofaiss, we can create an in-memory index over these embeddings with fairly fast retrieval times. If you’re interested in how this works on a technical level, I recommend reading Romain’s article “Semantic search with embeddings: index anything”.
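To get a feel for what that index is doing, here's a minimal, self-contained sketch of k-NN search over CLIP-style embeddings with faiss. It uses random vectors in place of real CLIP embeddings and a flat index rather than the compressed indexes autofaiss builds, so treat it as an illustration of the idea rather than LAION's actual setup:

import faiss
import numpy as np

# Stand-ins for CLIP image embeddings (ViT-L/14 embeddings are 768-dimensional).
image_embeddings = np.random.rand(10_000, 768).astype("float32")
faiss.normalize_L2(image_embeddings)

# Inner product over L2-normalized vectors is cosine similarity.
index = faiss.IndexFlatIP(768)
index.add(image_embeddings)

# Query with a (normalized) embedding to get the 5 nearest images.
query = np.random.rand(1, 768).astype("float32")
faiss.normalize_L2(query)
similarities, indices = index.search(query, 5)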

Getting started

Let’s get started by installing clip-retrieval:

pip install clip-retrieval

Installing the package lets us query pre-built CLIP faiss indexes via the clip_retrieval.clip_client.ClipClient class. By default, queries are sent to the free, hosted knn index over LAION-5B built by LAION-AI.

We can set a custom num_images to return. Let’s use 400 for now.

from clip_retrieval.clip_client import ClipClient, Modality

laion5b_search_client = ClipClient(
    url="https://knn5.laion.ai/knn-service", # url may change, check github.com/rom1504/clip-retrieval
    indice_name="laion5B",
    num_images=400,
)
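ClipClient accepts a few other knobs, too (and the Modality import above hints at one of them). The parameter names below are from memory and may have drifted since this was written, so double-check them against help(ClipClient) or the clip-retrieval README before relying on them:

# Assumed parameter names - verify against help(ClipClient) or the README.
picky_client = ClipClient(
    url="https://knn5.laion.ai/knn-service",
    indice_name="laion5B",
    num_images=400,
    modality=Modality.IMAGE,  # search the image index rather than the text index
    aesthetic_score=9,        # bias results toward images with a high predicted aesthetic score
    aesthetic_weight=0.5,     # how strongly to apply that bias
    deduplicate=True,         # drop near-duplicate results
)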

Query LAION-5B with text

After getting set up, we can query the backend:

results = laion5b_search_client.query(text="fresh avocado, digital art")

The response will be a JSON array of results, each containing a caption, url, id, and similarity score.

[
  {
    "caption": "Авокадо",
    "url": "https://t1.ftcdn.net/jpg/00/79/43/44/240_F_79434473_qNSi5WUEi8y3oFrwPjupQvxbUIzXY7mE.jpg",
    "id": 4540616960,
    "similarity": 0.5977489948272705
  } // ...more results
]

[Image: the second result for "fresh avocado, digital art"]

Because the API de-duplicates results, we won’t get exactly 400 back.

print(len(results))
# 321

But, hey - 321 isn’t half bad!
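At this point the results are still just URLs and captions, so to actually collect the images we need to download them. Here's a quick sketch that writes the metadata to a JSONL file and pulls each image down with urllib; the file names are arbitrary choices of mine, and since some LAION URLs go stale, the try/except is doing real work:

import json
import os
import urllib.request

os.makedirs("avocado_images", exist_ok=True)

# Keep the captions and URLs alongside the images, one JSON object per line.
with open("avocado_results.jsonl", "w") as f:
    for result in results:
        f.write(json.dumps(result) + "\n")

# Download each image, skipping any URL that no longer resolves.
for i, result in enumerate(results):
    try:
        request = urllib.request.Request(result["url"], headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(request, timeout=10) as response:
            with open(f"avocado_images/{i:04d}.jpg", "wb") as image_file:
                image_file.write(response.read())
    except Exception as error:
        print(f"skipping {result['url']}: {error}")

For larger collections, Romain's img2dataset tool handles this at scale, but for a few hundred images plain urllib is fine.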

Get variations of your image with a text2image model

I love to use this as a way of finding good init images for various text-to-image models. Init images guide a text-to-image model to produce different variations of your image, with some influence from the specified prompt. In some cases, using an init image can even make the model run faster (I also mentioned init images in my previous blog post).

We can use Replicate to explore the effect of init images easily. First, we get set up on Replicate:

pip install replicate

Grab your API token from your Replicate account settings, then set it as an environment variable.

export REPLICATE_API_TOKEN=...

Now, we can run text-to-image models remotely! I use "afiaka87/glid-3-xl", a photorealistic image model that takes a prompt and an init_image argument. The init_image argument conveniently accepts URLs, so there's no need to download clip-retrieval results in advance. Let's use the first result from our search as an init image:

import replicate

model = replicate.models.get("afiaka87/glid-3-xl")
version = model.versions.get("d74db2a276065cf0d42fe9e2917219112ddf8c698f5d9acbe1cc353b58097dab")
text2image_generations = list(
    version.predict(
        prompt="fresh avocado, digital art",
        guidance_scale=10.0,
        batch_size=3,
        init_image="https://t1.ftcdn.net/jpg/00/79/43/44/240_F_79434473_qNSi5WUEi8y3oFrwPjupQvxbUIzXY7mE.jpg",
        steps=100,
        init_skip_fraction=0.5,
        seed=0,
    )
)[-1]  # grab the final generation - we don't need the intermediate outputs
print(text2image_generations)

[Images: generation 1, generation 2, generation 3]

Query LAION-5B with images

Another cool thing we can do with clip-retrieval is take an existing image and try to find images similar to it.

For this, we will need CLIP itself. Let's load CLIP, along with a couple of helper functions for downloading an image and converting its CLIP embedding from a torch tensor to the numpy array that clip-retrieval expects.

You can find similar examples and usage in the official clip_retrieval.clip_client notebook.

Load CLIP

import clip
import torch

# ViT-L/14 is the CLIP model used to build the LAION-5B index, so our embeddings will match it.
model, preprocess = clip.load("ViT-L/14", device="cpu", jit=True)

import urllib.request
import io
import numpy as np
from PIL import Image

def download_image(url):
    """Fetch an image over HTTP and return it as an in-memory byte stream."""
    urllib_request = urllib.request.Request(
        url,
        data=None,
    )
    with urllib.request.urlopen(urllib_request, timeout=10) as r:
        img_stream = io.BytesIO(r.read())
    return img_stream

def get_image_emb(image_url):
    """Download an image and return its L2-normalized CLIP embedding as a float32 numpy array."""
    with torch.no_grad():
        image = Image.open(download_image(image_url))
        image_emb = model.encode_image(preprocess(image).unsqueeze(0).to("cpu"))
        image_emb /= image_emb.norm(dim=-1, keepdim=True)
        image_emb = image_emb.cpu().detach().numpy().astype("float32")[0]
        return image_emb

Convert your image to a CLIP embedding and pass the embedding to clip-retrieval

Instead of using text as the input and converting it into a text embedding, we now use an image as the input and convert it into an image embedding.

Let’s take this image of a model wearing a blue dress and find some similar images.

[Image: the input image, a woman wearing a blue dress]

blue_dress_image_emb = get_image_emb("https://rukminim1.flixcart.com/image/612/612/kv8fbm80/dress/b/5/n/xs-b165-royal-blue-babiva-fashion-original-imag86psku5pbx2g.jpeg?q=70")
blue_dress_results = laion5b_search_client.query(embedding_input=blue_dress_image_emb.tolist())
print(blue_dress_results)

Again, the response will be a JSON array of results, each containing a caption, url, id, and similarity score.

[
  {
    "caption": "8c7889e0b92b Cinderella Divine 1295 Long Chiffon Grecian Royal Blue Dress Mid Length  Sleeves V Neck ...",
    "id": 2463946620,
    "similarity": 0.9428964853286743,
    "url": "https://cdn.shopify.com/s/files/1/1417/0920/products/1295cd-royal-blue_cfcbd4bc-ed74-47c0-8659-c1b8691990df.jpg?v=1527650905"
  },
  {
    "caption": "Classy V-Neck A-Line Floor Length Zipper-Up Mother Of the Bride Dress",
    "id": 717054383,
    "similarity": 0.9329575896263123,
    "url": "http://images.ericdress.com/Upload/Image/2014/44/270-360/0e842524-2bf0-44ef-be20-e9a6478db283.jpg"
  }
  // ...
]

[Image: the first result, "Cinderella Divine 1295 Long Chiffon Grecian Royal Blue Dress Mid Length Sleeves V Neck"]

Final thoughts

Querying LAION-5B with text and images is just a small sample of what you can do with clip-retrieval.
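For example, if I'm remembering the client API correctly, you can skip the manual CLIP step entirely and hand query an image path or URL, letting the backend compute the embedding for you; treat the argument name as an assumption and check the clip-retrieval README:

# Assumed usage - verify the argument name against the clip-retrieval README.
blue_dress_results = laion5b_search_client.query(
    image="https://rukminim1.flixcart.com/image/612/612/kv8fbm80/dress/b/5/n/xs-b165-royal-blue-babiva-fashion-original-imag86psku5pbx2g.jpeg?q=70"
)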

One of the key benefits of deep learning is that, given enough data, we can scale or finetune models to improve general and/or task-specific (“downstream”) performance. With clip-retrieval, finetuning models with data that you curate is now possible. We’re writing a future blog post to show you how to finetune models of your own and run them on Replicate. Stay tuned!

There are plenty of other use cases, too. If you have any cool ideas, reach out on Replicate's Discord. We'd love to hear them!