Exploring text-to-image models

Posted by @afiaka87 and @rossjillian

I’m Clay, a member of the team at Replicate. In this post, I’ll show you how Replicate makes it easy to explore many open text-to-image models.

You can follow along by downloading the accompanying Jupyter notebook here.

If you’re running the notebook in Colab, it’s recommended to use Firefox or Chrome.

Install

It’s wise to use a virtual environment to keep your global Python installation clean. venv or conda will work fine:

python3 -m venv replicate_venv
source replicate_venv/bin/activate

or

conda create --name replicate_venv python
conda activate replicate_venv

In whatever environment you choose, install Replicate’s Python client.

(replicate_venv) % pip install replicate

Login

To use the API, you’ll need an API access token. You can get a token by subscribing to Replicate. Then, you’ll be able to log in to Replicate using your API token each time you need to run Python.

You should never store your API token directly in a Python file or notebook - this would enable others to gain unauthorized access to your account. Instead, it is recommended to set the REPLICATE_API_TOKEN environment variable in your shell prior to running Python:

(replicate_venv) % REPLICATE_API_TOKEN="..." python run_text2image_model.py

If you’re in a Jupyter notebook, you can use getpass to receive user input beneath a cell without displaying it.
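For example, here’s a minimal sketch of that pattern (the Python client reads the REPLICATE_API_TOKEN environment variable automatically):

import os
from getpass import getpass

# Prompt for the token without echoing it to the notebook output.
os.environ["REPLICATE_API_TOKEN"] = getpass("Paste your Replicate API token: ")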

Assuming everything worked, you should be able to import the replicate module now. In our accompanying notebook, we also include pathlib.Path, which is needed for some inputs.

import replicate
from pathlib import Path

Generate an image from text

Using a few lines of Python, you can programmatically generate an image from text.

Replicate allows us to look up models by name, in the form “{username}/{model_name}”, with replicate.models.get.

For the example in our notebook, we’ll use “afiaka87/glid-3-xl”, a great model for generating photorealistic images. For fun, let’s generate an image of an avocado lightbulb!

model = replicate.models.get("afiaka87/glid-3-xl")
version = model.versions.get("d74db2a276065cf0d42fe9e2917219112ddf8c698f5d9acbe1cc353b58097dab")

Models on Replicate are run using the .predict method. Let’s take a quick look at its keyword arguments.

The keyword arguments for each model will vary. glid-3-xl requires one input, prompt - a description of the scene you would like to visualize.

We also set the seed to 0. Setting a manual seed encourages the model to return the exact same output for a given set of inputs; otherwise, a seed is chosen randomly. Outputs may still differ slightly, but managing the seed is generally a good idea, and many models on Replicate support it.

prediction_generator = version.predict(prompt="an image of a fresh avocado in the form of a lightbulb", seed=0)

Calling .predict initializes the model, but does not queue it to be run on Replicate. To run the model, iterate over the generator.

generated_image_batches = list(prediction_generator)
final_image_batch = generated_image_batches[-1] # ["https://...",]
print(final_image_batch)

Because we are only interested in the final, finished output the model returns, we can cast the generator to a list and grab the last (-1) element.

The final batch is a list of URLs whose size is determined by batch_size (1 by default).
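In a notebook, you can render the result inline. Here’s a quick sketch using IPython.display, which ships with Jupyter and Colab:

from IPython.display import Image, display

# Display the generated image from its URL inside the notebook.
display(Image(url=final_image_batch[0]))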

generation for 'an image of a fresh avocado in the form of a lightbulb'

Enhance an image

The API opens up lots of possibilities, like passing the output from one model as the input to another. A common example of this is upscaling, where an image generated by one model is piped into a super-resolution model to enlarge it.

We’ll use “raoumer/srrescgan” to upscale our image of an avocado lightbulb, but there are many upscaling models on Replicate that you can explore.

generation_to_enhance = Path(final_image_batch[0]) # There's only one URL in the list by default.
upscaling_model_api = replicate.models.get("raoumer/srrescgan")
high_res_outputs = upscaling_model_api.predict(image=generation_to_enhance)
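To keep a local copy, you can download the result with the standard library. This is a minimal sketch that assumes srrescgan returns a single image URL (models that stream intermediate outputs return a generator instead, in which case take the last element first) and uses a hypothetical filename:

import urllib.request

# Assumes high_res_outputs is a single image URL.
urllib.request.urlretrieve(high_res_outputs, "avocado_lightbulb_upscaled.png")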

an image generated by glid-3-xl is then upscaled using another super-resolution model

Create variations of an image

Some text-to-image models allow you to pass in an existing image, called an init image. This produces different variations of your image, with some influence from the specified prompt.

We’ll use “laion-ai/ongo”, a version of glid-3-xl finetuned on WikiArt.

You’ll need an image to create variations of. We’ll use ongo to vary the image of this farmhouse:

farmhouse as init image

Image inputs to Replicate may be a URL or local path.

init_image = Path("https://replicate.com/static/blog/exploring-text-to-image-models/farmhouse.jpeg")
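If you have the image on disk instead, point the Path at the local file (a sketch, assuming farmhouse.jpeg has been downloaded to the working directory):

# Local-file variant: pass a path to an image on disk.
init_image = Path("farmhouse.jpeg")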

It can be valuable to tweak various settings to improve the output. ongo accepts the following arguments:

  • init_image: A pathlib.Path “initial image” to mix with the generation, causing the model to take influence from the provided image in addition to the specified prompt. Can also be a URL (cast as a Path).
  • guidance_scale: Determines how strongly the generation is guided by your text.
  • batch_size: Integer from 1 to 12. How many variations to produce. Low batch sizes are much faster than high ones.
  • steps: Integer from 30 to 250. Number of discrete timesteps to run the model for. When using an init_image, the actual number of timesteps will be steps * init_skip_fraction (half as many by default). Increasing it improves quality at the cost of speed.
  • init_skip_fraction: Decimal from 0.0 to 1.0; 0.5 by default. How much influence your image will have on the generation: 0.0 uses almost none, while 1.0 simply encodes your image without influence from the model.

When in doubt, you can simply remove an argument and its default will be used instead.

model = replicate.models.get("laion-ai/ongo")
version = model.versions.get("1b3cd15121ec450baa71bbbdacddef9217519f12ca12ccfef36eeaa20ad89b9d")
ongo_variation_generator = version.predict(
    prompt="professional painting of a red lakehouse in the style of monet",
    guidance_scale=10.0, # 1.0 - 100.0
    steps=250, # 30-250
    init_skip_fraction=0.35, # 0.0 - 1.0
    batch_size=3, # 1 - 12
    init_image=Path("https://replicate.com/static/blog/exploring-text-to-image-models/farmhouse.jpeg"),
    seed=0,
)

Recall that .predict returns a generator. To start the prediction, iterate over it to get your final batched output URLs. We are only interested in the last batch.

ongo_variations_final_batch = list(ongo_variation_generator)[-1]

Using a batch size of 3 means we should get back 3 image URLs.

number_of_variations = len(ongo_variations_final_batch) # should equal batch_size
print(ongo_variations_final_batch) # should be a list of URLs
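To keep local copies of the variations, you could download each URL in a loop (a minimal sketch; the filenames are hypothetical):

import urllib.request

# Save each variation to a numbered file on disk.
for i, url in enumerate(ongo_variations_final_batch):
    urllib.request.urlretrieve(url, f"lakehouse_variation_{i}.png")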

The output of our inference is a series of beautiful lakehouses in the style of the original farmhouse image!

output from ongo of a farmhouse turned lakehouse

Explore

Lacking inspiration? Model not outputting what you want? Text-to-image models can be great at some things but completely fail at others. Getting them to perform the way you want without updating the weights of the network is sometimes referred to as prompt engineering. Prompt engineering is pretty difficult: we’ll be releasing a blog post about our experiences with it soon.

There is a lot of opportunity for creative uses of the API. If you have any other creative ways to access models on Replicate, feel free to share on our Discord!