This is a quick demo of a way to "see" latents using the CompVis latent-diffusion model. The uploaded image is resized to 256x256 and then encoded, which produces a 4-channel 32x32 tensor of latents representing it. This tensor (or, as is done here, its mean) can be rendered either as a single RGBA image or as 4 monochrome images, which are then upsampled back to 256x256 using simple nearest-neighbor interpolation. Much of the image's structure survives this approach, which may offer some insight into latent space.
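The rendering step described above can be sketched roughly as follows. This is not the demo's actual code: it is a minimal NumPy illustration that assumes the latents arrive as an array of shape (4, 32, 32), uses random values as a stand-in for a real encoder output, and assumes a per-channel min-max rescale to 0..255 (the demo may normalize differently). The function name `latents_to_images` and the `scale` parameter are hypothetical.

```python
import numpy as np

def latents_to_images(latents, scale=8):
    """Map a (4, H, W) latent array to one RGBA image and four grayscale images."""
    latents = np.asarray(latents, dtype=np.float32)
    if latents.ndim != 3 or latents.shape[0] != 4:
        raise ValueError("expected an array of shape (4, H, W)")

    # Assumed normalization: rescale each channel independently to 0..255.
    lo = latents.min(axis=(1, 2), keepdims=True)
    hi = latents.max(axis=(1, 2), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero on flat channels
    pixels = np.round((latents - lo) / span * 255.0).astype(np.uint8)

    # Nearest-neighbor upsampling: repeat every pixel `scale` times along both axes.
    upsampled = pixels.repeat(scale, axis=1).repeat(scale, axis=2)

    rgba = np.transpose(upsampled, (1, 2, 0))  # channels last: (H*scale, W*scale, 4)
    mono = [upsampled[c] for c in range(4)]    # one grayscale image per latent channel
    return rgba, mono

# Stand-in for real encoder output: random values with the shape described above.
rng = np.random.default_rng(0)
fake_latents = rng.normal(size=(4, 32, 32))
rgba, mono = latents_to_images(fake_latents)
print(rgba.shape, len(mono), mono[0].shape)  # (256, 256, 4) 4 (256, 256)
```

With a scale factor of 8, each latent value becomes an 8x8 block of identical pixels, which is why the 32x32 grid stays visible in the 256x256 output. The resulting uint8 arrays could then be handed to an image library for saving or display.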
Run time and cost
Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 4 minutes. The predict time for this model varies significantly based on the inputs.