lucataco/emu3.5-image

Native multimodal models are world learners

Run time and cost

This model runs on Nvidia H100 GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Emu 3.5 Image

Generate images from text prompts or transform existing images with BAAI’s 34-billion-parameter model. Emu 3.5 Image is particularly good at rendering readable text in images and creating photorealistic scenes.

What it does

Emu 3.5 Image handles two main tasks:

Text-to-image: Give it a text prompt and it’ll generate an image. It’s especially strong at creating images that include text elements like signs, posters, or product labels where the text actually needs to be readable.

Any-to-image: Feed it an image along with instructions, and it’ll transform or edit that image. You can change materials, adjust lighting, swap backgrounds, or make other modifications while keeping the overall composition intact.
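If you want to call the model programmatically, here’s a minimal sketch of both tasks using the Replicate Python client. The input field names ("prompt" and "image") are assumptions based on typical Replicate image models, so check this model’s API schema for the exact names.

```python
# Minimal sketch with the Replicate Python client (pip install replicate).
# Input field names are assumed; verify them against the model's API schema.
import replicate

# Text-to-image: generate an image from a prompt.
poster = replicate.run(
    "lucataco/emu3.5-image",
    input={
        "prompt": 'Retro travel poster with the headline: "Visit the Coast", '
                  "bold sans-serif type, flat illustration, warm palette",
    },
)

# Any-to-image: transform an existing image according to instructions.
edited = replicate.run(
    "lucataco/emu3.5-image",
    input={
        "prompt": "Replace the background with a soft studio gradient, "
                  "keep the subject and lighting unchanged",
        "image": "https://example.com/product-shot.jpg",  # hypothetical input image URL
    },
)
```

Depending on the client version, the return value is a URL, a list of URLs, or file-like output objects you can read and save.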

Why it’s different

Most image generation models struggle with text. If you ask them to create a poster with specific words, you often get garbled letters that look almost right but aren’t actually readable. Emu 3.5 Image uses a technique that treats text as distinct elements rather than just part of the visual noise, so the typography comes out much cleaner.

The model was trained on over 10 trillion tokens from internet videos, which helps it understand how real-world scenes work. This shows up in the results: lighting looks natural, materials have realistic textures, and human subjects don’t have that plastic, over-processed look you get from some models.

What it’s good at

Typography and text: Create images with readable headlines, signs, labels, and other text elements. While it’s not perfect for small or complex typography, it handles poster-sized text much better than most alternatives.

Natural lighting: The model understands how light behaves, so you get believable shadows, reflections, and atmospheric effects. Great for product shots or portraits where lighting matters.

Photorealistic portraits: Skin tones look natural, eyes have good detail, and hands usually render correctly at normal framing distances. The model was trained on diverse examples, so it handles different skin tones well under mixed lighting.

Material accuracy: Glass, metal, fabric, and other materials look convincing. Wet surfaces, reflections, and textures come through without looking overly shiny or artificial.

Style flexibility: The model can swing from photorealistic to painterly styles. Ask for gouache, watercolor, or specific artistic techniques and it’ll adjust accordingly.

Tips for better results

Be specific about what you want: Instead of “a coffee mug,” try “ceramic mug on walnut desk, soft daylight, sharp focus, clean.” The model responds well to structured prompts that describe the subject, style, lighting, and any specific qualities you need.

For text in images: Put the actual text you want in quotes or call it out clearly. “Poster with headline: Fresh Start” works better than “poster about fresh starts.” Keep in mind the model works best with headline-sized text rather than paragraphs of small copy.

Lighting matters: Specify your lighting conditions. “Window light,” “golden hour,” “soft diffused light,” or “rim-lit with volumetric fog” all push the model toward different looks.

When editing images: Be clear about what should change and what should stay the same. “Replace background with soft gradient, keep subject lighting” gives the model better direction than “improve this image.”

Aspect ratios: The model defaults to square images (1024×1024), but you can adjust for different use cases. Wider ratios work well for banners, vertical for mobile content.
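To make these tips concrete, here’s one way a structured prompt, quoted headline text, explicit lighting, and a non-square output might come together in a single request. The aspect_ratio parameter name is an assumption (some models expose width and height instead), so verify it against the model’s input schema.

```python
import replicate

# Structured prompt: subject, style, lighting, and the exact text in quotes.
prompt = (
    'Coffee shop poster with headline: "Fresh Start", '
    "ceramic mug on walnut counter, soft window light, "
    "sharp focus, clean minimalist layout"
)

output = replicate.run(
    "lucataco/emu3.5-image",
    input={
        "prompt": prompt,
        # Assumed parameter name; a wider ratio suits banner-style layouts.
        "aspect_ratio": "16:9",
    },
)
```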

What to expect

The model generates images at 1024×1024 resolution by default. Generation typically takes several seconds depending on complexity. You’ll get clean, coherent results that often need minimal touch-ups for professional use.

For portraits and everyday scenes, the model passes what some call the “squint test”—at a glance, the image feels like a real photo. Fine details like hair, skin texture, and background elements render naturally without excessive sharpening.

Text rendering is better than most models, but it’s not perfect for every use case. Big, bold text on signs or posters usually works well. Tiny text on far-away objects or very complex multi-line layouts can still wobble. If you need pixel-perfect typography for a final design, use the model to get your layout and mood right, then add the final text in your design tool.

The technical bit

Emu 3.5 Image comes from BAAI (Beijing Academy of Artificial Intelligence). It’s built on a 34-billion-parameter transformer architecture and uses a technique called Discrete Diffusion Adaptation, which speeds up image generation by about 20 times compared to standard token-by-token autoregressive decoding.

The model was trained end-to-end on over 10 trillion multimodal tokens, primarily from internet videos and their transcripts. This training approach helps it understand temporal consistency, how scenes evolve, and how real-world physics works—which translates to more believable generated images.

Try it yourself

You can run BAAI’s Emu 3.5 Image model on the Replicate Playground at replicate.com/playground.
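If you’d rather script it than use the Playground, a rough end-to-end sketch looks like this. It assumes the replicate Python package is installed and REPLICATE_API_TOKEN is set in your environment; the output-handling code just covers the client’s common return types (URL strings or file-like objects).

```python
# pip install replicate   (and set REPLICATE_API_TOKEN in your environment)
import urllib.request

import replicate

output = replicate.run(
    "lucataco/emu3.5-image",
    input={"prompt": 'Storefront sign reading "Open Late", neon glow, night street'},
)

# Depending on the client version, the result may be a URL string, a list of
# URLs, or file-like output objects; handle the common cases and save to disk.
items = output if isinstance(output, (list, tuple)) else [output]
for i, item in enumerate(items):
    data = item.read() if hasattr(item, "read") else urllib.request.urlopen(str(item)).read()
    with open(f"emu35_output_{i}.png", "wb") as f:
        f.write(data)
```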