You can fine-tune FLUX on Replicate with your own data. We've made running fine-tunes on Replicate much faster, and the optimizations are open-source.
This builds upon our work from last month, where we made the FLUX base models much faster.
Running a fine-tune is now the same speed as running the base model.
The first time you run a fine-tune, it takes a bit of time to load the model, usually about 2.5 seconds. Once it's been loaded, we'll attempt to route your requests to an instance that already has it loaded, so it runs as fast as the base model.
To enable all optimizations, pass `go_fast=true` to your prediction. If you omit the `go_fast` option, it will still be twice as fast as it was before, with no effect on output quality.
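Here's what that looks like with the Replicate Python client. The model name below is a placeholder for your own fine-tune, and the prompt is just an example:

```python
import replicate

# Run a FLUX fine-tune with all optimizations enabled.
# Replace the model name with your own fine-tune.
output = replicate.run(
    "your-username/your-flux-fine-tune",
    input={
        "prompt": "a photo of TOK riding a bicycle",
        "go_fast": True,  # omit this and predictions still run ~2x faster than before
    },
)
print(output)
```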
All models, both existing and future, get these optimizations automatically.
If you’re not using Replicate to fine-tune models, we’ve also added support for loading LoRAs from Hugging Face, Civitai, and arbitrary HTTP URLs.
Just pass a Hugging Face, Civitai, or HTTP URL to the `lora_weights` input in the new LoRA versions of FLUX.
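For example, with the Replicate Python client (the LoRA model name and weights reference here are illustrative, so substitute your own):

```python
import replicate

# Run a FLUX LoRA model with weights hosted elsewhere.
# The lora_weights value can be a Hugging Face repo, a Civitai URL,
# or a direct HTTPS link to a .safetensors file.
output = replicate.run(
    "black-forest-labs/flux-dev-lora",
    input={
        "prompt": "a watercolor painting of a lighthouse",
        "lora_weights": "huggingface.co/alvdansen/frosting_lane_flux",
        "go_fast": True,
    },
)
```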
Most of the models on Replicate are contributed by our community, but we maintain the FLUX models in collaboration with Black Forest Labs.
We optimized the base models using Alex Redden’s flux-fp8-api as a starting point, then sped it up further with `torch.compile` and the fast cuDNN attention kernels in nightly PyTorch builds. For more details, take a look at our blog post about optimizing the base models.
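As a rough sketch of that recipe (this isn't our production code, and `transformer` stands in for the FLUX transformer module):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel


def optimize(transformer: torch.nn.Module) -> torch.nn.Module:
    # Move to the GPU in bf16 and compile the forward pass.
    transformer = transformer.to(torch.bfloat16).cuda().eval()
    return torch.compile(transformer, dynamic=False)


def denoise_step(transformer: torch.nn.Module, *inputs):
    # Prefer the cuDNN scaled-dot-product attention kernel,
    # which ships in recent PyTorch nightly builds.
    with sdpa_kernel(SDPBackend.CUDNN_ATTENTION), torch.no_grad():
        return transformer(*inputs)
```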
Fine-tunes on Replicate are represented as LoRAs. We quantize the LoRA to fp8, then merge the weights into the base model. We also automatically increase the `lora_scale` input by 1.5x when `go_fast=true`, because we’ve found that produces better output. You might want to play around with this too.
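To make the idea concrete, here's a simplified, hypothetical merge for a single weight matrix; the real pipeline applies flux-fp8-api's quantization across the whole model rather than this toy fp8 round-trip:

```python
import torch


def merge_lora(base_weight, lora_A, lora_B, lora_scale, go_fast=True):
    # Bump the user-supplied lora_scale by 1.5x in fast mode,
    # since that tends to produce better output.
    if go_fast:
        lora_scale *= 1.5

    # Low-rank update: delta_W = scale * (B @ A).
    delta = lora_scale * (lora_B @ lora_A)

    # Quantize the LoRA delta to fp8 (e4m3), then fold it
    # back into the base weight.
    delta = delta.to(torch.float8_e4m3fn).to(base_weight.dtype)

    return base_weight + delta
```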
The quantization in flux-fp8-api slightly changes the output, but we have found it has little impact on the quality.
We want to be open with you about how we’re optimizing the models. It’s notoriously hard to compare output between models and providers, and it's often unclear whether providers are doing things that impact the quality of the model. We’re just going to tell you how we did it and let you disable any optimizations.
Open-source models are often slow out of the box. Model providers then optimize these models to make them fast and release them behind proprietary APIs, without contributing the improvements back to the community.
We want to change that. We think open-source should be fast too.
We’re open-sourcing all the improvements we make to FLUX. Read more on our blog post about making the base models fast.
This makes running fine-tuned models faster, but there is still work to be done to make the training process fast. Major improvements to training speed are coming next.
New techniques for making models faster are coming out all the time, and because we collaborate with the community, you can be sure they’ll land on Replicate as soon as possible. Stay tuned.
Follow us on X to keep up to speed.