LoRA Pivotal Tuning Inversion Training

Model description

There are many ways to fine-tune Stable Diffusion models. This one uses Low-Rank Adaptation (LoRA) with pivotal tuning inversion to achieve highly editable, efficient fine-tuning. The output models can be used with Replicate’s LoRA for inference.

If you don’t want to set all of the hyperparameters yourself, you can use https://replicate.com/cloneofsimo/lora-training which has presets for faces, objects, and styles.

Ethical considerations

Do not use this model to produce harmful results. Because this method builds directly on Stable Diffusion, the ethical considerations addressed in the BigScience OpenRAIL-M license apply here as well.

Caveats and recommendations

Use a large, diverse, high-quality dataset. Blur, noise, and artifacts will negatively affect training. Varied lighting conditions, shapes, angles, and sizes help considerably.
Images are resized and cropped to 512 x 512 by default, so it is recommended to prepare images larger than 512 x 512.
The face template requires every input image to contain exactly one human face. For example, it will not work with animal faces or with character faces that look far from human.
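
Once the images are selected, they have to be packed into the ZIP file that the instance_data argument expects. The following is a minimal sketch of that packaging step using only the Python standard library. The function name zip_training_images, the accepted extension set, and the choice of a flat archive layout are assumptions made for illustration, not requirements stated by this model.

```python
import zipfile
from pathlib import Path

# Assumed extension set; the documentation only says "JPG, PNG, etc."
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def zip_training_images(image_dir, zip_path):
    """Pack the image files found directly in image_dir into a flat ZIP archive.

    Returns the list of file names that were written, in sorted order.
    """
    image_dir = Path(image_dir)
    images = sorted(
        p for p in image_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    if not images:
        raise ValueError(f"no images found in {image_dir}")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for image in images:
            # arcname stores only the file name, so the archive has no folders.
            archive.write(image, arcname=image.name)
    return [image.name for image in images]
```

Calling zip_training_images("my_photos", "data.zip") would produce an archive that can be uploaded as instance_data. The sketch does not check resolution or image quality, so the recommendations above still have to be verified by hand.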

Advanced Argument Documentation

instance_data: A ZIP file containing your training images (JPG, PNG, etc.; size is not restricted). These images contain the ‘subject’ that you want the trained model to learn, so that it can later generate customized scenes beyond the training images. For best results, use images without noise or unrelated objects in the background. (Type: Path, Default: None)
seed: A seed for reproducible training (Type: int, Default: 1337)
resolution: The resolution for input images. All the images in the train/validation dataset will be resized to this resolution. (Type: int, Default: 512)
train_text_encoder: Whether to train the text encoder. (Type: bool, Default: True)
train_batch_size: Batch size (per device) for the training dataloader. (Type: int, Default: 1)
gradient_accumulation_steps: Number of update steps to accumulate before performing a backward/update pass. (Type: int, Default: 4)
gradient_checkpointing: Whether or not to use gradient checkpointing to save memory at the expense of a slower backward pass. (Type: bool, Default: False)
scale_lr: Scale the learning rate by the number of GPUs, gradient accumulation steps, and batch size. (Type: bool, Default: True)
lr_scheduler: The scheduler type to use. (Type: str, Choices: [“linear”, “cosine”, “cosine_with_restarts”, “polynomial”, “constant”, “constant_with_warmup”], Default: “constant”)
lr_warmup_steps: Number of steps for the warmup in the lr scheduler. (Type: int, Default: 0)
clip_ti_decay: Whether or not to apply the Bayesian Learning Rule to the norm of the CLIP latent. (Type: bool, Default: True)
color_jitter: Whether or not to use color jitter for augmentation. (Type: bool, Default: True)
continue_inversion: Whether or not to continue inversion. (Type: bool, Default: False)
continue_inversion_lr: The learning rate for continuing an inversion. (Type: float, Default: 1e-4)
initializer_tokens: The tokens to use for the initializer. If not provided, initialization is random, drawn from a Gaussian N(0, 0.017^2).
learning_rate_text, learning_rate_ti, learning_rate_unet: Learning rates for the text encoder, the textual embedding, and the UNet, respectively. Recommended values: 1e-5, 5e-4, 1e-4.
lora_rank: Rank of the LoRA. A larger rank is more likely to capture fidelity but less likely to be editable, and it also makes the end result larger. (Type: int, Default: 4)
lora_dropout_p: Dropout for the LoRA layer. Reference [1] (Type: float, Default: 0.1)
lora_scale: Scaling parameter at the end of the LoRA layer. Reference [1] (Type: float, Default: 1.0)
lr_scheduler_lora: LR scheduler for LoRA. (Type: str, Choices: [“linear”, “cosine”, “cosine_with_restarts”, “polynomial”, “constant”, “constant_with_warmup”], Default: “constant”)

lr_warmup_steps_lora: Number of steps for the warmup in the LR scheduler. (Type: int, Default: 0)
max_train_steps_ti: The maximum number of training steps for the textual inversion (TI) stage. (Type: int, Default: 500)
max_train_steps_tuning: The maximum number of training steps for the tuning stage. (Type: int, Default: 1000)
placeholder_token_at_data: If this value is provided as “X|Y”, the target word X is replaced with Y in each caption. Captions must be provided as the image filenames (ignoring the extension), and Y must contain the placeholder token described below. To use this feature, you must also set the use_template argument to None. (Type: str, Default: None)
placeholder_tokens: The placeholder tokens to use for the initializer. (Type: str, Default: “<s1>|<s2>”)
use_face_segmentation_condition: Whether or not to use the face segmentation condition. (Type: bool, Default: False)
use_template: The template to use for the inversion. (Type: str, Choices: [“object”, “style”, “none”], Default: “object”)
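
To make the placeholder_token_at_data behaviour more concrete, here is a minimal sketch of how a caption could be derived from a file name when use_template is disabled. The helper name caption_from_filename is hypothetical, and the plain string replacement is an assumption about the transformation, not the trainer's actual code.

```python
from pathlib import Path


def caption_from_filename(filename, placeholder_token_at_data):
    """Build a caption from a file name and swap target word X for Y.

    placeholder_token_at_data uses the "X|Y" form described above.
    """
    target, replacement = placeholder_token_at_data.split("|", 1)
    caption = Path(filename).stem  # file name without its extension
    # Assumption: every occurrence of X is replaced by Y, as plain text.
    return caption.replace(target, replacement)


print(caption_from_filename("a dog on the beach.jpg", "dog|<s1>"))
# prints: a <s1> on the beach
```

With this scheme the file name carries the whole prompt, so each training image can describe its own scene while still pointing at the learned token.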

weight_decay_lora: The weight decay for the LoRA loss. (Type: float, Default: 0.001)
weight_decay_ti: The weight decay for the TI. (Type: float, Default: 0.00)
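
The scale_lr option listed above multiplies the learning rate by the number of GPUs, the gradient accumulation steps, and the batch size. The sketch below shows that arithmetic with the documented defaults. The function name effective_learning_rate is hypothetical, and whether the trainer applies the same factor to all three learning rates is not stated here, so treat this as an illustration only.

```python
def effective_learning_rate(base_lr, num_gpus=1, gradient_accumulation_steps=4,
                            train_batch_size=1, scale_lr=True):
    """Return the learning rate after the scaling described for scale_lr."""
    if not scale_lr:
        return base_lr
    return base_lr * num_gpus * gradient_accumulation_steps * train_batch_size


# Recommended UNet learning rate with the default settings on one GPU:
# 1e-4 * 1 GPU * 4 accumulation steps * batch size 1
print(effective_learning_rate(1e-4))  # prints: 0.0004
```

Because the defaults already scale the rate by a factor of 4, it may be worth lowering the base learning rate, or setting scale_lr to False, when raising train_batch_size or gradient_accumulation_steps.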

[1]: Hu, Edward J., et al. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv preprint arXiv:2106.09685 (2021).
