camenduru / story-diffusion

StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation

  • Public
  • 1K runs
  • L40S
  • GitHub
  • Paper
  • License

Input

  • input_image (file, required): reference image for the character
  • Style template (string): '(No style)', 'Japanese Anime', 'Cinematic', 'Disney Character', 'Photographic', 'Comic book', or 'Line art'. Default: "Japanese Anime"
  • Style strength of Ref Image (%) (integer). Default: 20
  • Control type of the Character (string). Default: "Using Ref Images"
  • Textual Description for Character (string). Default: "a woman img, wearing a white T-shirt, blue loose hair"
  • Negative Prompt (string). Default: "bad anatomy, bad hands, missing fingers, extra fingers, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, three crus, fused feet, fused thigh, extra crus, ugly fingers, horn, cartoon, cg, 3d, unreal, animate, amputation, disconnected limbs"
  • Comic Description (string): each line corresponds to a frame. Default: "wake up in the bed\nhave breakfast\nis on the road, go to company\nwork in the company\nTake a walk next to the company at noon\nlying in bed at night"
  • Ip Adapter Strength (number). Default: 0.5
  • Degree of Paired Attention at 32 x 32 self-attention layers (number). Default: 0.5
  • Degree of Paired Attention at 64 x 64 self-attention layers (number). Default: 0.5
  • Number of id images in total images (integer). Default: 3
  • Seed (integer). Default: 1
  • Number of sample steps (integer). Default: 50
  • Height (integer). Default: 768
  • Width (integer). Default: 768
  • Typesetting Style (string). Default: "Classic Comic Style"
  • Guidance scale (integer). Default: 5
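For reference, here is a minimal sketch of calling this model through the Replicate Python client. The input keys below (style_name, general_prompt, prompt_array, and so on) are assumptions inferred from the form labels above, not the confirmed schema; check the model's API tab for the exact field names.

```python
# Minimal sketch, assuming hypothetical input keys inferred from the labels above.
import replicate

output = replicate.run(
    "camenduru/story-diffusion:d9d04c7d",  # version id shown in the example output below
    input={
        "style_name": "Japanese Anime",    # assumed key: style template
        "general_prompt": "a woman img, wearing a white T-shirt, blue loose hair",  # assumed key: character description
        "prompt_array": "wake up in the bed\nhave breakfast\nwork in the company",  # assumed key: one line per frame
        "sa32_degree": 0.5,                # assumed key: Paired Attention at 32 x 32 layers
        "sa64_degree": 0.5,                # assumed key: Paired Attention at 64 x 64 layers
        "num_steps": 50,
        "guidance_scale": 5,
        "seed": 1,
        "height": 768,
        "width": 768,
    },
)
for url in output:
    print(url)  # URL of each generated frame
```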

Output


This example was created by a different version, camenduru/story-diffusion:d9d04c7d.

Run time and cost

This model costs approximately $0.11 to run on Replicate, or 9 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 111 seconds. The predict time for this model varies significantly based on the inputs.

Readme

🐣 Please follow me for new updates https://twitter.com/camenduru
🔥 Please join our discord server https://discord.gg/k5BwmmvJJU
🥳 Please join my patreon community https://patreon.com/camenduru

📋 Tutorial

  • Enter a Textual Description for the Character. If you add a Ref Image, make sure the class word you want to customize is followed by the trigger word img, e.g. man img, woman img, or girl img.
  • Enter the prompt array; each line corresponds to one generated image.
  • Choose your preferred style template.
  • If you need to change the caption, add a # to the end of a line; only the part after the # will be added as a caption to the image.
  • The [NC] symbol is a flag indicating that no characters should appear in a generated scene image. To use it, prepend "[NC]" to the line; for example, to generate a scene of falling leaves without any character, write "[NC] The leaves are falling." Currently, [NC] is supported only when using a Textual Description. (See the example after this list.)
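As a concrete illustration of the two conventions above, here is a hypothetical prompt array; the scene text is made up, and only the # and [NC] syntax matters:

```python
# Hypothetical prompt array illustrating the '#' caption and '[NC]' flags.
prompt_array = "\n".join([
    "wake up in the bed #A new day begins.",  # text after '#' becomes the caption
    "have breakfast #Coffee first.",
    "[NC] The leaves are falling.",           # [NC]: frame rendered without the character
])
```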

🕸 Replicate

https://replicate.com/camenduru/story-diffusion

🧬 Code

https://github.com/HVision-NKU/StoryDiffusion

📄 Paper

https://arxiv.org/abs/2405.01434

🌐 Page

https://storydiffusion.github.io/
