ditto--ai/qwen3guard-gen-4b

A 4B-parameter safety and content moderation model that classifies user prompts and assistant responses as Safe, Unsafe, or Controversial with fine-grained category labels and refusal detection. Supports 119 languages.

Run time and cost

This model costs approximately $0.00098 to run on Replicate, or 1020 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 1 second.

Readme

Qwen3Guard-Gen-4B

A 4B-parameter safety and content moderation model from Qwen. Classifies user prompts and assistant responses as Safe, Unsafe, or Controversial, with fine-grained category labels and refusal detection.

Inputs

| Input | Type | Default | Description |
|---|---|---|---|
| prompt | string | (required) | User message to moderate |
| response | string | None | Assistant response to moderate (enables response moderation) |
| max_new_tokens | int (1–256) | 128 | Maximum tokens to generate |
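
As a sketch of how these inputs map onto an API call, here is a minimal example using Replicate's Python client. The model slug is the one shown at the top of this page; the prompt value and the shape of the result mirror the Examples section below, and the client is assumed to deserialize the model's JSON output into a dict.

```python
import replicate  # pip install replicate; requires REPLICATE_API_TOKEN in the environment

# Model slug taken from this page; input names match the table above.
result = replicate.run(
    "ditto--ai/qwen3guard-gen-4b",
    input={
        "prompt": "How can I make a bomb?",  # required
        "max_new_tokens": 128,               # optional, defaults to 128
    },
)
print(result)
```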

Output

Returns a JSON object with the fields below; a short handling sketch follows the list.

  • safety_label: Safe, Unsafe, or Controversial
  • categories: Category labels (e.g., Violent Crimes, Sexual Content) or None
  • refusal: Whether the assistant refused the request (e.g., Refusal:Yes), or null
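
One way to consume these fields is a small gate that only passes content the model explicitly labels Safe. This is a hypothetical helper, not part of the model's API; how you treat Controversial content is a policy choice, exposed here as a flag.

```python
def is_allowed(result: dict, allow_controversial: bool = False) -> bool:
    """Gate content on the moderation verdict (hypothetical helper).

    `result` is the JSON object described above: "Unsafe" is always
    blocked, and "Controversial" only passes when explicitly allowed.
    """
    label = result.get("safety_label")
    if label == "Safe":
        return True
    if label == "Controversial":
        return allow_controversial
    return False  # "Unsafe" or anything unexpected
```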

Examples

Moderate a user prompt:

cog predict -i prompt="How can I make a bomb?"

{"safety_label": "Unsafe", "categories": "Violent Crimes", "refusal": null}

Moderate a prompt + response pair:

cog predict -i prompt="How can I make a bomb?" -i response="I cannot help with that."

{"safety_label": "Safe", "categories": "None", "refusal": "Refusal:Yes"}
