Qwen3Guard-Gen-4B

A 4B-parameter safety and content moderation model from Qwen. Classifies user prompts and assistant responses as Safe, Unsafe, or Controversial, with fine-grained category labels and refusal detection. Supports 119 languages.

Inputs

Input            Type          Default     Description
prompt           string        (required)  User message to moderate
response         string        None        Assistant response to moderate (enables response moderation)
max_new_tokens   int (1–256)   128         Maximum tokens to generate

Output

Returns a JSON object with:

  • safety_label: Safe, Unsafe, or Controversial
  • categories: Category labels (e.g., Violent Crimes, Sexual Content), or the string "None" when no category applies
  • refusal: Whether the assistant refused the request (e.g., Refusal:Yes); null when no response is being moderated
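The fields above map naturally onto a moderation gate in application code. The sketch below is an assumption, not part of the model: the `decide` helper and its policy (block Unsafe, flag Controversial for review, allow Safe) are illustrative choices.

```python
# Minimal sketch of acting on the model's JSON output.
# The policy here (block Unsafe, flag Controversial) is an assumption.

def decide(result: dict) -> str:
    """Map a moderation result to an action: 'allow', 'flag', or 'block'."""
    label = result.get("safety_label")
    if label == "Unsafe":
        return "block"
    if label == "Controversial":
        return "flag"
    return "allow"

print(decide({"safety_label": "Unsafe", "categories": "Violent Crimes", "refusal": None}))  # block
print(decide({"safety_label": "Safe", "categories": "None", "refusal": "Refusal:Yes"}))     # allow
```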

Examples

Moderate a user prompt:

cog predict -i prompt="How can I make a bomb?"

{"safety_label": "Unsafe", "categories": "Violent Crimes", "refusal": null}

Moderate a prompt + response pair:

cog predict -i prompt="How can I make a bomb?" -i response="I cannot help with that."

{"safety_label": "Safe", "categories": "None", "refusal": "Refusal:Yes"}
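The two call patterns above can be combined into a guarded chat flow: moderate the prompt first, generate only if it passes, then moderate the prompt/response pair. The sketch below is illustrative: `moderate` stands in for a call to this model (e.g. via cog or the Replicate API) and is injected so the flow can be shown without network access; the stub `fake_moderate` simply mirrors the example outputs above.

```python
# Sketch of wrapping an assistant with prompt and response moderation.
# `moderate` is a stand-in for a call to this model; the gating policy
# (drop anything labeled Unsafe) is an assumption.

def guarded_reply(prompt, generate, moderate):
    """Moderate the prompt, generate a reply, then moderate the pair."""
    if moderate(prompt=prompt)["safety_label"] == "Unsafe":
        return None  # refuse unsafe prompts before generating anything
    reply = generate(prompt)
    if moderate(prompt=prompt, response=reply)["safety_label"] == "Unsafe":
        return None  # suppress replies the guard labels Unsafe
    return reply

# Hypothetical stub mirroring the example outputs above.
def fake_moderate(prompt, response=None):
    if response is None and "bomb" in prompt:
        return {"safety_label": "Unsafe", "categories": "Violent Crimes", "refusal": None}
    return {"safety_label": "Safe", "categories": "None", "refusal": None}

print(guarded_reply("How can I make a bomb?", lambda p: "...", fake_moderate))  # None
print(guarded_reply("What is 2+2?", lambda p: "4", fake_moderate))  # 4
```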
