alicewuv/whiskii-chat

Fine-Tuned Qwen2.5-7B-Instruct

Public
147 runs

Run time and cost

This model costs approximately $0.0054 to run on Replicate, or 185 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 4 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Whiskii-chat (based on Qwen2.5-7B-Instruct)

Model details

  • Base: Qwen2.5-7B-Instruct (fine-tuned, uncensored variant)
  • Parameters: ~7.6B
  • Source: huggingface.co/Qwen/Qwen2.5-7B-Instruct
  • License: GPL-3.0 (inherits from upstream repo)
  • Chat formatting: Uses the tokenizer’s apply_chat_template() to produce the Qwen chat format.

⚠️ Safety & compliance: This is an uncensored model. If you publish it, configure external moderation/guardrails as needed and comply with Replicate’s policies and the model’s license.
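For reference, Qwen’s chat template follows the ChatML layout. The deployed model builds this via the tokenizer’s apply_chat_template(); the sketch below is only an illustration of the resulting string, not the code the model runs.

```python
# Illustrative sketch of the ChatML-style layout that Qwen's
# apply_chat_template() produces. The actual model uses the real
# tokenizer method; this just shows the resulting prompt shape.
def build_qwen_prompt(system_prompt: str, user_prompt: str) -> str:
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompt = build_qwen_prompt("You are a helpful assistant.", "Hi!")
```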


Inputs

| Name | Type | Default | Range | Description |
|---|---|---|---|---|
| prompt | string | — | (required) | User message/content to generate from. |
| system_prompt | string | "You are a helpful assistant." | — | Optional system/behavior instruction, placed before user content. |
| max_new_tokens | integer | 512 | 1–4096 | Maximum new tokens to generate. |
| temperature | float | 0.7 | 0.0–2.0 | Sampling temperature; set 0 for greedy decoding. |
| top_p | float | 0.9 | 0.0–1.0 | Nucleus sampling (top-p). |
| repetition_penalty | float | 1.05 | 0.8–2.0 | Penalty to reduce repetition. |
| stop | string | null | — | Optional single token used as eos_token_id. |
| n | integer | 1 | 1–4 | Number of candidates to generate. Multiple candidates are concatenated with separators in the single-string output. |
| seed | integer | null | — | Optional RNG seed for reproducibility. |
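Taken together, the inputs form a single JSON payload. A hedged example follows; the parameter values are illustrative, not tuned recommendations.

```python
# Illustrative input payload covering the parameters above;
# values are examples, not recommendations.
example_input = {
    "prompt": "Summarize the benefits of unit testing in three bullets.",
    "system_prompt": "You are a concise technical writer.",
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.05,
    "n": 1,
    "seed": 42,
}
```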

Environment knobs

  • LOAD_IN_8BIT=1 — load weights in 8-bit via bitsandbytes (helps fit smaller GPUs).
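A minimal sketch of how such a flag is typically consumed, assuming the template passes a load_in_8bit option through to the model-loading call (the function name and kwargs here are illustrative, not confirmed from the source):

```python
# Sketch: translate the LOAD_IN_8BIT environment flag into model-loading
# keyword arguments (assumption: the template forwards load_in_8bit to a
# bitsandbytes-backed loader). Takes a dict so it is easy to test; in the
# real container this would be os.environ.
def model_load_kwargs(env: dict) -> dict:
    kwargs = {"device_map": "auto"}
    if env.get("LOAD_IN_8BIT") == "1":
        kwargs["load_in_8bit"] = True
    return kwargs
```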

Output

  • Type: string
  • Behavior:
    • Returns a plain string when n = 1.
    • When n > 1, multiple candidates are joined with \n\n---\n\n between them.
    • The code trims content prior to the \nassistant\n marker (Qwen chat template) if present.
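The trimming and joining described above can be reproduced client-side. A sketch, using the marker and separator strings from the description:

```python
SEPARATOR = "\n\n---\n\n"          # joins candidates when n > 1
ASSISTANT_MARKER = "\nassistant\n"  # Qwen chat-template marker

def clean_candidate(text: str) -> str:
    # Drop everything before the assistant marker, if present.
    idx = text.find(ASSISTANT_MARKER)
    return text[idx + len(ASSISTANT_MARKER):] if idx != -1 else text

def split_candidates(output: str) -> list:
    # Recover individual candidates from the n > 1 concatenated output.
    return output.split(SEPARATOR)
```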

Usage

Python

import replicate

client = replicate.Client(api_token="<REPLICATE_API_TOKEN>")
version = "alicewuv/whiskii-chat:<version>"

out = client.run(version, input={
    "prompt": "Write a playful limerick about cats and cloud GPUs.",
    "temperature": 0.6,
    "max_new_tokens": 120,
})
print(out)

JavaScript

import Replicate from "replicate";
const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

const output = await replicate.run("alicewuv/whiskii-chat:<version>", {
  input: {
    prompt: "Create a short onboarding message for new beta users.",
    temperature: 0.5,
    max_new_tokens: 120,
  },
});

console.log(output);

Example prompts

  • “Draft a product announcement (150 words) for a new markdown note editor with AI autocomplete.”
  • “Explain transformers to a 10-year-old in 5 sentences.”
  • “Rewrite this paragraph to be more concise: <paste text>.”

Limitations

  • As an uncensored model, output may be unfiltered; add external moderation if deploying publicly.
  • The optional stop input expects a single token (not a full string sequence). For substring stops, add post-processing.
  • Streaming tokens are not enabled in this template (can be added later).
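For the substring-stop post-processing mentioned above, a minimal client-side sketch (the function name is illustrative):

```python
def truncate_at_stop(text: str, stop_sequences: list) -> str:
    # Cut the generated text at the earliest occurrence of any
    # stop string; leave it unchanged if none are found.
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```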