# SmolLM3-3B Pruna Optimized
This setup uses Pruna with `torch.compile` and HQQ quantization to package the SmolLM3-3B model, enabling faster text generation with reduced VRAM usage while preserving output quality (a configuration sketch follows the list below).
- **Quantization (HQQ)** reduces memory footprint and compute cost while maintaining output quality.
- **Torch Compile** leverages `torch.compile` to accelerate inference by fusing operations and optimizing execution.
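
A minimal sketch of how such a setup could be wired together with Pruna's `SmashConfig` and `smash` API, assuming the open-source `pruna` package and the upstream `HuggingFaceTB/SmolLM3-3B` checkpoint; the configuration keys mirror Pruna's documented interface and are not copied from this repo's build script:

```python
# Sketch: smashing SmolLM3-3B with HQQ quantization + torch.compile via Pruna.
# Assumes the open-source `pruna` package; verify config keys against your
# installed version, since they are written from Pruna's docs, not this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed upstream checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"           # HQQ: low-bit weight quantization
smash_config["compiler"] = "torch_compile"  # fuse ops and optimize execution

smashed_model = smash(model=model, smash_config=smash_config)
```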
## Features
- Dynamic text generation — supports reasoning and non-reasoning modes (`think`/`no_think`).
- Optimized inference — quantization + `torch.compile` give faster responses with reduced GPU memory.
- Configurable outputs — control `max_new_tokens` (up to 16k), the random seed, and the generation mode (see the generation sketch below).
- Output saving — automatically stores generated responses as `.txt` files.
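
As a hedged illustration of these knobs, continuing from the setup sketch above (`tokenizer`, `smashed_model`, and `device` are defined there); the `outputs/` directory and file name are illustrative assumptions, not this repo's actual script:

```python
# Sketch: seeded generation with a configurable token budget, saved to a .txt file.
from pathlib import Path

import torch

torch.manual_seed(42)  # fixed seed for reproducible sampling

messages = [{"role": "user", "content": "Write a short poem about GPUs."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

output_ids = smashed_model.generate(input_ids, max_new_tokens=512, do_sample=True)
text = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

out_dir = Path("outputs")  # illustrative save location
out_dir.mkdir(exist_ok=True)
(out_dir / "response.txt").write_text(text, encoding="utf-8")
```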
## Use Cases
- Chatbots & assistants → build interactive, lightweight agents that run efficiently.
- Idea generation → draft content such as blog posts, poems, or other creative writing.
- Prototyping reasoning tasks → compare `think` vs `no_think` modes for chain-of-thought reasoning (see the sketch after this list).
- On-device deployment → the smaller memory footprint suits constrained GPU environments.
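
For prototyping the two modes, a minimal comparison sketch, reusing the setup above and assuming the `enable_thinking` chat-template flag documented in the upstream SmolLM3 model card:

```python
# Sketch: toggling SmolLM3's reasoning trace on and off for the same prompt.
# `enable_thinking` follows the upstream chat template; verify against the
# tokenizer shipped with this repo.
prompt = [{"role": "user", "content": "How many 'r's are in 'strawberry'?"}]

for thinking in (True, False):
    ids = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        enable_thinking=thinking,  # True -> reasoning trace, False -> direct answer
        return_tensors="pt",
    ).to(device)
    out = smashed_model.generate(ids, max_new_tokens=256)
    print(f"--- enable_thinking={thinking} ---")
    print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```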