# SmolLM3-3B Pruna Optimized
This setup uses Pruna with `torch.compile` and HQQ quantization to package the SmolLM3-3B model, enabling faster text generation with reduced VRAM usage while preserving output quality (a configuration sketch follows the list below).
- **Quantization (HQQ)** reduces memory footprint and compute cost while maintaining output quality.
- **Torch Compile** leverages `torch.compile` to accelerate inference by fusing operations and optimizing execution.
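
A minimal sketch of how such a setup could be wired together with Pruna's `SmashConfig` and `smash` API, assuming the open-source `pruna` package and the upstream `HuggingFaceTB/SmolLM3-3B` checkpoint; the configuration keys mirror Pruna's documented interface and are not copied from this repo's build script:

```python
# Sketch: smashing SmolLM3-3B with HQQ quantization + torch.compile via Pruna.
# Assumes the open-source `pruna` package; verify config keys against your
# installed version, since they are written from Pruna's docs, not this repo.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed upstream checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"           # HQQ: low-bit weight quantization
smash_config["compiler"] = "torch_compile"  # fuse ops and optimize execution

smashed_model = smash(model=model, smash_config=smash_config)
```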
## Features
- Dynamic text generation — supports reasoning and non-reasoning modes (`think`/`no_think`).
- Optimized inference — quantization + `torch.compile` give faster responses with reduced GPU memory.
- Configurable outputs — control `max_new_tokens` (up to 16k), the random seed, and the generation mode (see the generation sketch below).
- Output saving — automatically stores generated responses as `.txt` files.
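
As a hedged illustration of these knobs, continuing from the setup sketch above (`tokenizer`, `smashed_model`, and `device` are defined there); the `outputs/` directory and file name are illustrative assumptions, not this repo's actual script:

```python
# Sketch: seeded generation with a configurable token budget, saved to a .txt file.
from pathlib import Path

import torch

torch.manual_seed(42)  # fixed seed for reproducible sampling

messages = [{"role": "user", "content": "Write a short poem about GPUs."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(device)

output_ids = smashed_model.generate(input_ids, max_new_tokens=512, do_sample=True)
text = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

out_dir = Path("outputs")  # illustrative save location
out_dir.mkdir(exist_ok=True)
(out_dir / "response.txt").write_text(text, encoding="utf-8")
```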
## Use Cases
- Chatbots & assistants → build interactive, lightweight agents that run efficiently.
- Idea generation → draft content such as blog posts, poems, or other creative writing.
- Prototyping reasoning tasks → compare `think` vs `no_think` modes for chain-of-thought reasoning (see the sketch after this list).
- On-device deployment → the smaller memory footprint suits constrained GPU environments.
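
For prototyping the two modes, a minimal comparison sketch, reusing the setup above and assuming the `enable_thinking` chat-template flag documented in the upstream SmolLM3 model card:

```python
# Sketch: toggling SmolLM3's reasoning trace on and off for the same prompt.
# `enable_thinking` follows the upstream chat template; verify against the
# tokenizer shipped with this repo.
prompt = [{"role": "user", "content": "How many 'r's are in 'strawberry'?"}]

for thinking in (True, False):
    ids = tokenizer.apply_chat_template(
        prompt,
        add_generation_prompt=True,
        enable_thinking=thinking,  # True -> reasoning trace, False -> direct answer
        return_tensors="pt",
    ).to(device)
    out = smashed_model.generate(ids, max_new_tokens=256)
    print(f"--- enable_thinking={thinking} ---")
    print(tokenizer.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```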