paragekbote/smollm3-3b-smashed

SmolLM3-3B with Pruna for lightning-fast, memory-efficient AI inference.

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

SmolLM3-3B Pruna Optimized

This setup packages SmolLM3-3B with Pruna, combining torch.compile and HQQ quantization for faster text generation with reduced VRAM usage while preserving output quality (a configuration sketch follows the list below).

  • Quantization (HQQ): reduces memory footprint and compute cost while maintaining output quality.

  • Torch Compile: leverages torch.compile to accelerate inference by fusing operations and optimizing execution.
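
A minimal configuration sketch of how such a setup can be assembled with Pruna is shown below. It assumes the upstream HuggingFaceTB/SmolLM3-3B checkpoint and the pruna package; option names can vary between Pruna releases.

```python
# Hedged sketch: combine HQQ quantization with torch.compile via Pruna.
# The option names ("quantizer", "compiler") follow current Pruna docs
# but may differ in other versions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"           # HQQ weight quantization
smash_config["compiler"] = "torch_compile"  # fuse ops and optimize execution

# Returns an optimized ("smashed") model ready for generation.
smashed_model = smash(model=model, smash_config=smash_config)
```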

Features

  • Dynamic text generation — supports reasoning or non-reasoning modes (think / no_think).
  • Optimized inference — quantization + torch.compile deliver faster responses with reduced GPU memory.
  • Configurable outputs — control max_new_tokens (up to 16k), seed, and mode.
  • Output saving — automatically stores generated responses as .txt files (see the sketch after this list).
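
A rough sketch of this generation flow, using plain transformers, appears below. The function signature, the enable_thinking flag, and the output path are illustrative assumptions rather than the exact predictor code behind this model.

```python
# Hedged sketch of think / no_think generation with output saving.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

def generate(prompt: str, mode: str = "think", max_new_tokens: int = 512, seed: int = 0) -> str:
    torch.manual_seed(seed)  # make sampling reproducible
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=(mode == "think"),  # "think" -> reasoning, "no_think" -> direct answer
        return_tensors="pt",
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=True)
    text = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    with open("output.txt", "w") as f:  # mirror the .txt output saving
        f.write(text)
    return text
```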

Use Cases

  • Chatbots & assistants → create interactive, lightweight agents that run efficiently.
  • Idea generation → draft content such as blogs, poems, or creative writing.
  • Prototyping reasoning tasks → test think vs no_think modes for chain-of-thought reasoning.
  • On-device deployment → smaller memory footprint makes it suitable for constrained GPU environments.
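
To call the hosted model, something like the following Replicate client sketch should work. The input field names (prompt, mode, max_new_tokens, seed) mirror the options listed above but are assumptions about the deployed schema; check the model's API schema for the exact parameters.

```python
# Hypothetical invocation via the Replicate Python client.
import replicate

output = replicate.run(
    "paragekbote/smollm3-3b-smashed",
    input={
        "prompt": "Outline three blog post ideas about efficient LLM inference.",
        "mode": "no_think",        # or "think" for chain-of-thought reasoning
        "max_new_tokens": 1024,
        "seed": 42,
    },
)
print("".join(output))  # language models usually return a list of text chunks
```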