paragekbote/smollm3-3b-smashed

SmolLM3-3B with Pruna for lightning-fast, memory-efficient AI inference.


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

SmolLM3-3B Pruna Optimized

This setup packages the SmolLM3-3B model with Pruna, combining HQQ quantization and torch.compile to deliver faster text generation with reduced VRAM usage while preserving output quality.

  • Quantization (HQQ): reduces the memory footprint and compute cost while maintaining output quality.

  • Torch Compile: leverages torch.compile to accelerate inference by fusing operations and optimizing execution (a configuration sketch follows this list).
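As a rough illustration, the snippet below sketches how such a setup can be produced with Pruna's SmashConfig / smash interface. The option names ("hqq", "torch_compile") and model handling are assumptions based on Pruna's public documentation, not the exact configuration used for this deployment.

```python
# Minimal sketch: optimizing SmolLM3-3B with Pruna (HQQ quantization + torch.compile).
# Option names "hqq" and "torch_compile" are assumptions; check Pruna's docs for the
# algorithm names available in your installed version.
from transformers import AutoModelForCausalLM, AutoTokenizer
from pruna import SmashConfig, smash

model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Configure Pruna: HQQ quantizer to shrink weights, torch.compile to speed up execution.
smash_config = SmashConfig()
smash_config["quantizer"] = "hqq"
smash_config["compiler"] = "torch_compile"

# Produce the optimized ("smashed") model; it is then used like a regular transformers model.
smashed_model = smash(model=model, smash_config=smash_config)
```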

Features

  • Dynamic text generation — supports reasoning or non-reasoning modes (think / no_think).
  • Optimized inference — quantization + torch.compile give faster responses with reduced GPU memory.
  • Configurable outputs — control max_new_tokens (up to 16k), seed, and mode (see the usage example after this list).
  • Output saving — automatically stores generated responses as .txt files.
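A hedged usage sketch with the Replicate Python client is shown below. The input field names (prompt, mode, max_new_tokens, seed) are inferred from the feature list above and may not match the deployed input schema exactly.

```python
# Sketch: calling the model through the Replicate Python client.
# Requires REPLICATE_API_TOKEN in the environment. Input field names are assumptions
# inferred from the feature list; check the model's API schema for the exact names.
import replicate

output = replicate.run(
    "paragekbote/smollm3-3b-smashed",
    input={
        "prompt": "Summarize the benefits of HQQ quantization in two sentences.",
        "mode": "no_think",     # or "think" for chain-of-thought style reasoning
        "max_new_tokens": 256,  # up to 16k per the feature list
        "seed": 42,             # fixed seed for reproducible output
    },
)
print(output)
```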

Use Cases

  • Chatbots & assistants → create interactive, lightweight agents that run efficiently.
  • Idea generation → draft content such as blogs, poems, or creative writing.
  • Prototyping reasoning tasks → test think vs no_think modes for chain-of-thought reasoning (a comparison sketch follows this list).
  • On-device deployment → smaller memory footprint makes it suitable for constrained GPU environments.
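For the reasoning-prototyping case, a small comparison sketch is shown below. It reuses the assumed input fields from the earlier example and runs the same prompt and seed in both modes so the outputs can be compared side by side.

```python
# Sketch: comparing "think" and "no_think" modes on the same prompt and seed.
# Field names are the same assumptions as in the earlier usage example.
import replicate

prompt = "A farmer has 17 sheep; all but 9 run away. How many are left?"
for mode in ("think", "no_think"):
    output = replicate.run(
        "paragekbote/smollm3-3b-smashed",
        input={"prompt": prompt, "mode": mode, "max_new_tokens": 512, "seed": 7},
    )
    print(f"--- {mode} ---\n{output}\n")
```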