paragekbote/gemma3-torchao-quant-sparse

An optimized gemma-3-4b setup with INT8 weight-only quantization, torch.compile, and sparsity for efficient inference.

Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

gemma3-torchao-quant-sparse

A performance-optimized adaptation of the google/gemma-3-4b-it multimodal LLM that integrates memory- and compute-efficiency techniques while preserving high-quality generation for image-text tasks.


Key Features

1. INT8 Weight-Only Quantization

  • Uses torchao’s Int8WeightOnlyConfig for weight-only INT8 quantization.
  • Reduces VRAM usage significantly and speeds up inference.
  • Maintains model output fidelity while lowering the memory footprint (minimal usage sketch below).
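
A minimal sketch of applying this configuration with torchao; the model-loading details (class, dtype, device) are illustrative assumptions, not necessarily the repo's exact setup:

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig
from transformers import AutoModelForImageTextToText

# Illustrative loading; the exact class and dtype used in this repo may differ.
model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# In-place weight-only INT8 quantization of the linear layers;
# activations stay in bf16, which is why output fidelity is largely preserved.
quantize_(model, Int8WeightOnlyConfig())
```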

2. Sparsity Techniques

2.1 Magnitude-Based Pruning

  • Zeroes out the smallest weights in linear layers based on absolute magnitude.
  • Safely reduces the effective parameter count without altering layer shapes (sketched below).
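
A minimal sketch of the idea, using a hypothetical helper name and unstructured per-layer pruning:

```python
import torch
import torch.nn as nn

def magnitude_prune_(model: nn.Module, sparsity: float = 0.02) -> None:
    """Zero the smallest-magnitude weights in every linear layer, in place."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            # Threshold at the k-th smallest absolute weight in this layer.
            threshold = w.abs().flatten().kthvalue(k).values
            w[w.abs() <= threshold] = 0.0
```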

2.2 Gradual Pruning

  • Incrementally increases sparsity over multiple steps (e.g., 1000 steps).
  • Prevents sudden degradation in output quality.
  • Ideal for safe exploration of sparsity ratios; one possible schedule is sketched below.
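
A possible schedule, reusing the magnitude_prune_ sketch above (the linear ramp, step count, and target ratio are assumptions):

```python
def sparsity_at(step: int, total_steps: int = 1000,
                final_sparsity: float = 0.5) -> float:
    """Ramp sparsity linearly from 0 to final_sparsity over total_steps."""
    return final_sparsity * min(step, total_steps) / total_steps

# Re-prune at a slowly growing ratio so quality degrades gradually rather
# than all at once; weights already zeroed stay zeroed on later passes.
for step in range(0, 1001, 100):
    magnitude_prune_(model, sparsity=sparsity_at(step))
```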

2.3 Layer-Norm-Based Pruning

  • Prunes the weight rows with the lowest L2 norms within each linear layer.
  • Extremely fast because it does not require forward passes.
  • Preserves output quality and keeps layer shapes intact, so the pruned layers integrate seamlessly (sketch below).
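
A minimal sketch, assuming "lowest L2 norms" refers to whole output rows of each weight matrix; rows are zeroed rather than removed, so shapes never change:

```python
import torch
import torch.nn as nn

def norm_prune_(model: nn.Module, sparsity: float = 0.02) -> None:
    """Zero the output rows with the smallest L2 norms in each linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data              # (out_features, in_features)
            row_norms = w.norm(p=2, dim=1)      # one L2 norm per output row
            n_prune = int(row_norms.numel() * sparsity)
            if n_prune == 0:
                continue
            lowest = torch.argsort(row_norms)[:n_prune]
            w[lowest] = 0.0                     # zero out rows, do not resize
```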

2.4 Flexible Sparsity Ratios

  • Supports low ratios (1–2%) for minimal impact and high ratios (up to 80%) for aggressive optimization.

2.5 Filter Map

  • Excludes critical layers such as embeddings, normalization layers, and output heads from pruning.
  • Ensures that pruning cannot break model outputs or degrade quality; a possible filter is sketched below.
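
One possible shape for such a filter; the name patterns below are assumptions about the Gemma 3 module tree:

```python
import torch.nn as nn

SKIP_PATTERNS = ("embed", "norm", "lm_head")  # assumed critical-layer names

def prunable(name: str, module: nn.Module) -> bool:
    """Return True only for linear layers that are safe to prune."""
    if any(pattern in name for pattern in SKIP_PATTERNS):
        return False
    return isinstance(module, nn.Linear)

# Gate the pruning helpers sketched above on this predicate.
to_prune = [m for n, m in model.named_modules() if prunable(n, m)]
```

torchao's quantize_ also accepts a filter_fn callback, so the same exclusion list can gate quantization as well as pruning.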

3. Selective Torch Compile

  • Only critical layers are compiled with torch.compile for faster execution.
  • Reduces compilation overhead while improving inference speed.
  • Additional layers can be compiled selectively if required (see the sketch below).
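
A sketch of the pattern; the module path is an assumption about the Gemma 3 layout, and the repo's actual choice of critical layers may differ:

```python
import torch

# Assumed path to the decoder blocks of the multimodal model.
decoder_layers = model.language_model.model.layers

# Compiling each block separately keeps graphs small, which cuts compile
# time and avoids recompiling the whole model when shapes change.
for i, layer in enumerate(decoder_layers):
    decoder_layers[i] = torch.compile(layer)
```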

Benefits

  • Significantly lower VRAM usage (~5–6 GB for the 4B model at 500–750 tokens).
  • Faster inference from torch.compile plus INT8 quantization.
  • Safe sparsity techniques allow memory-efficient experiments without breaking the model.
  • Maintains high-quality multimodal generation (image-to-text and text-to-text).