paragekbote/gemma3-torchao-quant-sparse

An optimized gemma-3-4b setup with INT8 weight-only quantization, selective torch.compile, and sparsity for efficient inference.


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

gemma3-torchao-quant-sparse

gemma3-torchao-quant-sparse is a performance-optimized adaptation of the google/gemma-3-4b-it multimodal LLM. It combines INT8 weight-only quantization, pruning-based sparsity, and selective compilation for memory and compute efficiency while preserving strong generative capabilities on image-text tasks.


Key Features

INT8 Weight-Only Quantization

  • Quantizes weights to INT8 via torchao's Int8WeightOnlyConfig.
  • Significantly reduces VRAM usage and speeds up inference.
  • Maintains output fidelity while shrinking the memory footprint.
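
As a minimal sketch (assuming torchao's config-style API; the toy module below stands in for the loaded Gemma 3 model), the quantization step looks like this:

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Stand-in for the loaded Gemma 3 model; quantize_ accepts any nn.Module.
model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 512),
).to(device="cuda", dtype=torch.bfloat16)

# Swap every nn.Linear weight for an INT8 tensor in place.
# Activations stay in bf16, which is why output fidelity is largely preserved.
quantize_(model, Int8WeightOnlyConfig())

x = torch.randn(1, 512, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    print(model(x).shape)
```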

Advanced Sparsity Techniques

Magnitude-based pruning

  • Zeroes out the smallest weights in linear layers based on absolute magnitude.
  • Safely reduces the effective parameter count without modifying layer shapes.
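
One possible implementation (a sketch; `magnitude_prune_` is an illustrative name, not necessarily the exact routine used here):

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Module, sparsity: float = 0.02) -> None:
    """Zero out the `sparsity` fraction of smallest-magnitude weights in
    every nn.Linear. Weights are masked, not removed, so layer shapes
    stay intact."""
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            w = layer.weight.data
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            # The k-th smallest |w| in this layer acts as the cut-off.
            threshold = w.abs().flatten().float().kthvalue(k).values
            w.mul_((w.abs().float() > threshold).to(w.dtype))
```
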
Gradual pruning

  • Increases sparsity incrementally over multiple steps (configurable, e.g. 1,000 steps).
  • Prevents sudden degradation in output quality.
  • Useful for testing safe sparsity ratios before applying aggressive pruning.
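
A sketch of what such a schedule might look like (a linear ramp here; the names and the 1,000-step default are illustrative, and `magnitude_prune_` is the helper sketched above):

```python
def gradual_sparsity_schedule(step: int, total_steps: int = 1000,
                              final_sparsity: float = 0.10) -> float:
    """Linearly ramp the target sparsity from 0 up to `final_sparsity`."""
    return final_sparsity * min(step, total_steps) / total_steps

# Re-apply pruning at the ramped ratio; already-zeroed weights have the
# smallest magnitudes, so each pass extends the previous mask rather than
# starting over, and quality degrades smoothly instead of all at once.
for step in range(0, 1001, 100):
    magnitude_prune_(model, sparsity=gradual_sparsity_schedule(step))
```
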
Structured sparsity (safe / least-breaking)

  • Targets entire channels (output neurons) instead of individual weights.
  • Uses the L2 norm of each channel to identify the least important ones.
  • Zeroes channels without physically removing them, preserving layer shapes.
  • Keeps downstream layers compatible and avoids shape errors.
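
A sketch of the channel-level variant (illustrative name; the output channels are the rows of an nn.Linear weight matrix):

```python
import torch
import torch.nn as nn

def structured_channel_prune_(module: nn.Module, sparsity: float = 0.02) -> None:
    """Zero whole output channels with the smallest L2 norms.

    Channels are masked rather than removed, so weight shapes (and the
    layers consuming this output) are unaffected."""
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            # One L2 norm per output channel (row of the weight matrix).
            norms = layer.weight.data.float().norm(p=2, dim=1)
            k = int(norms.numel() * sparsity)
            if k == 0:
                continue
            _, idx = norms.topk(k, largest=False)  # least important channels
            layer.weight.data[idx, :] = 0
            if layer.bias is not None:
                layer.bias.data[idx] = 0
```
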
Flexible sparsity ratios

  • Supports low ratios (1–2%) for minimal impact and high ratios (up to 35%) for aggressive optimization.
  • Gradual sparsity can be combined with structured or magnitude pruning for maximum flexibility, as sketched below.
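
For example, the gradual schedule above can drive the structured pass at an aggressive final ratio (reusing the helpers sketched earlier):

```python
# Ramp structured channel pruning up to 35% over 1,000 steps. Zeroed
# channels have norm 0, so each pass keeps them and adds a few more.
for step in range(0, 1001, 100):
    ratio = gradual_sparsity_schedule(step, final_sparsity=0.35)
    structured_channel_prune_(model, sparsity=ratio)
```
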
Filter map

  • Excludes critical layers such as embeddings, normalization layers, and output heads from pruning.
  • Ensures that pruning does not break model outputs.
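
A sketch of such a filter (the name patterns are illustrative and would need to match the actual Gemma 3 module names; `magnitude_prune_` is the helper from above):

```python
import torch.nn as nn

# Substrings of module names that must never be pruned.
EXCLUDED_PATTERNS = ("embed", "norm", "lm_head")

def prunable(name: str, module: nn.Module) -> bool:
    """Allow pruning only for Linear layers outside the critical set."""
    if not isinstance(module, nn.Linear):
        return False
    return not any(p in name.lower() for p in EXCLUDED_PATTERNS)

# Apply magnitude pruning only where the filter allows it.
for name, module in model.named_modules():
    if prunable(name, module):
        magnitude_prune_(module, sparsity=0.02)
```
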
Selective Torch Compile

  • Only critical layers are compiled with torch.compile, for faster execution.
  • Reduces compilation overhead while improving inference speed.
  • Additional layers can be compiled selectively where safe, excluding layers such as normalization or output projections.
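
A sketch of how submodules can be compiled selectively (the include/exclude patterns are illustrative):

```python
import torch
import torch.nn as nn

def selectively_compile_(model: nn.Module,
                         include=("mlp", "self_attn"),
                         exclude=("norm", "embed", "lm_head")) -> None:
    """Swap chosen submodules for torch.compile-wrapped versions in place,
    so only the hot paths pay compile time; everything else stays eager."""
    targets = [
        name for name, _ in model.named_modules()
        if name.rsplit(".", 1)[-1] in include
        and not any(p in name.lower() for p in exclude)
    ]
    for name in targets:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        # torch.compile returns a wrapper module; installing it on the parent
        # means this submodule alone runs through the compiled path.
        setattr(parent, child_name,
                torch.compile(parent.get_submodule(child_name)))
```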