paragekbote/gemma3-torchao-quant-sparse

An optimized gemma-3-4b setup with INT8 weight-only quantization, selective torch.compile, and sparsity for efficient inference.


gemma3-torchao-quant-sparse is a performance-optimized adaptation of the google/gemma-3-4b-it multimodal LLM. It integrates advanced techniques for memory and compute efficiency while preserving strong generative capabilities for image-text tasks.


Key Features

  • INT8 Weight-Only Quantization (see the sketch below)

    • Uses torchao’s Int8WeightOnlyConfig for weight-only INT8 quantization.
    • Significantly reduces VRAM usage and speeds up inference.
    • Maintains output fidelity while shrinking the memory footprint.
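
A minimal sketch of the quantization step, assuming a recent torchao release that exposes the config-based API and a transformers version with Gemma 3 support (Gemma3ForConditionalGeneration); the loading arguments are illustrative:

```python
import torch
from transformers import Gemma3ForConditionalGeneration
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Load the base multimodal checkpoint in bf16.
model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Weight-only INT8: eligible linear-layer weights are stored as INT8 while
# activations stay in bf16, shrinking VRAM with minimal fidelity loss.
quantize_(model, Int8WeightOnlyConfig())
```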

  • Advanced Sparsity Techniques

    • Magnitude-based pruning (see the first sketch after this list):

      • Zeroes out the smallest weights in linear layers based on absolute magnitude.
      • Safely reduces parameter count without modifying layer shapes.
    • Gradual pruning (see the schedule sketch after this list):

      • Increases sparsity over multiple steps (configurable, e.g., 1,000 steps).
      • Prevents sudden degradation in output quality.
      • Ideal for testing safe sparsity ratios before applying aggressive pruning.
    • Structured sparsity (safe, least-disruptive; see the channel-pruning sketch after this list):

      • Targets entire channels (output neurons) instead of individual weights.
      • Uses the L2 norm per channel to identify the least important channels.
      • Zeroes channels without physically removing them to preserve layer shapes.
      • Ensures compatibility with downstream layers and avoids shape errors.
    • Flexible sparsity ratios:

      • Supports low ratios (1–2%) for minimal impact and high ratios (up to 35%) for aggressive optimization.
      • Gradual sparsity can be combined with structured or magnitude pruning for maximum flexibility.
    • Filter map (applied in the first sketch after this list):

      • Excludes critical layers such as embeddings, normalization layers, and output heads from pruning.
      • Ensures that pruning does not break model outputs.
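
A minimal sketch of magnitude-based pruning combined with the filter map, in plain PyTorch and applied to the full-precision weights; the exclusion patterns and default ratio are illustrative rather than the project's exact configuration:

```python
import torch
import torch.nn as nn

# Filter map: name patterns for layers that must never be pruned.
EXCLUDE_PATTERNS = ("embed", "norm", "lm_head")

def magnitude_prune_(model: nn.Module, sparsity: float = 0.02) -> None:
    """Zero the smallest-magnitude entries of each eligible nn.Linear, in place.

    Layer shapes are untouched, so downstream layers keep working.
    """
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(p in name.lower() for p in EXCLUDE_PATTERNS):
            continue  # protected by the filter map
        with torch.no_grad():
            w = module.weight
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            # Threshold at the k-th smallest absolute value; zero everything below it.
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))
```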
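
A minimal sketch of a gradual pruning schedule; the cubic ramp is a common choice and an assumption here, not necessarily the exact schedule this project uses:

```python
def gradual_sparsity(step: int, total_steps: int = 1000,
                     final_sparsity: float = 0.35) -> float:
    """Sparsity ratio at `step`, ramping smoothly from 0 to `final_sparsity`."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)

# Usage: re-apply magnitude pruning with an increasing ratio and check output
# quality at each checkpoint before committing to more aggressive ratios.
# for step in range(0, 1001, 100):
#     magnitude_prune_(model, sparsity=gradual_sparsity(step))
```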
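
A minimal sketch of the shape-preserving structured variant: rank each linear layer's output channels by L2 norm and zero the weakest ones without removing them:

```python
import torch
import torch.nn as nn

def structured_prune_(linear: nn.Linear, channel_sparsity: float = 0.02) -> None:
    """Zero the lowest-L2-norm output channels of a linear layer, in place."""
    with torch.no_grad():
        # One L2 norm per output channel (row of the [out, in] weight matrix).
        norms = linear.weight.norm(p=2, dim=1)
        n_prune = int(norms.numel() * channel_sparsity)
        if n_prune == 0:
            return
        weakest = torch.argsort(norms)[:n_prune]
        # Channels are zeroed, not deleted, so tensor shapes and downstream
        # layers are unaffected.
        linear.weight[weakest, :] = 0
        if linear.bias is not None:
            linear.bias[weakest] = 0
```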

  • Selective Torch Compile (see the sketch below)

    • Only critical layers are compiled with torch.compile for faster execution.
    • Reduces compilation overhead while improving inference speed.
    • Additional layers can be compiled selectively where safe, excluding layers such as normalization or output projections.
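
A minimal sketch of selective compilation, assuming PyTorch 2.2+ (for nn.Module.compile) and Hugging Face class names for the Gemma 3 decoder blocks:

```python
import torch.nn as nn

def compile_decoder_blocks(model: nn.Module, mode: str = "reduce-overhead") -> None:
    """Compile only the transformer decoder blocks; embeddings, normalization
    layers, and the output head stay in eager mode, keeping compile overhead low."""
    for module in model.modules():
        # Select decoder blocks by class name (e.g. Gemma3DecoderLayer).
        if "decoderlayer" in type(module).__name__.lower():
            module.compile(mode=mode)  # in-place wrapper around torch.compile
```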