gemma3-torchao-quant-sparse

A performance-optimized adaptation of the google/gemma-3-4b-it multimodal LLM. The project integrates advanced techniques for memory and compute efficiency while preserving strong generative capabilities for image-text tasks.
Key Features

INT8 Weight-Only Quantization
- Uses torchao's Int8WeightOnlyConfig for weight-only INT8 quantization.
- Significantly reduces VRAM usage and speeds up inference.
- Maintains model output fidelity while reducing the memory footprint.
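
A minimal sketch of the quantization step, assuming recent torchao and transformers releases; the loader class and its arguments are illustrative, while the Int8WeightOnlyConfig usage follows what this project describes.

```python
import torch
from transformers import AutoModelForImageTextToText
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# Load the base multimodal checkpoint (loader class assumed; adjust to your setup).
model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Quantize linear-layer weights to INT8 in place; activations stay in bfloat16.
quantize_(model, Int8WeightOnlyConfig())
```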

Advanced Sparsity Techniques

Magnitude-based pruning:
- Zeroes out the smallest weights in linear layers based on absolute magnitude.
- Safely reduces the number of active (nonzero) parameters without modifying layer shapes.
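
As a rough illustration of this step, the helper below zeroes the lowest-magnitude weights of a single linear layer in place. The function name and selection logic are assumptions for illustration, not the project's actual API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def prune_by_magnitude(linear: nn.Linear, sparsity: float) -> None:
    """Zero the `sparsity` fraction of weights with the smallest |w|; shapes are untouched."""
    weight = linear.weight
    k = int(sparsity * weight.numel())
    if k == 0:
        return
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    weight.mul_(weight.abs() > threshold)

# Example: 2% unstructured sparsity on every linear layer of `model`.
# for module in model.modules():
#     if isinstance(module, nn.Linear):
#         prune_by_magnitude(module, sparsity=0.02)
```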

Gradual pruning:
- Increases sparsity over multiple steps (configurable, e.g., 1000 steps).
- Prevents sudden degradation in output quality.
- Ideal for testing safe sparsity ratios before applying aggressive pruning.
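
One plausible way to implement the schedule is a simple linear ramp, sketched below and reusing the prune_by_magnitude helper from the previous sketch; the actual project may use a different curve, step count, or update cadence.

```python
def target_sparsity(step: int, total_steps: int = 1000,
                    final_sparsity: float = 0.35) -> float:
    """Linearly ramp the sparsity target from 0 to `final_sparsity` over `total_steps`."""
    return final_sparsity * min(step, total_steps) / total_steps

# Re-apply magnitude pruning at the current target after each step, e.g.:
# for step in range(1, 1001):
#     ratio = target_sparsity(step)
#     for module in model.modules():
#         if isinstance(module, nn.Linear):
#             prune_by_magnitude(module, sparsity=ratio)
```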

Structured sparsity (safe / least-breaking):
- Targets entire channels (output neurons) instead of individual weights.
- Uses the per-channel L2 norm to identify the least important channels.
- Zeroes channels without physically removing them, preserving layer shapes.
- Ensures compatibility with downstream layers and avoids shape errors.
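
A minimal sketch of this "safe" structured variant: output channels with the smallest L2 norm are zeroed rather than removed, so tensor shapes and downstream layers are unaffected. The helper name is illustrative.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def zero_weakest_channels(linear: nn.Linear, ratio: float) -> None:
    """Zero the output channels (rows of the weight matrix) with the smallest L2 norm."""
    weight = linear.weight                   # shape: [out_features, in_features]
    n_prune = int(ratio * weight.size(0))
    if n_prune == 0:
        return
    channel_norms = weight.norm(p=2, dim=1)  # one L2 norm per output channel
    weakest = channel_norms.argsort()[:n_prune]
    weight[weakest] = 0.0
    if linear.bias is not None:
        linear.bias[weakest] = 0.0           # keep the zeroed channels fully silent
```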

Flexible sparsity ratios:
- Supports low ratios (1–2%) for minimal impact and high ratios (up to 35%) for aggressive optimization.
- Gradual sparsity can be combined with structured or magnitude pruning for maximum flexibility.

Filter map:
- Excludes critical layers such as embeddings, normalization layers, and output heads from pruning.
- Ensures that pruning does not break model outputs.
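
The filter can be as simple as a name-based predicate, sketched below together with the earlier zero_weakest_channels helper; the excluded substrings are assumptions based on common Gemma layer naming, not a verified list from this repository.

```python
import torch.nn as nn

EXCLUDED_SUBSTRINGS = ("embed", "norm", "lm_head")  # assumed names of fragile layers

def prunable(name: str, module: nn.Module) -> bool:
    """Prune only linear layers whose qualified names avoid the excluded patterns."""
    return isinstance(module, nn.Linear) and not any(
        s in name.lower() for s in EXCLUDED_SUBSTRINGS
    )

# Apply structured pruning only where the filter allows it:
# for name, module in model.named_modules():
#     if prunable(name, module):
#         zero_weakest_channels(module, ratio=0.02)  # start with a low, safe ratio
```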

Selective Torch Compile
- Only critical layers are compiled with torch.compile for faster execution.
- Reduces compilation overhead while improving inference speed.
- Additional layers can be compiled selectively if safe, excluding layers such as normalization or output projections.
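
A sketch of how such selective compilation might look; the rule of compiling only decoder MLP blocks while skipping anything whose name suggests a norm, embedding, or output head is an assumption for illustration.

```python
import torch
import torch.nn as nn

SKIP_SUBSTRINGS = ("norm", "embed", "lm_head")  # assumed names of layers left eager

def selectively_compile(model: nn.Module) -> nn.Module:
    """Compile only selected submodules' forward passes; everything else stays eager."""
    for name, module in model.named_modules():
        lowered = name.lower()
        if any(s in lowered for s in SKIP_SUBSTRINGS):
            continue
        if lowered.endswith("mlp"):  # example criterion: decoder MLP blocks only
            module.forward = torch.compile(module.forward)
    return model

# model = selectively_compile(model)
```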