# HQQ Troubleshooting Guide
## Installation Issues
### Package Not Found
**Error**: `ModuleNotFoundError: No module named 'hqq'`
**Fix**:
```bash
pip install hqq
# Verify installation
python -c "import hqq; print(hqq.__version__)"
```
### Backend Dependencies Missing
**Error**: `ImportError: Cannot import marlin backend`
**Fix**:
```bash
# Install specific backend
pip install hqq[marlin]
# Or all backends
pip install hqq[all]
# For BitBlas
pip install bitblas
# For TorchAO
pip install torchao
```
### CUDA Version Mismatch
**Error**: `RuntimeError: CUDA error: no kernel image is available`
**Fix**:
```bash
# Check CUDA version
nvcc --version
python -c "import torch; print(torch.version.cuda)"
# Reinstall PyTorch with matching CUDA
pip install torch --index-url https://download.pytorch.org/whl/cu121
# Then reinstall hqq
pip install hqq --force-reinstall
```
## Quantization Errors
### Out of Memory During Quantization
**Error**: `torch.cuda.OutOfMemoryError`
**Solutions**:
1. **Use CPU offloading**:
```python
from transformers import AutoModelForCausalLM, HqqConfig
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=HqqConfig(nbits=4, group_size=64),
    device_map="auto",
    offload_folder="./offload"
)
```
2. **Quantize layer by layer**:
```python
from hqq.models.hf.base import AutoHQQHFModel
model = AutoHQQHFModel.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=config,
    device_map="sequential"
)
```
3. **Reduce group size**:
```python
config = HqqConfig(
    nbits=4,
    group_size=32  # Smaller groups use less memory during quantization
)
```
### NaN Values After Quantization
**Error**: `RuntimeWarning: invalid value encountered` or NaN outputs
**Solutions**:
1. **Check for outliers**:
```python
import torch
def check_weight_stats(model):
    for name, param in model.named_parameters():
        if param.numel() > 0:
            has_nan = torch.isnan(param).any().item()
            has_inf = torch.isinf(param).any().item()
            if has_nan or has_inf:
                print(f"{name}: NaN={has_nan}, Inf={has_inf}")
                print(f"  min={param.min():.4f}, max={param.max():.4f}")

check_weight_stats(model)
```
2. **Use higher precision for problematic layers** (a per-layer sketch follows this list):
```python
from hqq.core.quantize import BaseQuantizeConfig

# "problematic_layer" is a placeholder for the module name identified above
layer_configs = {
    "problematic_layer": BaseQuantizeConfig(nbits=8, group_size=128),
    "default": BaseQuantizeConfig(nbits=4, group_size=64)
}
```
3. **Skip embedding/lm_head**:
```python
config = HqqConfig(
    nbits=4,
    group_size=64,
    skip_modules=["embed_tokens", "lm_head"]
)
```
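As referenced in solution 2 above, here is a minimal sketch of how per-layer configurations are typically applied via hqq's `AutoHQQHFModel.quantize_model`. The layer tags below assume a Llama-style architecture and are illustrative; adapt them to your model:
```python
import torch
from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Higher precision for attention, lower for MLP (illustrative split)
q8_config = BaseQuantizeConfig(nbits=8, group_size=128)
q4_config = BaseQuantizeConfig(nbits=4, group_size=64)

quant_config = {
    "self_attn.q_proj": q8_config,
    "self_attn.k_proj": q8_config,
    "self_attn.v_proj": q8_config,
    "self_attn.o_proj": q8_config,
    "mlp.gate_proj": q4_config,
    "mlp.up_proj": q4_config,
    "mlp.down_proj": q4_config,
}

# `model` is a full-precision HF model already loaded in memory
AutoHQQHFModel.quantize_model(
    model, quant_config=quant_config,
    compute_dtype=torch.float16, device="cuda"
)
```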
### Wrong Output Shape
**Error**: `RuntimeError: shape mismatch`
**Fix**:
```python
# Ensure the quantization axis matches your model's weight layout
from hqq.core.quantize import BaseQuantizeConfig

config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    axis=1  # Usually 1 for most models; try 0 if shape errors persist
)
```
## Backend Issues
### Marlin Backend Not Working
**Error**: `RuntimeError: Marlin kernel not available`
**Requirements**:
- Ampere (A100) or newer GPU (compute capability >= 8.0)
- 4-bit quantization only
- Group size must be 128
**Fix**:
```python
# Check GPU compatibility
import torch
from hqq.core.quantize import HQQLinear

device = torch.cuda.get_device_properties(0)
print(f"Compute capability: {device.major}.{device.minor}")

# Marlin requires compute capability >= 8.0
if device.major >= 8:
    HQQLinear.set_backend("marlin")
else:
    HQQLinear.set_backend("aten")  # Fallback
```
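When the GPU check passes, make sure the quantization settings also satisfy the Marlin constraints listed above. A minimal sketch using the `HqqConfig` path shown elsewhere in this guide (the `axis=1` setting is an assumption):
```python
from transformers import HqqConfig

# Marlin constraints from the requirements above: 4-bit weights, group size 128
marlin_compatible_config = HqqConfig(
    nbits=4,
    group_size=128,
    axis=1  # Assumed; adjust if your setup requires otherwise
)
```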
### TorchAO Backend Errors
**Error**: `ImportError: torchao not found`
**Fix**:
```bash
pip install torchao
# Verify
python -c "import torchao; print('TorchAO installed')"
```
**Error**: `RuntimeError: torchao int4 requires specific shapes`
**Fix**:
```python
# TorchAO int4 has shape requirements:
# ensure weight dimensions are divisible by 32
from hqq.core.quantize import BaseQuantizeConfig

config = BaseQuantizeConfig(
    nbits=4,
    group_size=64  # Must be a power of 2
)
```
### Fallback to PyTorch Backend
```python
from hqq.core.quantize import HQQLinear
def safe_set_backend(preferred_backend):
    """Set backend with fallback."""
    try:
        HQQLinear.set_backend(preferred_backend)
        print(f"Using {preferred_backend} backend")
    except Exception as e:
        print(f"Failed to set {preferred_backend}: {e}")
        print("Falling back to pytorch backend")
        HQQLinear.set_backend("pytorch")

safe_set_backend("marlin")
```
## Performance Issues
### Slow Inference
**Problem**: Inference slower than expected
**Solutions**:
1. **Use optimized backend**:
```python
from hqq.core.quantize import HQQLinear
# Try backends in order of speed
for backend in ["marlin", "torchao_int4", "aten", "pytorch_compile"]:
    try:
        HQQLinear.set_backend(backend)
        print(f"Using {backend}")
        break
    except Exception:
        continue
```
2. **Enable torch.compile**:
```python
import torch
model = torch.compile(model, mode="reduce-overhead")
```
3. **Use CUDA graphs** (for fixed input shapes):
```python
# Warm up with the fixed input shape so kernels and caches are initialized
for _ in range(3):
    model.generate(**inputs, max_new_tokens=100)
torch.cuda.synchronize()

# With torch.compile(mode="reduce-overhead") from the previous step,
# subsequent fixed-shape calls can be captured and replayed as CUDA graphs.
```
### High Memory Usage During Inference
**Problem**: Memory usage higher than expected for quantized model
**Solutions**:
1. **Clear KV cache**:
```python
# Manage past_key_values explicitly
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True,
    return_dict_in_generate=True
)

# Clear after use
del outputs.past_key_values
torch.cuda.empty_cache()
```
2. **Reduce batch size**:
```python
# Process prompts in smaller batches
batch_size = 4  # Reduce if OOM
for i in range(0, len(prompts), batch_size):
    batch = prompts[i:i + batch_size]
    outputs = model.generate(...)
    torch.cuda.empty_cache()
```
3. **Use gradient checkpointing** (for training):
```python
model.gradient_checkpointing_enable()
```
## Quality Issues
### Poor Generation Quality
**Problem**: Quantized model produces gibberish or low-quality output
**Solutions**:
1. **Increase precision**:
```python
# Try higher bit-width
config = HqqConfig(nbits=8, group_size=128) # Start high
# Then gradually reduce: 8 -> 4 -> 3 -> 2
```
2. **Use smaller group size**:
```python
config = HqqConfig(
    nbits=4,
    group_size=32  # Smaller = more accurate, but more memory
)
```
3. **Skip sensitive layers**:
```python
config = HqqConfig(
    nbits=4,
    group_size=64,
    skip_modules=["embed_tokens", "lm_head", "model.layers.0"]
)
```
4. **Compare outputs**:
```python
def compare_outputs(original_model, quantized_model, prompt):
    """Compare outputs between original and quantized."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        orig_out = original_model.generate(**inputs, max_new_tokens=50)
        quant_out = quantized_model.generate(**inputs, max_new_tokens=50)
    print("Original:", tokenizer.decode(orig_out[0]))
    print("Quantized:", tokenizer.decode(quant_out[0]))
```
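For example, assuming `original_model`, `quantized_model`, and `tokenizer` are already loaded:
```python
compare_outputs(original_model, quantized_model, "Explain HQQ quantization in one sentence.")
```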
### Perplexity Degradation
**Problem**: Significant perplexity increase after quantization
**Diagnosis**:
```python
import torch
from datasets import load_dataset
def measure_perplexity(model, tokenizer, dataset_name="wikitext", split="test"):
    """Measure model perplexity with a sliding window."""
    dataset = load_dataset(dataset_name, "wikitext-2-raw-v1", split=split)
    text = "\n\n".join(dataset["text"])
    encodings = tokenizer(text, return_tensors="pt")
    max_length = 2048
    stride = 512
    nlls = []
    for i in range(0, encodings.input_ids.size(1), stride):
        begin = max(i + stride - max_length, 0)
        end = min(i + stride, encodings.input_ids.size(1))
        input_ids = encodings.input_ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-stride] = -100  # Only score the last `stride` tokens
        with torch.no_grad():
            outputs = model(input_ids, labels=target_ids)
        nlls.append(outputs.loss)
    ppl = torch.exp(torch.stack(nlls).mean())
    return ppl.item()

# Compare
orig_ppl = measure_perplexity(original_model, tokenizer)
quant_ppl = measure_perplexity(quantized_model, tokenizer)
print(f"Original PPL: {orig_ppl:.2f}")
print(f"Quantized PPL: {quant_ppl:.2f}")
print(f"Degradation: {((quant_ppl - orig_ppl) / orig_ppl * 100):.1f}%")
```
## Integration Issues
### HuggingFace Integration Errors
**Error**: `ValueError: Unknown quantization method: hqq`
**Fix**:
```bash
# Update transformers
pip install -U "transformers>=4.36.0"
```
**Error**: `AttributeError: 'HqqConfig' object has no attribute`
**Fix**:
```python
from transformers import HqqConfig
# Use correct parameter names
config = HqqConfig(
    nbits=4,        # Not 'bits'
    group_size=64,  # Not 'groupsize'
    axis=1          # Not 'quant_axis'
)
```
### vLLM Integration Issues
**Error**: `ValueError: HQQ quantization not supported`
**Fix**:
```bash
# Update vLLM
pip install -U "vllm>=0.3.0"
```
**Usage**:
```python
from vllm import LLM
# Load pre-quantized model
llm = LLM(
    model="mobiuslabsgmbh/Llama-3.1-8B-HQQ-4bit",
    quantization="hqq"
)
```
### PEFT Integration Issues
**Error**: `RuntimeError: Cannot apply LoRA to quantized layer`
**Fix**:
```python
from peft import get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# Then apply LoRA
model = get_peft_model(model, lora_config)
```
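The snippet above assumes a `lora_config` is already defined. A minimal sketch; the target modules are illustrative for a Llama-style model and should be adapted to your architecture:
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Illustrative; adjust for your model
    task_type="CAUSAL_LM",
)
```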
## Debugging Tips
### Enable Verbose Logging
```python
import logging
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("hqq").setLevel(logging.DEBUG)
```
### Verify Quantization Applied
```python
def verify_quantization(model):
    """Check if model is properly quantized."""
    import torch
    from hqq.core.quantize import HQQLinear

    total_linear = 0
    quantized_linear = 0
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            total_linear += 1
        elif isinstance(module, HQQLinear):
            quantized_linear += 1
            print(f"Quantized: {name} ({module.W_q.dtype}, {module.W_q.shape})")
    print(f"\nUnquantized Linear: {total_linear}")
    print(f"Quantized: {quantized_linear}")
    print(f"Ratio: {quantized_linear / max(total_linear + quantized_linear, 1) * 100:.1f}%")

verify_quantization(model)
```
### Memory Profiling
```python
import torch
def profile_memory():
    """Profile GPU memory usage."""
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Reserved: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    print(f"Max Allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

# Before quantization
profile_memory()

# After quantization
model = load_quantized_model(...)
profile_memory()
```
## Getting Help
1. **GitHub Issues**: https://github.com/mobiusml/hqq/issues
2. **HuggingFace Forums**: https://discuss.huggingface.co
3. **Discord**: Check HQQ community channels
### Reporting Issues
Include:
- HQQ version: `pip show hqq`
- PyTorch version: `python -c "import torch; print(torch.__version__)"`
- CUDA version: `nvcc --version`
- GPU model: `nvidia-smi --query-gpu=name --format=csv`
- Full error traceback
- Minimal reproducible code
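A quick way to collect the version details above in one pass (a convenience sketch; it assumes `nvcc` and `nvidia-smi` are on your PATH):
```bash
pip show hqq
python -c "import torch; print(torch.__version__)"
nvcc --version
nvidia-smi --query-gpu=name --format=csv
```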
Source: claude-code-templates (MIT). See About Us for full credits.