Inference Serving with vLLM – Troubleshooting
# Troubleshooting Guide
## Contents
- Out of memory (OOM) errors
- Performance issues
- Model loading errors
- Network and connection issues
- Quantization problems
- Distributed serving issues
- Debugging tools and commands
## Out of memory (OOM) errors
### Symptom: `torch.cuda.OutOfMemoryError` during model loading
**Cause**: Model + KV cache exceeds available VRAM
**Solutions (try in order)**:
1. **Reduce GPU memory utilization**:
```bash
vllm serve MODEL --gpu-memory-utilization 0.7 # Try 0.7, 0.75, 0.8
```
2. **Reduce max sequence length**:
```bash
vllm serve MODEL --max-model-len 4096 # Instead of 8192
```
3. **Enable quantization**:
```bash
vllm serve MODEL --quantization awq # ~4x smaller weights (4-bit vs. fp16)
```
4. **Use tensor parallelism** (multiple GPUs):
```bash
vllm serve MODEL --tensor-parallel-size 2 # Split across 2 GPUs
```
5. **Reduce max concurrent sequences**:
```bash
vllm serve MODEL --max-num-seqs 128 # Default is 256
```
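To see which knob matters before retrying, estimate where the memory goes. A back-of-envelope sketch (assumes fp16/bf16 weights at 2 bytes each and ignores activation and CUDA-graph overhead; the model shape below is illustrative):
```python
# Back-of-envelope VRAM estimate: weights + worst-case KV cache.
# Model shape below is illustrative (Llama-3-8B-like); adjust to yours.
GIB = 1024 ** 3

def estimate(params_b, n_layers, n_kv_heads, head_dim, max_len, max_seqs):
    weights = params_b * 1e9 * 2                  # fp16: 2 bytes per param
    # KV cache per token: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * 2
    kv_worst = kv_per_token * max_len * max_seqs  # all seqs at max length
    print(f"weights:         {weights / GIB:6.1f} GiB")
    print(f"KV (worst case): {kv_worst / GIB:6.1f} GiB")

# 8B params, 32 layers, 8 KV heads (GQA), head_dim 128, 8k ctx, 256 seqs
estimate(8, 32, 8, 128, 8192, 256)
```
If the KV term dominates, reduce `--max-model-len` or `--max-num-seqs`; if the weights dominate, quantize or shard across GPUs.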
### Symptom: OOM during inference (not model loading)
**Cause**: KV cache fills up during generation
**Solutions**:
```bash
# Lower gpu-memory-utilization (default 0.9) to leave more headroom
vllm serve MODEL --gpu-memory-utilization 0.85
# Reduce batch size
vllm serve MODEL --max-num-seqs 64
# Reduce max tokens per request
# Set in client request: max_tokens=512
```
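On the client side, `max_tokens` is set per request. A minimal example against the OpenAI-compatible endpoint (the model name is a placeholder for whatever you serve):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="MODEL",   # placeholder: use the served model name
    prompt="Explain KV caching in one paragraph.",
    max_tokens=512,  # cap generation length to bound KV cache growth
)
print(resp.choices[0].text)
```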
### Symptom: OOM with quantized model
**Cause**: Quantization overhead or incorrect configuration
**Solution**:
```bash
# Ensure quantization flag matches model
vllm serve TheBloke/Llama-2-70B-AWQ --quantization awq # Must specify
# Try different dtype
vllm serve MODEL --quantization awq --dtype float16
```
## Performance issues
### Symptom: Low throughput (<100 tokens/sec)
**Diagnostic steps**:
1. **Check GPU utilization**:
```bash
watch -n 1 nvidia-smi
# GPU utilization should be >80%
```
If <80%, increase concurrent requests:
```bash
vllm serve MODEL --max-num-seqs 512 # Increase from 256
```
2. **Check if memory-bound**:
```bash
# If memory usage is at 100% but GPU utilization is below ~80%,
# generation is memory-bandwidth-bound: quantize or use a smaller model
nvidia-smi
```
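To put a number on throughput before and after tuning, a small client-side load test is enough. A sketch against the OpenAI-compatible API (request count, concurrency, and model name are illustrative):
```python
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def one_request(_):
    resp = client.completions.create(
        model="MODEL", prompt="Write a haiku about GPUs.", max_tokens=128
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:    # 32 concurrent requests
    tokens = sum(pool.map(one_request, range(64)))  # 64 requests total
print(f"{tokens / (time.time() - start):.0f} tokens/sec aggregate")
```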
### Symptom: High time to first token (TTFT > 1 second)
**Causes and solutions**:
**Long prompts**:
```bash
vllm serve MODEL --enable-chunked-prefill
```
**No prefix caching**:
```bash
vllm serve MODEL --enable-prefix-caching # For repeated prompts
```
**Too many concurrent requests**:
```bash
vllm serve MODEL --max-num-seqs 64 # Reduce to prioritize latency
```
**Model too large for single GPU**:
```bash
vllm serve MODEL --tensor-parallel-size 2 # Parallelize prefill
```
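To confirm where the time goes, measure TTFT directly with a streaming request; the same loop also yields the decode rate relevant to the next symptom. A sketch that assumes each streamed chunk carries roughly one token:
```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.time()
first = None
n_chunks = 0
stream = client.completions.create(
    model="MODEL", prompt="Summarize vLLM in two sentences.",
    max_tokens=128, stream=True,
)
for chunk in stream:
    now = time.time()
    first = first or now  # timestamp of the first streamed token
    n_chunks += 1
print(f"TTFT: {first - start:.2f}s")
print(f"decode: {(n_chunks - 1) / (time.time() - first):.1f} tokens/sec")
```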
### Symptom: Slow token generation (low tokens/sec)
**Diagnostic**:
```bash
# Confirm the right checkpoint loaded: startup logs report the model
# name and how much memory the weights took
vllm serve MODEL
# Try speculative decoding with a smaller draft model to speed up decoding
vllm serve MODEL --speculative-model DRAFT_MODEL
```
**For H100 GPUs**, enable FP8:
```bash
vllm serve MODEL --quantization fp8
```
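FP8 requires recent hardware. As a quick sanity check (assumes PyTorch can see the GPU): Hopper (H100) reports compute capability 9.0, and Ada-class GPUs report 8.9; both have FP8 units.
```python
import torch

# Hopper (H100) -> (9, 0); Ada (e.g., L40S) -> (8, 9); both support FP8
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}:",
      "FP8-capable" if (major, minor) >= (8, 9) else "prefer AWQ/GPTQ instead")
```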
## Model loading errors
### Symptom: `OSError: MODEL not found`
**Causes**:
1. **Model name typo**:
```bash
# Check exact model name on HuggingFace
vllm serve meta-llama/Meta-Llama-3-8B-Instruct # Exact repo id and capitalization
```
2. **Private/gated model**:
```bash
# Login to HuggingFace first
huggingface-cli login
# Then run vLLM
vllm serve meta-llama/Meta-Llama-3-70B-Instruct
```
3. **Custom model needs trust flag**:
```bash
vllm serve MODEL --trust-remote-code
```
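To tell a typo apart from a gating problem without launching the server, query the Hub directly. A sketch using `huggingface_hub` (a vLLM dependency); the repo id is an example:
```python
from huggingface_hub import model_info
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

repo = "meta-llama/Meta-Llama-3-8B-Instruct"  # example repo id
try:
    print("found:", model_info(repo).id)
except GatedRepoError:
    print("gated model: accept the license on the Hub, then huggingface-cli login")
except RepositoryNotFoundError:
    print("repo not found: check spelling/capitalization, or it is private")
```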
### Symptom: `ValueError: Tokenizer not found`
**Solution**:
```bash
# Download model manually first
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('MODEL')"
# Then launch vLLM
vllm serve MODEL
```
### Symptom: `ImportError: No module named 'flash_attn'`
**Solution**:
```bash
# Install flash attention
pip install flash-attn --no-build-isolation
# Or fall back to a different attention backend (e.g., xFormers)
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve MODEL
```
## Network and connection issues
### Symptom: `Connection refused` when querying server
**Diagnostic**:
1. **Check server is running** (see the polling sketch after this list):
```bash
curl http://localhost:8000/health
```
2. **Check port binding**:
```bash
# Bind to all interfaces for remote access
vllm serve MODEL --host 0.0.0.0 --port 8000
# Check if port is in use
lsof -i :8000
```
3. **Check firewall**:
```bash
# Allow port through firewall
sudo ufw allow 8000
```
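To automate step 1 in deployment scripts, poll `/health` until the server answers; model loading can take minutes, so a one-shot `curl` may simply be too early. A minimal sketch using `requests` (timeout values are arbitrary):
```python
import time

import requests

def wait_for_server(url="http://localhost:8000/health", timeout_s=300.0):
    """Poll the health endpoint until vLLM answers or we give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # not listening yet; the model may still be loading
        time.sleep(2)
    return False

print("ready" if wait_for_server() else "gave up waiting")
```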
### Symptom: Slow response times over network
**Solutions**:
1. **Increase timeout**:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
    timeout=300.0,  # 5 minute timeout for long generations
)
```
2. **Check network latency**:
```bash
ping SERVER_IP  # round-trip time should be low (a few ms on a LAN)
```
## Debugging tools and commands
When filing a bug report, capture the full server output:
```bash
vllm serve MODEL 2>&1 | tee vllm_debug.log
# Include in bug report:
# - vllm_debug.log
# - nvidia-smi output
# - Full command used
# - Expected vs actual behavior
```
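Exact versions make a bug report actionable. A sketch that collects them (assumes `torch` and `vllm` are importable in the serving environment):
```python
import torch
import vllm

# Versions and hardware details worth pasting into any bug report
print("vllm :", vllm.__version__)
print("torch:", torch.__version__, "| CUDA", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU  :", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM : {free / 2**30:.1f} GiB free / {total / 2**30:.1f} GiB total")
```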
Source: claude-code-templates (MIT).