# TensorRT-LLM Optimization Guide
Comprehensive guide to optimizing LLM inference with TensorRT-LLM.
## Quantization
### FP8 Quantization (Recommended for H100)
**Benefits**:
- 2× faster inference
- 50% memory reduction
- Minimal accuracy loss (≈0.5% higher perplexity)
## Chunked Context
**What it does**: Splits prefill of long prompts into smaller chunks, so a single large prefill does not exhaust GPU memory.
**Use when**: Prompts longer than 8K tokens with limited GPU memory.
**Configuration**:
```bash
trtllm-serve meta-llama/Meta-Llama-3-8B \
  --max_num_tokens 4096 \
  --enable_chunked_context \
  --max_chunked_prefill_length 2048  # Process 2K tokens at a time
```
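The chunking itself is simple to picture in plain Python. This toy sketch (not TensorRT-LLM code; `prompt_tokens` stands in for a real tokenized prompt) shows how a long prompt is split into fixed-size prefill steps:

```python
def chunk_prefill(prompt_tokens, chunk_size=2048):
    """Split a long prompt into fixed-size chunks, one prefill step each.

    Instead of prefilling the whole prompt at once (peak memory grows with
    len(prompt_tokens)), the engine runs ceil(n / chunk_size) smaller steps.
    """
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

# A 10K-token prompt becomes four full 2048-token chunks plus a partial one.
chunks = chunk_prefill(list(range(10_000)))
print([len(c) for c in chunks])  # [2048, 2048, 2048, 2048, 1808]
```

Peak activation memory per prefill step is now bounded by the chunk size rather than the full prompt length, at the cost of a few extra kernel launches.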
## Overlap Scheduling
**What it does**: Overlaps compute and memory operations.
**Benefits**:
- 15-25% higher throughput
- Better GPU utilization
- Default in v1.2.0+
**No configuration needed** - enabled automatically.
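A toy timing experiment shows why overlap helps. This is a pure-Python stand-in, not the TensorRT-LLM scheduler: `io_wait` fakes a memory transfer and `compute` fakes a kernel, each taking 0.2 s:

```python
import threading
import time

def io_wait():          # stand-in for a memory transfer
    time.sleep(0.2)

def compute():          # stand-in for a GPU kernel
    time.sleep(0.2)

# Serial: transfer, then compute.
start = time.perf_counter()
io_wait(); compute()
serial = time.perf_counter() - start

# Overlapped: run the transfer concurrently with the compute.
start = time.perf_counter()
t = threading.Thread(target=io_wait)
t.start()
compute()
t.join()
overlapped = time.perf_counter() - start

print(f"serial {serial:.2f}s vs overlapped {overlapped:.2f}s")
```

When the two operations use independent resources (copy engines vs. compute units on a real GPU), the overlapped wall time approaches the longer of the two rather than their sum.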
## Quantization Comparison Table
| Method | Memory | Speed | Accuracy | Use Case |
|--------|--------|-------|----------|----------|
| FP16 | 1× (baseline) | 1× | Best | High accuracy needed |
| FP8 | 0.5× | 2× | -0.5% ppl | **H100 default** |
| INT4 AWQ | 0.25× | 3-4× | -1.5% ppl | Memory critical |
| INT4 GPTQ | 0.25× | 3-4× | -2% ppl | Maximum speed |
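The memory column follows directly from bytes per weight. A back-of-the-envelope check for an ~8B-parameter model (weights only; the KV cache and activations come on top of this, and the parameter count is illustrative):

```python
PARAMS = 8e9  # ~8B weights, e.g. Meta-Llama-3-8B

BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for method, nbytes in BYTES_PER_WEIGHT.items():
    gib = PARAMS * nbytes / 2**30
    ratio = nbytes / BYTES_PER_WEIGHT["FP16"]
    print(f"{method}: {gib:5.1f} GiB weights ({ratio:.2f}x of FP16)")
```

At FP16 an 8B model needs about 15 GiB just for weights, which is why FP8 (half) or INT4 (a quarter) is often the difference between fitting and not fitting on a single GPU.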
## Tuning Workflow
1. **Start with defaults**:
```python
llm = LLM(model="meta-llama/Meta-Llama-3-70B")
```
2. **Enable FP8** (if H100):
```python
llm = LLM(model="...", dtype="fp8")
```
3. **Tune batch size**:
```bash
# Increase until OOM, then reduce 20%
trtllm-serve ... --max_batch_size 256
```
4. **Enable chunked context** (if long prompts):
```bash
--enable_chunked_context --max_chunked_prefill_length 2048
```
5. **Try speculative decoding** (if latency critical):
```python
llm = LLM(model="...", speculative_model="...")
```
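Step 3's "increase until OOM, then reduce 20%" loop can be sketched as a search. Here `fits_in_memory` is a hypothetical probe; in practice you relaunch `trtllm-serve` with each candidate `--max_batch_size` and watch for CUDA OOM errors:

```python
def find_max_batch_size(fits_in_memory, step=32, backoff=0.8):
    """Grow the batch size until the probe reports OOM, then back off ~20%.

    `fits_in_memory(bs)` -> bool is a hypothetical stand-in for launching
    the server with --max_batch_size bs and checking it stays up.
    """
    bs = step
    while fits_in_memory(bs + step):
        bs += step
    oom_at = bs + step           # first size that no longer fits
    return int(oom_at * backoff)

# Example: pretend anything up to 300 concurrent sequences fits in VRAM.
print(find_max_batch_size(lambda bs: bs <= 300))  # 256
```

The 20% headroom guards against fragmentation and request-length variance: the largest size that fits an average workload often OOMs on a worst-case batch.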
## Benchmarking
```bash
# Install benchmark tool
pip install tensorrt_llm[benchmark]

# Run benchmark
python benchmarks/python/benchmark.py \
  --model meta-llama/Meta-Llama-3-8B \
  --batch_size 64 \
  --input_len 128 \
  --output_len 256 \
  --dtype fp8
```
**Metrics to track**:
- Throughput (tokens/sec)
- Latency P50/P90/P99 (ms)
- GPU memory usage (GB)
- GPU utilization (%)
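Per-request latencies from a benchmark run can be summarized with the standard library alone; the sample numbers below are made up for illustration:

```python
import statistics

# Hypothetical per-request latencies in ms from a 100-request benchmark run.
latencies_ms = sorted(range(10, 1010, 10))  # 10, 20, ..., 1000
tokens_generated = 256 * 100                # 256 output tokens per request
wall_time_s = 30.0

# statistics.quantiles with n=100 yields 99 percentile cut points.
p50, p90, p99 = (statistics.quantiles(latencies_ms, n=100)[q - 1]
                 for q in (50, 90, 99))
throughput = tokens_generated / wall_time_s
print(f"P50={p50:.0f}ms P90={p90:.0f}ms P99={p99:.0f}ms, "
      f"{throughput:.0f} tok/s")
```

Tracking P99 alongside P50 matters because batching servers queue requests: the median can look healthy while tail requests wait behind a full batch.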
## Common Issues
**OOM errors**:
- Reduce `max_batch_size`
- Reduce `max_num_tokens`
- Enable INT4 quantization
- Increase `tensor_parallel_size`
**Low throughput**:
- Increase `max_batch_size`
- Enable in-flight batching
- Verify CUDA graphs enabled
- Check GPU utilization
**High latency**:
- Try speculative decoding
- Reduce `max_batch_size` (less queueing)
- Use FP8 instead of FP16
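Speculative decoding's latency win comes from verifying several cheap draft tokens in a single target-model pass. A toy acceptance loop (both "models" are pre-computed token lists here; a real implementation samples from both distributions and uses rejection sampling):

```python
def accept_draft(draft_tokens, target_tokens):
    """Greedy verification: keep the longest prefix where draft == target.

    Toy sketch of the accept/reject step, not TensorRT-LLM's implementation.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted.append(d)
    # The target model always contributes one token past the accepted prefix,
    # so even a full rejection still makes progress.
    accepted.append(target_tokens[len(accepted)])
    return accepted

# Draft guesses 4 tokens; the target agrees on the first 2 and adds its own.
print(accept_draft([5, 9, 3, 7], [5, 9, 4, 7, 2]))  # [5, 9, 4]
```

When the draft model's acceptance rate is high, each target-model step emits several tokens instead of one, which is why it helps latency-critical workloads most.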