Inference Serving: SGLang – Deployment
# Production Deployment Guide
Complete guide to deploying SGLang in production environments.
## Server Deployment
### Basic server
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.9
```
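Once the server is running, you can smoke-test it against the OpenAI-compatible chat completions API (the prompt below is just an example):
```bash
# Quick end-to-end check of the chat completions endpoint
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```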
### Multi-GPU (Tensor Parallelism)
```bash
# Llama 3-70B on 4 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --port 30000
```
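Before launching with `--tp 4`, it is worth confirming that the expected number of GPUs is visible to the process:
```bash
# --tp 4 requires at least 4 visible GPUs
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```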
### Quantization
```bash
# FP8 quantization (H100)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --tp 4

# INT4 AWQ quantization
python -m sglang.launch_server \
  --model-path TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tp 2

# INT4 GPTQ quantization
python -m sglang.launch_server \
  --model-path TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq \
  --tp 2
```
## Docker Deployment
### Dockerfile
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip git
# Install SGLang
RUN pip3 install "sglang[all]"
RUN pip3 install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# Copy model (or download at runtime)
WORKDIR /app
# Expose port
EXPOSE 30000
# Start server
CMD ["python3", "-m", "sglang.launch_server",
"--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",
"--host", "0.0.0.0",
"--port", "30000"]
```
### Build and run
```bash
# Build image
docker build -t sglang:latest .
# Run with GPU
docker run --gpus all -p 30000:30000 sglang:latest
# Run with specific GPUs
docker run --gpus '"device=0,1,2,3"' -p 30000:30000 sglang:latest
# Run with custom model
docker run --gpus all -p 30000:30000 \
  -e MODEL_PATH="meta-llama/Meta-Llama-3-70B-Instruct" \
  -e TP_SIZE="4" \
  sglang:latest
```
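The `-e MODEL_PATH` / `-e TP_SIZE` variant above only takes effect if the image reads those variables at startup; the Dockerfile shown earlier hard-codes the model in `CMD`. A minimal entrypoint sketch that wires them up (the variable names match the run example, but the defaults and the `ENTRYPOINT` approach are assumptions, not an SGLang convention):
```bash
#!/usr/bin/env bash
# entrypoint.sh -- set as the image ENTRYPOINT instead of the hard-coded CMD
set -euo pipefail

MODEL_PATH="${MODEL_PATH:-meta-llama/Meta-Llama-3-8B-Instruct}"
TP_SIZE="${TP_SIZE:-1}"
PORT="${PORT:-30000}"

exec python3 -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --tp "$TP_SIZE" \
  --host 0.0.0.0 \
  --port "$PORT"
```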
## Kubernetes Deployment
### Deployment YAML
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-llama3-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
        - name: sglang
          image: sglang:latest
          command:
            - python3
            - -m
            - sglang.launch_server
            - --model-path=meta-llama/Meta-Llama-3-70B-Instruct
            - --tp=4
            - --host=0.0.0.0
            - --port=30000
            - --mem-fraction-static=0.9
          ports:
            - containerPort: 30000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 4
          livenessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 60   # increase if the model takes longer than ~1 min to load
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 30
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
spec:
  selector:
    app: sglang
  ports:
    - port: 80
      targetPort: 30000
  type: LoadBalancer
```
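Assuming the manifest above is saved as `sglang-deployment.yaml`, a typical apply-and-verify sequence looks like this:
```bash
kubectl apply -f sglang-deployment.yaml
kubectl get pods -l app=sglang                      # wait until the pods are Ready
kubectl port-forward svc/sglang-service 8080:80 &   # forward the Service locally
curl http://localhost:8080/health                   # succeeds once the model is loaded
```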
## Monitoring
### Health checks
```bash
# Health endpoint
curl http://localhost:30000/health
# Model info
curl http://localhost:30000/v1/models
# Server stats
curl http://localhost:30000/stats
```
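In startup or CI scripts it is often useful to block until the server reports healthy before sending traffic; a simple polling loop:
```bash
# Wait for the /health endpoint to return success
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "waiting for SGLang to become healthy..."
  sleep 5
done
echo "SGLang is up"
```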
### Prometheus metrics
```bash
# Start server with metrics
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-metrics
# Metrics endpoint
curl http://localhost:30000/metrics
# Key metrics:
# - sglang_request_total
# - sglang_request_duration_seconds
# - sglang_tokens_generated_total
# - sglang_active_requests
# - sglang_queue_size
# - sglang_radix_cache_hit_rate
# - sglang_gpu_memory_used_bytes
```
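During a load test, a simple way to keep an eye on queue depth and concurrency is to poll the metrics endpoint (metric names taken from the list above; verify them against the actual `/metrics` output of your SGLang version):
```bash
# Refresh the key gauges every 5 seconds
watch -n 5 "curl -s http://localhost:30000/metrics | grep -E 'sglang_(active_requests|queue_size)'"
```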
### Logging
```bash
# Enable debug logging
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --log-level debug

# Log to file
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --log-file /var/log/sglang.log
```
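In containerized or supervised setups it is also common to leave logs on stdout and let the platform collect them; on bare metal, stdout can simply be captured, which works even if your build lacks `--log-file`:
```bash
# Capture stdout/stderr to a file while still printing to the console
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  2>&1 | tee /var/log/sglang.log
```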
## Load Balancing
### NGINX configuration
```nginx
upstream sglang_backend {
    least_conn;  # Route to the least busy instance
    server sglang-1:30000 max_fails=3 fail_timeout=30s;
    server sglang-2:30000 max_fails=3 fail_timeout=30s;
    server sglang-3:30000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://sglang_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;

        # Required for token streaming
        proxy_buffering off;
        proxy_cache off;
    }

    location /metrics {
        proxy_pass http://sglang_backend/metrics;
    }
}
```
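After editing the upstream list, validate the configuration and reload NGINX without dropping in-flight connections:
```bash
nginx -t          # check the configuration for syntax errors
nginx -s reload   # apply it with a graceful reload
```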
## Autoscaling
### HPA based on GPU utilization
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-llama3-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: nvidia_gpu_duty_cycle
        target:
          type: AverageValue
          averageValue: "80"   # Scale out when average GPU utilization exceeds 80%
```
### HPA based on active requests
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: sglang_active_requests
      target:
        type: AverageValue
        averageValue: "50"   # Scale out when >50 active requests per pod
```
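Both examples use Pods-type custom metrics, which the HPA can only read if a metrics adapter (for example the Prometheus Adapter) exposes them through the custom metrics API; that adapter setup is assumed here. Once the HPA is applied, you can watch its scaling decisions:
```bash
# Observe replica counts and metric values as load changes
kubectl get hpa sglang-hpa --watch
```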
## Performance Tuning
### Memory optimization
```bash
# Reduce memory usage: 85% of GPU memory for the static pool, radix cache capped at 8K tokens
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --max-radix-cache-len 8192
```
### Throughput optimization
```bash
# Maximize throughput: more memory for batching, a larger cache, and more concurrent requests
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --mem-fraction-static 0.95 \
  --max-radix-cache-len 16384 \
  --max-running-requests 256
```
### Latency optimization
```bash
# Minimize latency: fewer concurrent requests (less queueing), first-come first-served scheduling
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --max-running-requests 32 \
  --schedule-policy fcfs
```
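A rough way to see how these settings behave under load is to fire a burst of parallel requests; the payload and concurrency below are placeholders, and a real load test should mirror your production traffic:
```bash
# 64 concurrent chat completions against the local server
seq 64 | xargs -P 64 -I{} curl -s -o /dev/null \
  http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```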
## Multi-Node Deployment
### Ray cluster setup
```bash
# Head node
ray start --head --port=6379
# Worker nodes
ray start --address='head-node:6379'
# Launch server across cluster
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-405B-Instruct \
  --tp 8 \
  --num-nodes 2  # Use 2 nodes (8 GPUs each)
```
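Before launching the server, confirm that all worker nodes have actually joined the cluster:
```bash
# Should list every node and its GPUs
ray status
```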
## Security
### API authentication
```bash
# Start with API key
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key YOUR_SECRET_KEY

# Client request
curl http://localhost:30000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [...]}'
```
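To keep the key out of shell history and version-controlled scripts, it can be loaded from a secrets file or environment variable at launch time (the secrets path and variable name below are just examples):
```bash
# Read the API key from a secrets file instead of hard-coding it
export SGLANG_API_KEY="$(cat /run/secrets/sglang_api_key)"
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key "$SGLANG_API_KEY"
```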
### Network policies (Kubernetes)
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sglang-policy
spec:
  podSelector:
    matchLabels:
      app: sglang
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway   # Only allow traffic from the gateway
      ports:
        - protocol: TCP
          port: 30000
```
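A quick way to verify the policy is to curl the service from a throwaway pod that does not carry the `app: api-gateway` label; the request should time out, while the same call from a gateway pod succeeds:
```bash
# Expected to time out after 5s, since the test pod is not labeled api-gateway
kubectl run np-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -m 5 http://sglang-service/health
```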
## Troubleshooting
### High memory usage
**Check**:
```bash
nvidia-smi
curl http://localhost:30000/stats | grep cache
```
**Solutions**:
```bash
# Reduce cache size
--max-radix-cache-len 4096
# Reduce memory fraction
--mem-fraction-static 0.75
# Enable quantization
--quantization fp8
```
### Low throughput
**Check**:
```bash
curl http://localhost:30000/stats | grep queue_size
```
**Solutions**:
```bash
# Increase batch size
--max-running-requests 256
# Add more GPUs
--tp 4 # Increase tensor parallelism
# Check cache hit rate (should be >70%)
curl http://localhost:30000/stats | grep cache_hit_rate
```
### High latency
**Check**:
```bash
curl http://localhost:30000/metrics | grep duration
```
**Solutions**:
```bash
# Reduce concurrent requests
--max-running-requests 32
# Use FCFS scheduling (no batching delay)
--schedule-policy fcfs
# Add more replicas (horizontal scaling)
```
### OOM errors
**Solutions**:
```bash
# Reduce batch size
--max-running-requests 128
# Reduce cache
--max-radix-cache-len 2048
# Enable quantization
--quantization awq
# Increase tensor parallelism
--tp 8
```
## Best Practices
1. **Use RadixAttention** - Enabled by default, 5-10× speedup for agents
2. **Monitor cache hit rate** - Target >70% for agent/few-shot workloads
3. **Set health checks** - Use `/health` endpoint for k8s probes
4. **Enable metrics** - Monitor with Prometheus + Grafana
5. **Use load balancing** - Distribute load across replicas
6. **Tune memory** - Start with `--mem-fraction-static 0.9`, adjust based on OOM
7. **Use quantization** - FP8 on H100, AWQ/GPTQ on A100
8. **Set up autoscaling** - Scale based on GPU utilization or active requests
9. **Log to persistent storage** - Use `--log-file` for debugging
10. **Test before production** - Run load tests with expected traffic patterns
## Cost Optimization
### GPU selection
**A100 80GB** ($3-4/hour):
- Llama 3-70B with FP8 (TP=4)
- Throughput: 10,000-15,000 tok/s
- Cost per 1M tokens: $0.20-0.30
**H100 80GB** ($6-8/hour):
- Llama 3-70B with FP8 (TP=4)
- Throughput: 20,000-30,000 tok/s
- Cost per 1M tokens: $0.15-0.25 (2× faster)
**L4** ($0.50-1/hour):
- Llama 3-8B
- Throughput: 1,500-2,500 tok/s
- Cost per 1M tokens: $0.20-0.40
### Batching for cost efficiency
**Low batch (batch=1)**:
- Throughput: 1,000 tok/s (≈3.6M tok/hour)
- Cost: $3/hour ÷ 3.6M tok/hour ≈ $0.83/M tokens
**High batch (batch=128)**:
- Throughput: 8,000 tok/s (≈28.8M tok/hour)
- Cost: $3/hour ÷ 28.8M tok/hour ≈ $0.10/M tokens
- **8× cost reduction**
**Recommendation**: Target batch size 64-256 for optimal cost/latency.
Source: claude-code-templates (MIT). See About Us for full credits.