Inference Serving: SGLang – Deployment
# Production Deployment Guide
Complete guide to deploying SGLang in production environments.
## Server Deployment
### Basic server
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --mem-fraction-static 0.9
```
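Once the server is running, you can smoke-test it against the OpenAI-compatible chat completions API (the prompt below is just an example):
```bash
# Quick end-to-end check of the chat completions endpoint
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'
```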
### Multi-GPU (Tensor Parallelism)
```bash
# Llama 3-70B on 4 GPUs
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --port 30000
```
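Before launching with `--tp 4`, it is worth confirming that the expected number of GPUs is visible to the process:
```bash
# --tp 4 requires at least 4 visible GPUs
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```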
### Quantization
```bash
# FP8 quantization (H100)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization fp8 \
  --tp 4

# INT4 AWQ quantization
python -m sglang.launch_server \
  --model-path TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tp 2

# INT4 GPTQ quantization
python -m sglang.launch_server \
  --model-path TheBloke/Llama-2-70B-GPTQ \
  --quantization gptq \
  --tp 2
```
## Docker Deployment
### Dockerfile
```dockerfile
FROM nvidia/cuda:12.1.0-devel-ubuntu22.04
# Install Python
RUN apt-get update && apt-get install -y python3.10 python3-pip git
# Install SGLang
RUN pip3 install "sglang[all]"
RUN pip3 install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
# Copy model (or download at runtime)
WORKDIR /app
# Expose port
EXPOSE 30000
# Start server
CMD ["python3", "-m", "sglang.launch_server",
"--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",
"--host", "0.0.0.0",
"--port", "30000"]
```
### Build and run
```bash
# Build image
docker build -t sglang:latest .
# Run with GPU
docker run --gpus all -p 30000:30000 sglang:latest
# Run with specific GPUs
docker run --gpus '"device=0,1,2,3"' -p 30000:30000 sglang:latest
# Run with custom model
docker run --gpus all -p 30000:30000 \
  -e MODEL_PATH="meta-llama/Meta-Llama-3-70B-Instruct" \
  -e TP_SIZE="4" \
  sglang:latest
```
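The `-e MODEL_PATH` / `-e TP_SIZE` variant above only takes effect if the image reads those variables at startup; the Dockerfile shown earlier hard-codes the model in `CMD`. A minimal entrypoint sketch that wires them up (the variable names match the run example, but the defaults and the `ENTRYPOINT` approach are assumptions, not an SGLang convention):
```bash
#!/usr/bin/env bash
# entrypoint.sh -- set as the image ENTRYPOINT instead of the hard-coded CMD
set -euo pipefail

MODEL_PATH="${MODEL_PATH:-meta-llama/Meta-Llama-3-8B-Instruct}"
TP_SIZE="${TP_SIZE:-1}"
PORT="${PORT:-30000}"

exec python3 -m sglang.launch_server \
  --model-path "$MODEL_PATH" \
  --tp "$TP_SIZE" \
  --host 0.0.0.0 \
  --port "$PORT"
```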
## Kubernetes Deployment
### Deployment YAML
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sglang-llama3-70b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sglang
  template:
    metadata:
      labels:
        app: sglang
    spec:
      containers:
        - name: sglang
          image: sglang:latest
          command:
            - python3
            - -m
            - sglang.launch_server
            - --model-path=meta-llama/Meta-Llama-3-70B-Instruct
            - --tp=4
            - --host=0.0.0.0
            - --port=30000
            - --mem-fraction-static=0.9
          ports:
            - containerPort: 30000
              name: http
          resources:
            limits:
              nvidia.com/gpu: 4
          livenessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 60   # increase if the model takes longer than ~1 min to load
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 30000
            initialDelaySeconds: 30
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: sglang-service
spec:
  selector:
    app: sglang
  ports:
    - port: 80
      targetPort: 30000
  type: LoadBalancer
```
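Assuming the manifest above is saved as `sglang-deployment.yaml`, a typical apply-and-verify sequence looks like this:
```bash
kubectl apply -f sglang-deployment.yaml
kubectl get pods -l app=sglang                      # wait until the pods are Ready
kubectl port-forward svc/sglang-service 8080:80 &   # forward the Service locally
curl http://localhost:8080/health                   # succeeds once the model is loaded
```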
## Monitoring
### Health checks
```bash
# Health endpoint
curl http://localhost:30000/health
# Model info
curl http://localhost:30000/v1/models
# Server stats
curl http://localhost:30000/stats
```
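In startup or CI scripts it is often useful to block until the server reports healthy before sending traffic; a simple polling loop:
```bash
# Wait for the /health endpoint to return success
until curl -sf http://localhost:30000/health > /dev/null; do
  echo "waiting for SGLang to become healthy..."
  sleep 5
done
echo "SGLang is up"
```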
### Prometheus metrics
```bash
# Start server with metrics
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-metrics
# Metrics endpoint
curl http://localhost:30000/metrics
# Key metrics:
# - sglang_request_total
# - sglang_request_duration_seconds
# - sglang_tokens_generated_total
# - sglang_active_requests
# - sglang_queue_size
# - sglang_radix_cache_hit_rate
# - sglang_gpu_memory_used_bytes
```
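During a load test, a simple way to keep an eye on queue depth and concurrency is to poll the metrics endpoint (metric names taken from the list above; verify them against the actual `/metrics` output of your SGLang version):
```bash
# Refresh the key gauges every 5 seconds
watch -n 5 "curl -s http://localhost:30000/metrics | grep -E 'sglang_(active_requests|queue_size)'"
```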
### Logging
```bash
# Enable debug logging
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --log-level debug

# Log to file
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --log-file /var/log/sglang.log
```
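In containerized or supervised setups it is also common to leave logs on stdout and let the platform collect them; on bare metal, stdout can simply be captured, which works even if your build lacks `--log-file`:
```bash
# Capture stdout/stderr to a file while still printing to the console
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  2>&1 | tee /var/log/sglang.log
```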
## Load Balancing
### NGINX configuration
```nginx
upstream sglang_backend {
    least_conn;  # Route to the least busy instance
    server sglang-1:30000 max_fails=3 fail_timeout=30s;
    server sglang-2:30000 max_fails=3 fail_timeout=30s;
    server sglang-3:30000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;

    location / {
        proxy_pass http://sglang_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_connect_timeout 10s;

        # Required for token streaming
        proxy_buffering off;
        proxy_cache off;
    }

    location /metrics {
        proxy_pass http://sglang_backend/metrics;
    }
}
```
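After editing the upstream list, validate the configuration and reload NGINX without dropping in-flight connections:
```bash
nginx -t          # check the configuration for syntax errors
nginx -s reload   # apply it with a graceful reload
```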
## Autoscaling
### HPA based on GPU utilization
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sglang-llama3-70b
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: nvidia_gpu_duty_cycle
        target:
          type: AverageValue
          averageValue: "80"   # Scale out when average GPU utilization exceeds 80%
```
### HPA based on active requests
```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: sglang_active_requests
      target:
        type: AverageValue
        averageValue: "50"   # Scale out when >50 active requests per pod
```
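Both examples use Pods-type custom metrics, which the HPA can only read if a metrics adapter (for example the Prometheus Adapter) exposes them through the custom metrics API; that adapter setup is assumed here. Once the HPA is applied, you can watch its scaling decisions:
```bash
# Observe replica counts and metric values as load changes
kubectl get hpa sglang-hpa --watch
```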
## Performance Tuning
### Memory optimization
```bash
# Reduce memory usage: 85% of GPU memory for the static pool, radix cache capped at 8K tokens
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-70B-Instruct \
  --tp 4 \
  --mem-fraction-static 0.85 \
  --max-radix-cache-len 8192
```
### Throughput optimization
```bash
# Maximize throughput: more memory for batching, a larger cache, and more concurrent requests
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --mem-fraction-static 0.95 \
  --max-radix-cache-len 16384 \
  --max-running-requests 256
```
### Latency optimization
```bash
# Minimize latency: fewer concurrent requests (less queueing), first-come first-served scheduling
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --max-running-requests 32 \
  --schedule-policy fcfs
```
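A rough way to see how these settings behave under load is to fire a burst of parallel requests; the payload and concurrency below are placeholders, and a real load test should mirror your production traffic:
```bash
# 64 concurrent chat completions against the local server
seq 64 | xargs -P 64 -I{} curl -s -o /dev/null \
  http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```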
## Multi-Node Deployment
### Ray cluster setup
```bash
# Head node
ray start --head --port=6379
# Worker nodes
ray start --address='head-node:6379'
# Launch server across cluster
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-405B-Instruct \
  --tp 8 \
  --num-nodes 2  # Use 2 nodes (8 GPUs each)
```
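Before launching the server, confirm that all worker nodes have actually joined the cluster:
```bash
# Should list every node and its GPUs
ray status
```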
## Security
### API authentication
```bash
# Start with API key
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key YOUR_SECRET_KEY

# Client request
curl http://localhost:30000/v1/chat/completions \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [...]}'
```
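To keep the key out of shell history and version-controlled scripts, it can be loaded from a secrets file or environment variable at launch time (the secrets path and variable name below are just examples):
```bash
# Read the API key from a secrets file instead of hard-coding it
export SGLANG_API_KEY="$(cat /run/secrets/sglang_api_key)"
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key "$SGLANG_API_KEY"
```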
### Network policies (Kubernetes)
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sglang-policy
spec:
  podSelector:
    matchLabels:
      app: sglang
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway   # Only allow traffic from the gateway
      ports:
        - protocol: TCP
          port: 30000
```
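A quick way to verify the policy is to curl the service from a throwaway pod that does not carry the `app: api-gateway` label; the request should time out, while the same call from a gateway pod succeeds:
```bash
# Expected to time out after 5s, since the test pod is not labeled api-gateway
kubectl run np-test --rm -it --image=curlimages/curl --restart=Never -- \
  curl -m 5 http://sglang-service/health
```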
## Troubleshooting
### High memory usage
**Check**:
```bash
nvidia-smi
curl http://localhost:30000/stats | grep cache
```
**Solutions**:
```bash
# Reduce cache size
--max-radix-cache-len 4096
# Reduce memory fraction
--mem-fraction-static 0.75
# Enable quantization
--quantization fp8
```
### Low throughput
**Check**:
```bash
curl http://localhost:30000/stats | grep queue_size
```
**Solutions**:
```bash
# Increase batch size
--max-running-requests 256
# Add more GPUs
--tp 4 # Increase tensor parallelism
# Check cache hit rate (should be >70%)
curl http://localhost:30000/stats | grep cache_hit_rate
```
### High latency
**Check**:
```bash
curl http://localhost:30000/metrics | grep duration
```
**Solutions**:
```bash
# Reduce concurrent requests
--max-running-requests 32
# Use FCFS scheduling (no batching delay)
--schedule-policy fcfs
# Add more replicas (horizontal scaling)
```
### OOM errors
**Solutions**:
```bash
# Reduce batch size
--max-running-requests 128
# Reduce cache
--max-radix-cache-len 2048
# Enable quantization
--quantization awq
# Increase tensor parallelism
--tp 8
```
## Best Practices
1. **Use RadixAttention** - Enabled by default, 5-10× speedup for agents
2. **Monitor cache hit rate** - Target >70% for agent/few-shot workloads
3. **Set health checks** - Use `/health` endpoint for k8s probes
4. **Enable metrics** - Monitor with Prometheus + Grafana
5. **Use load balancing** - Distribute load across replicas
6. **Tune memory** - Start with `--mem-fraction-static 0.9`, adjust based on OOM
7. **Use quantization** - FP8 on H100, AWQ/GPTQ on A100
8. **Set up autoscaling** - Scale based on GPU utilization or active requests
9. **Log to persistent storage** - Use `--log-file` for debugging
10. **Test before production** - Run load tests with expected traffic patterns
## Cost Optimization
### GPU selection
**A100 80GB** ($3-4/hour):
- Llama 3-70B with FP8 (TP=4)
- Throughput: 10,000-15,000 tok/s
- Cost per 1M tokens: $0.20-0.30
**H100 80GB** ($6-8/hour):
- Llama 3-70B with FP8 (TP=4)
- Throughput: 20,000-30,000 tok/s
- Cost per 1M tokens: $0.15-0.25 (2× faster)
**L4** ($0.50-1/hour):
- Llama 3-8B
- Throughput: 1,500-2,500 tok/s
- Cost per 1M tokens: $0.20-0.40
### Batching for cost efficiency
**Low batch (batch=1)**:
- Throughput: 1,000 tok/s (≈3.6M tok/hour)
- Cost: $3/hour ÷ 3.6M tok/hour ≈ $0.83/M tokens
**High batch (batch=128)**:
- Throughput: 8,000 tok/s (≈28.8M tok/hour)
- Cost: $3/hour ÷ 28.8M tok/hour ≈ $0.10/M tokens
- **8× cost reduction**
**Recommendation**: Target batch size 64-256 for optimal cost/latency.
Source: claude-code-templates (MIT). See About Us for full credits.