# SkyPilot Troubleshooting Guide
## Installation Issues
### Cloud credentials not found
**Error**: `sky check` shows clouds as disabled
**Solutions**:
```bash
# AWS
aws configure
# Verify: aws sts get-caller-identity
# GCP
gcloud auth application-default login
# Verify: gcloud auth list
# Azure
az login
az account set -s SUBSCRIPTION_ID
# Kubernetes
export KUBECONFIG=~/.kube/config
kubectl get nodes
# Re-check after configuration
sky check
```
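`sky check` also accepts provider names, so after fixing one set of credentials you can re-verify just that cloud:
```bash
# Re-check a single provider
sky check aws
sky check kubernetes
```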
### Permission errors
**Error**: `PermissionError` or `AccessDenied`
**Solutions**:
```bash
# AWS: Ensure IAM permissions include EC2, S3, IAM
# Required policies: AmazonEC2FullAccess, AmazonS3FullAccess, IAMFullAccess
# GCP: Ensure roles include Compute Admin, Storage Admin
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:[email protected]" \
  --role="roles/compute.admin"
# Azure: Ensure Contributor role on subscription
az role assignment create \
  --assignee [email protected] \
  --role Contributor \
  --scope /subscriptions/SUBSCRIPTION_ID
```
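After adjusting permissions, confirm which identity each CLI is actually using; these are the standard identity commands for each provider:
```bash
# Confirm the active identity per provider
aws sts get-caller-identity
gcloud auth list
az account show
```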
## Cluster Launch Issues
### Quota exceeded
**Error**: `Quota exceeded for resource`
**Solutions**:
```yaml
# Try a different region
resources:
  accelerators: A100:8
  any_of:
    - cloud: gcp
      region: us-west1
    - cloud: gcp
      region: europe-west4
    - cloud: aws
      region: us-east-1
# Or request a quota increase from the cloud provider
```
```bash
# Check quota before launching
sky show-gpus --cloud gcp
```
### GPU not available
**Error**: `No resources available in region`
**Solutions**:
```yaml
# Use fallback accelerators, in order of preference
resources:
  accelerators:
    H100: 8
    A100-80GB: 8
    A100: 8
  any_of:
    - cloud: gcp
    - cloud: aws
    - cloud: azure
```
```bash
# Check GPU availability
sky show-gpus A100
sky show-gpus --cloud aws
```
### Instance type not found
**Error**: `Instance type 'xyz' not found`
**Solutions**:
```yaml
# Let SkyPilot choose the instance type automatically
resources:
  accelerators: A100:8
  cpus: 96+
  memory: 512+
# Don't specify instance_type unless necessary
```
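To see which instance type the optimizer would select, without actually provisioning anything, a dry run helps:
```bash
# Preview the optimizer's plan without launching
sky launch task.yaml --dryrun
```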
### Cluster stuck in INIT
**Error**: Cluster stays in INIT state
**Solutions**:
```bash
# Check cluster logs
sky logs mycluster --status
# SSH and check manually
ssh mycluster
ray status  # SkyPilot runs tasks via Ray; check that it is healthy
# Terminate and retry
sky down mycluster
sky launch -c mycluster task.yaml
```
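If the instances are actually up but the local state is stale, refreshing the cached status from the cloud often clears an apparent INIT hang:
```bash
# Re-query the cloud provider for the real cluster state
sky status --refresh
```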
## Setup Command Issues
### Setup script fails
**Error**: Setup commands fail during provisioning
**Solutions**:
```yaml
# Add error handling and retries
setup: |
  set -e  # Exit on error
  # Retry pip installs up to 3 times
  for i in {1..3}; do
    pip install torch transformers && break
    echo "Retry $i..."
    sleep 10
  done
  # Verify installation
  python -c "import torch; print(torch.__version__)"
```
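For longer setup scripts, the retry pattern is worth factoring into a helper. This is a plain-shell sketch, not a SkyPilot feature; `retry` is a hypothetical function name:
```bash
# Hypothetical retry helper for flaky network operations
retry() {
  local attempt
  for attempt in 1 2 3; do
    "$@" && return 0  # Success: stop retrying
    echo "Attempt $attempt failed: $*; retrying in 10s..." >&2
    sleep 10
  done
  return 1  # All attempts failed
}

retry pip install torch transformers
```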
### Conda environment issues
**Error**: Conda not found or environment issues
**Solutions**:
```yaml
setup: |
  # ~/.bashrc usually exits early in non-interactive shells,
  # so load conda's shell hook directly
  source ~/miniconda3/etc/profile.d/conda.sh
  conda create -n myenv python=3.10 -y
  conda activate myenv
```
### CUDA version mismatch
**Error**: `CUDA driver version is insufficient`
**Solutions**:
```yaml
setup: |
  # Install a torch build that matches the cluster's CUDA driver
  pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  # Verify CUDA
  python -c "import torch; print(torch.cuda.is_available())"
```
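This error means the CUDA runtime bundled in the wheel is newer than what the node's driver supports; compare the two before pinning a torch build:
```bash
# Driver version on the node
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# CUDA runtime the installed torch wheel was built against
python -c "import torch; print(torch.version.cuda)"
```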
## Distributed Training Issues
### Nodes can't communicate
**Error**: Connection refused between nodes
**Solutions**:
```yaml
run: |
  # Debug: print all node IPs
  echo "All nodes: $SKYPILOT_NODE_IPS"
  echo "My rank: $SKYPILOT_NODE_RANK"
  # Wait for all nodes to be ready (see the polling sketch below)
  sleep 30
  # The first IP in the list is the head node / master
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  echo "Master: $MASTER_ADDR"
```
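A fixed `sleep` is fragile on slow provisioning. Polling until the head node answers is more robust; a sketch using standard tools, suitable for the same `run` block:
```bash
# Poll until the head node responds instead of sleeping blindly
MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
until ping -c 1 -W 1 "$MASTER_ADDR" >/dev/null 2>&1; do
  echo "Waiting for $MASTER_ADDR..."
  sleep 2
done
```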
### torchrun fails
**Error**: `torch.distributed` errors
**Solutions**:
```yaml
run: |
  # Surface NCCL details while debugging
  export NCCL_DEBUG=INFO
  export NCCL_IB_DISABLE=1  # Try this if InfiniBand is the problem
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:12355 \
    train.py
```
### DeepSpeed hostfile errors
**Error**: `Invalid hostfile` or connection errors
**Solutions**:
```yaml
run: |
  # Build a hostfile in DeepSpeed's "<ip> slots=<n>" format
  echo "$SKYPILOT_NODE_IPS" | while read -r ip; do
    echo "$ip slots=$SKYPILOT_NUM_GPUS_PER_NODE"
  done > /tmp/hostfile
  cat /tmp/hostfile  # Debug
  deepspeed --hostfile=/tmp/hostfile train.py
```
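DeepSpeed's multi-node launcher also requires passwordless SSH from the head node to every worker. A quick connectivity check, reusing the same environment variable:
```bash
# Verify the head node can reach every worker non-interactively
echo "$SKYPILOT_NODE_IPS" | while read -r ip; do
  ssh -o BatchMode=yes -o StrictHostKeyChecking=no "$ip" hostname \
    || echo "Cannot reach $ip"
done
```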
## File Mount Issues
### Mount fails
**Error**: `Failed to mount storage`
**Solutions**:
```yaml
# Verify the bucket exists and credentials are valid
file_mounts:
  /data:
    source: s3://my-bucket/data
    mode: MOUNT
# Confirm bucket access with the cloud CLI (see below)
```
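Before launching, confirm your local credentials can actually list the bucket; the bucket name below is the one from the example above:
```bash
# S3
aws s3 ls s3://my-bucket/
# GCS, if the source were gs://
gsutil ls gs://my-bucket/
```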
### Slow file access
**Problem**: Reading from mount is very slow
**Solutions**:
```yaml
# Use COPY mode for small datasets
file_mounts:
  /data:
    source: s3://bucket/data
    mode: COPY  # Pre-fetch to local disk

# Use MOUNT_CACHED for outputs
file_mounts:
  /outputs:
    name: outputs
    store: s3
    mode: MOUNT_CACHED  # Cached writes
```
### Storage not persisting
**Error**: Data lost after cluster restart
**Solutions**:
```yaml
# Use named storage (persists across clusters)
file_mounts:
  /persistent:
    name: my-persistent-storage
    store: s3
    mode: MOUNT
# Data in ~/sky_workdir is NOT persisted;
# always use file_mounts for data that must outlive the cluster
```
## Managed Job Issues
### Job keeps failing
**Error**: Job fails and doesn't recover
**Solutions**:
```yaml
# Enable spot recovery (applies to managed jobs; see launch command below)
resources:
  use_spot: true
  spot_recovery: FAILOVER
# Newer SkyPilot releases express retries as:
#   resources:
#     job_recovery:
#       strategy: FAILOVER
#       max_restarts_on_errors: 5
# Implement checkpointing so restarts can resume
run: |
  python train.py \
    --checkpoint-dir /checkpoints \
    --resume-from-latest
```
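Recovery only applies to managed jobs, so launch through the jobs controller rather than directly with `sky launch`:
```bash
# Launch as a managed job so preemptions recover automatically
sky jobs launch -n my-training task.yaml
# Monitor its state
sky jobs queue
```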
### Job stuck in pending
**Error**: Job stays in PENDING state
**Solutions**:
```bash
# The jobs controller runs as a regular SkyPilot cluster
sky status --refresh
# Inspect the jobs queue
sky jobs queue
# View controller logs for the stuck job
sky jobs logs --controller JOB_ID
# If needed, restart the controller cluster (name shown by sky status)
sky start CONTROLLER_NAME
```
### Checkpoint not resuming
**Error**: Training restarts from beginning
**Solutions**:
```yaml
file_mounts:
  /checkpoints:
    name: training-checkpoints
    store: s3
    mode: MOUNT_CACHED
run: |
  # Resume from an existing checkpoint if one is present
  if [ -d "/checkpoints/latest" ]; then
    RESUME_FLAG="--resume /checkpoints/latest"
  else
    RESUME_FLAG=""
  fi
  python train.py $RESUME_FLAG --checkpoint-dir /checkpoints
```
## Sky Serve Issues
### Service not accessible
**Error**: Cannot reach service endpoint
**Solutions**:
```bash
# Check service status
sky serve status my-service
# View replica logs
sky serve logs my-service
# Print the service endpoint URL
sky serve status my-service --endpoint
```
### Replicas keep crashing
**Error**: Replicas fail health checks
**Solutions**:
```yaml
service:
  readiness_probe:
    path: /health
    initial_delay_seconds: 120  # Increase for slow model loading
    period_seconds: 30
    timeout_seconds: 10
run: |
  # The app must actually serve the probe path; minimal example
  # (assumes fastapi and uvicorn are installed in setup)
  cat > app.py <<'EOF'
  from fastapi import FastAPI

  app = FastAPI()

  @app.get('/health')
  def health():
      return {'status': 'ok'}
  EOF
  uvicorn app:app --host 0.0.0.0 --port 8080
```
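Once a replica reports ready, hit the probe path yourself through the service endpoint (the exact output format of `--endpoint` may vary by version):
```bash
# Fetch the endpoint and curl the health path
ENDPOINT=$(sky serve status my-service --endpoint)
curl "$ENDPOINT/health"
```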
### Autoscaling not working
**Problem**: Service doesn't scale up/down
**Solutions**:
```yaml
service:
  replica_policy:
    min_replicas: 1
    max_replicas: 10
    target_qps_per_replica: 2.0
    upscale_delay_seconds: 30    # Scale up faster
    downscale_delay_seconds: 60  # Scale down faster
# Monitor replica counts and traffic:
#   sky serve status my-service
```
## SSH and Access Issues
### Cannot SSH to cluster
**Error**: `Connection refused` or timeout
**Solutions**:
```bash
# Verify cluster is running
sky status
# Try with verbose output
ssh -v mycluster
# Check SSH key
ls -la ~/.ssh/sky-key*
# Regenerate SSH key if needed
sky launch -c test --dryrun # Regenerates key
```
### Port forwarding fails
**Error**: Cannot forward ports
**Solutions**:
```bash
# Correct syntax: -L local_port:remote_host:remote_port
ssh -L 8080:localhost:8080 mycluster
# For Jupyter
ssh -L 8888:localhost:8888 mycluster
# Multiple ports
ssh -L 8080:localhost:8080 -L 6006:localhost:6006 mycluster
```
## Cost and Billing Issues
### Unexpected charges
**Problem**: Higher than expected costs
**Solutions**:
```bash
# Always terminate unused clusters
sky down --all
# Set autostop (tear down after 30 idle minutes)
sky autostop mycluster -i 30 --down
# Prefer spot instances in the task YAML:
#   resources:
#     use_spot: true
```
### Spot instance preempted
**Error**: Instance terminated unexpectedly
**Solutions**:
```yaml
# Use managed jobs for automatic recovery:
# launch with `sky jobs launch` instead of `sky launch`
resources:
  use_spot: true
  spot_recovery: FAILOVER  # Auto-failover to another region/cloud
# Always checkpoint frequently when using spot
```
## Debugging Commands
### View cluster state
```bash
# Cluster status
sky status
sky status -a # Show all details
# Cluster resources
sky show-gpus
# Cloud credentials
sky check
```
### View logs
```bash
# Task logs
sky logs mycluster
sky logs mycluster 1 # Specific job
# Managed job logs
sky jobs logs JOB_ID
sky jobs logs --name my-job --follow
# Service logs
sky serve logs my-service
```
### Inspect cluster
```bash
# SSH to cluster
ssh mycluster
# Check GPU status
nvidia-smi
# Check processes
ps aux | grep python
# Check disk space
df -h
```
## Common Error Messages
| Error | Cause | Solution |
|-------|-------|----------|
| `No launchable resources` | No available instances | Try different region/cloud |
| `Quota exceeded` | Cloud quota limit | Request increase or use different cloud |
| `Setup failed` | Script error | Check logs, add error handling |
| `Connection refused` | Network/firewall | Check security groups, wait for init |
| `CUDA OOM` | Out of GPU memory | Use larger GPU or reduce batch size |
| `Spot preempted` | Spot instance reclaimed | Use managed jobs for auto-recovery |
| `Mount failed` | Storage access issue | Check credentials and bucket exists |
## Getting Help
1. **Documentation**: https://docs.skypilot.co
2. **GitHub Issues**: https://github.com/skypilot-org/skypilot/issues
3. **Slack**: https://slack.skypilot.co
4. **Examples**: https://github.com/skypilot-org/skypilot/tree/master/examples
### Reporting Issues
Include:
- SkyPilot version: `sky --version`
- Python version: `python --version`
- Cloud provider and region
- Full error traceback
- Task YAML (sanitized)
- Output of `sky check`
Source: claude-code-templates (MIT).