[ PROMPT_NODE_24909 ]
Infrastructure
[ SKILL_DOCUMENTATION ]
# Infrastructure Management
Comprehensive guide to server management, network operations, capacity planning, and infrastructure operations for IT teams.
## Table of Contents
- [Server Management](#server-management)
- [Network Operations](#network-operations)
- [Capacity Planning](#capacity-planning)
- [Storage Management](#storage-management)
- [Virtualization](#virtualization)
- [Cloud Infrastructure](#cloud-infrastructure)
- [Infrastructure as Code](#infrastructure-as-code)
- [Patching and Updates](#patching-and-updates)
- [Performance Optimization](#performance-optimization)
- [Cost Management](#cost-management)
## Server Management
### Server Lifecycle
```yaml
Phase 1: Procurement
Actions:
- Define requirements (CPU, RAM, storage, network)
- Select vendor (Dell, HP, Lenovo, etc.)
- Purchase or lease decision
- Order hardware
Timeline: 4-12 weeks
Phase 2: Provisioning
Actions:
- Receive and inventory hardware
- Rack and cable servers
- Install operating system
- Apply baseline configuration
- Install monitoring agents
- Document in CMDB
Timeline: 1-2 days per server
Phase 3: Deployment
Actions:
- Install application software
- Configure networking and firewall rules
- Set up backups
- Load balancer configuration
- Run acceptance tests
- Hand off to application team
Timeline: 2-5 days
Phase 4: Operations (2-5 years)
Actions:
- Monitor performance and health
- Apply security patches
- Perform maintenance
- Capacity planning
- Incident response
Timeline: 2-5 years typical hardware lifecycle
Phase 5: Decommissioning
Actions:
- Migrate workloads to new servers
- Backup all data
- Wipe drives (secure erase)
- Remove from monitoring
- Update CMDB
- Physical disposal or return
Timeline: 1-2 weeks
```
### Operating System Management
**Linux Server Setup (Ubuntu/RHEL)**:
```bash
#!/bin/bash
# Server baseline configuration script
set -e
echo "=== Server Baseline Configuration ==="
# 1. System Updates
echo "Updating system packages..."
apt-get update && apt-get upgrade -y # Ubuntu/Debian
# yum update -y # RHEL/CentOS
# 2. Set hostname
HOSTNAME="web-server-01.example.com"
hostnamectl set-hostname $HOSTNAME
echo "Hostname set to: $HOSTNAME"
# 3. Configure NTP for time synchronization
echo "Configuring NTP..."
timedatectl set-timezone UTC
apt-get install -y chrony
systemctl enable chrony
systemctl start chrony
# 4. Configure SSH hardening
echo "Hardening SSH configuration..."
sed -i 's/#PermitRootLogin yes/PermitRootLogin no/' /etc/ssh/sshd_config
sed -i 's/#PasswordAuthentication yes/PasswordAuthentication no/' /etc/ssh/sshd_config
sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
systemctl restart sshd
# 5. Configure firewall
echo "Configuring firewall..."
ufw default deny incoming
ufw default allow outgoing
ufw allow 2222/tcp # SSH
ufw allow 80/tcp # HTTP
ufw allow 443/tcp # HTTPS
ufw --force enable
# 6. Install monitoring agent
echo "Installing monitoring agent..."
wget -O /tmp/node_exporter.tar.gz https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz /tmp/node_exporter.tar.gz -C /opt/
cat > /etc/systemd/system/node_exporter.service <> /etc/rsyslog.d/50-remote.conf < /etc/iptables/rules.v4
echo "Firewall rules configured."
```
**Load Balancer Configuration (HAProxy)**:
```haproxy
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
log /dev/log local1 notice
chroot /var/lib/haproxy
stats socket /run/haproxy/admin.sock mode 660 level admin
stats timeout 30s
user haproxy
group haproxy
daemon
# SSL/TLS configuration
ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256
ssl-default-bind-options ssl-min-ver TLSv1.2
defaults
log global
mode http
option httplog
option dontlognull
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 408 /etc/haproxy/errors/408.http
errorfile 500 /etc/haproxy/errors/500.http
errorfile 502 /etc/haproxy/errors/502.http
errorfile 503 /etc/haproxy/errors/503.http
errorfile 504 /etc/haproxy/errors/504.http
# Frontend configuration (HTTPS)
frontend https_front
bind *:443 ssl crt /etc/haproxy/certs/example.com.pem
default_backend web_servers
# Rate limiting
stick-table type ip size 100k expire 30s store http_req_rate(10s)
http-request track-sc0 src
http-request deny deny_status 429 if { sc_http_req_rate(0) gt 100 }
# Security headers
http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
http-response set-header X-Frame-Options "SAMEORIGIN"
http-response set-header X-Content-Type-Options "nosniff"
# Backend configuration
backend web_servers
balance roundrobin
option httpchk GET /health HTTP/1.1rnHost: example.com
http-check expect status 200
server web01 10.0.10.20:80 check inter 5s rise 2 fall 3
server web02 10.0.10.21:80 check inter 5s rise 2 fall 3
server web03 10.0.10.22:80 check inter 5s rise 2 fall 3
# Statistics page
listen stats
bind *:8404
stats enable
stats uri /stats
stats refresh 30s
stats auth admin:password123
```
### Network Troubleshooting
**Network Diagnostic Commands**:
```bash
# Test connectivity
ping -c 4 8.8.8.8 # Basic connectivity
ping -c 4 google.com # DNS resolution + connectivity
# Trace route
traceroute google.com # Linux
tracert google.com # Windows
mtr google.com # Continuous traceroute (Linux)
# DNS troubleshooting
nslookup google.com # Basic DNS lookup
dig google.com # Detailed DNS query
dig @8.8.8.8 google.com # Query specific DNS server
# Port connectivity
telnet example.com 80 # Test if port is open
nc -zv example.com 80 # Netcat port scan
curl -v https://example.com # HTTP/HTTPS test with verbose output
# Network interfaces
ip addr show # Show IP addresses (Linux)
ip link show # Show interface status
ifconfig # Legacy interface info
ethtool eth0 # Interface details and statistics
# Routing
ip route show # Show routing table
route -n # Numeric routing table
netstat -rn # Routing table (legacy)
# Active connections
netstat -tuln # List listening ports
ss -tuln # Socket statistics (modern replacement)
lsof -i :80 # Show what's using port 80
# Packet capture
tcpdump -i eth0 port 80 # Capture HTTP traffic
tcpdump -i eth0 -w capture.pcap # Write to file
tcpdump -r capture.pcap # Read from file
# Bandwidth testing
iperf3 -s # Server mode
iperf3 -c server-ip # Client mode
# Network statistics
netstat -s # Protocol statistics
ss -s # Socket statistics summary
iftop # Real-time bandwidth by connection
```
## Capacity Planning
### Capacity Planning Process
```yaml
Step 1: Collect Baseline Data (Ongoing)
Metrics to Track:
- CPU utilization (%, by core)
- Memory utilization (GB, %)
- Disk I/O (IOPS, throughput)
- Network throughput (Mbps)
- Application metrics (requests/sec, users)
Time Ranges:
- Real-time (1-minute granularity)
- Daily averages (for trend analysis)
- Weekly averages (for seasonality)
- Monthly aggregates (for year-over-year)
Step 2: Analyze Trends (Monthly)
Questions to Answer:
- What is the growth rate? (linear, exponential, seasonal)
- When will current capacity be exhausted?
- What are the peak utilization periods?
- Are there any unusual spikes or patterns?
Analysis Methods:
- Linear regression (simple growth)
- Time series forecasting (seasonal patterns)
- Percentile analysis (p50, p95, p99)
Step 3: Forecast Future Demand (Quarterly)
Inputs:
- Historical growth trends
- Business projections (user growth, new features)
- Upcoming marketing campaigns or events
- Industry benchmarks
Forecasting Horizons:
- Short-term (3 months): High confidence
- Medium-term (6-12 months): Moderate confidence
- Long-term (12-24 months): Low confidence, scenario planning
Step 4: Capacity Modeling
Calculate Required Capacity:
- Current capacity
- Growth rate
- Target headroom (20-30%)
- Expected utilization after expansion
Example:
Current CPU utilization: 70%
Growth rate: 10% per month
In 6 months: 70% × (1.1)^6 = 124% (will exceed capacity)
Action: Add capacity within 3 months
Step 5: Plan and Execute (As Needed)
Options:
- Vertical scaling (add CPU/RAM to existing servers)
- Horizontal scaling (add more servers)
- Optimize application (reduce resource usage)
Considerations:
- Lead time (procurement, deployment)
- Budget approval process
- Maintenance windows
- Risk mitigation (pilot, canary, rollback plan)
```
### Capacity Planning Calculations
**CPU Capacity**:
```python
# CPU capacity planning calculator
def calculate_cpu_capacity(current_util_pct, growth_rate_monthly, months, target_headroom=0.30):
"""
Calculate when CPU capacity will be exhausted
Args:
current_util_pct: Current CPU utilization (0-1)
growth_rate_monthly: Monthly growth rate (e.g., 0.10 for 10%)
months: Forecast period in months
target_headroom: Desired headroom (0.30 = 30%)
Returns:
dict with forecast and recommendations
"""
import math
# Calculate future utilization
future_util = current_util_pct * ((1 + growth_rate_monthly) ** months)
# Calculate when capacity will be exhausted (reach 100%)
if growth_rate_monthly > 0:
months_to_exhaustion = math.log(1.0 / current_util_pct) / math.log(1 + growth_rate_monthly)
else:
months_to_exhaustion = float('inf')
# Calculate when to add capacity (to maintain headroom)
target_max_util = 1.0 - target_headroom
months_to_action = math.log(target_max_util / current_util_pct) / math.log(1 + growth_rate_monthly)
# Calculate required scaling factor
scaling_factor = future_util / target_max_util if future_util > target_max_util else 1.0
return {
'current_utilization_pct': current_util_pct * 100,
'forecasted_utilization_pct': future_util * 100,
'months_to_exhaustion': months_to_exhaustion,
'months_to_action': months_to_action,
'scaling_factor': scaling_factor,
'recommendation': 'Add capacity' if scaling_factor > 1.0 else 'No action needed'
}
# Example usage
result = calculate_cpu_capacity(
current_util_pct=0.65, # 65% current utilization
growth_rate_monthly=0.08, # 8% monthly growth
months=6, # 6-month forecast
target_headroom=0.30 # Maintain 30% headroom
)
print(f"Current Utilization: {result['current_utilization_pct']:.1f}%")
print(f"Forecasted Utilization (6 months): {result['forecasted_utilization_pct']:.1f}%")
print(f"Months Until Capacity Exhausted: {result['months_to_exhaustion']:.1f}")
print(f"Months Until Action Needed: {result['months_to_action']:.1f}")
print(f"Scaling Factor Required: {result['scaling_factor']:.2f}x")
print(f"Recommendation: {result['recommendation']}")
# Output:
# Current Utilization: 65.0%
# Forecasted Utilization (6 months): 103.3%
# Months Until Capacity Exhausted: 5.2
# Months Until Action Needed: 2.7
# Scaling Factor Required: 1.48x
# Recommendation: Add capacity
```
**Storage Capacity**:
```python
# Storage capacity planning
def calculate_storage_capacity(current_usage_gb, growth_rate_daily_gb, days, total_capacity_gb):
"""Calculate storage capacity forecast"""
future_usage_gb = current_usage_gb + (growth_rate_daily_gb * days)
utilization_pct = (future_usage_gb / total_capacity_gb) * 100
days_to_full = (total_capacity_gb - current_usage_gb) / growth_rate_daily_gb if growth_rate_daily_gb > 0 else float('inf')
return {
'current_usage_gb': current_usage_gb,
'current_utilization_pct': (current_usage_gb / total_capacity_gb) * 100,
'forecasted_usage_gb': future_usage_gb,
'forecasted_utilization_pct': utilization_pct,
'days_to_full': days_to_full,
'recommendation': 'Add storage' if utilization_pct > 80 else 'No action needed'
}
# Example: Database server storage
result = calculate_storage_capacity(
current_usage_gb=3500, # 3.5 TB currently used
growth_rate_daily_gb=15, # 15 GB per day growth
days=90, # 90-day forecast
total_capacity_gb=5000 # 5 TB total capacity
)
print(f"Current Usage: {result['current_usage_gb']} GB ({result['current_utilization_pct']:.1f}%)")
print(f"Forecasted Usage (90 days): {result['forecasted_usage_gb']} GB ({result['forecasted_utilization_pct']:.1f}%)")
print(f"Days Until Full: {result['days_to_full']:.0f}")
print(f"Recommendation: {result['recommendation']}")
# Output:
# Current Usage: 3500 GB (70.0%)
# Forecasted Usage (90 days): 4850 GB (97.0%)
# Days Until Full: 100
# Recommendation: Add storage
```
### Capacity Planning Dashboard Metrics
```yaml
CPU Capacity Dashboard:
- Current Utilization: Gauge (0-100%)
- 30-Day Trend: Line graph
- Growth Rate: Percentage per month
- Months Until 80% Capacity: Number
- Peak Utilization: Histogram (by hour of day)
Memory Capacity Dashboard:
- Current Utilization: Gauge (0-100%)
- Available Memory: GB
- Memory Pressure Events: Count per day
- Top Memory Consumers: Table (process, usage)
Storage Capacity Dashboard:
- Disk Usage by Volume: Bar chart
- Growth Rate: GB per day
- Days Until Full: Number (by volume)
- Largest Files/Directories: Table
Network Capacity Dashboard:
- Bandwidth Utilization: Gauge (% of total)
- Peak Throughput: Mbps
- Connection Count: Time series
- Network Errors: Count per minute
```
## Storage Management
### Storage Types and Use Cases
```yaml
Direct Attached Storage (DAS):
Description: Storage directly connected to server (internal drives)
Use Cases:
- Operating system
- Local caching
- Temporary files
Pros: Fast, simple, low cost
Cons: Not shared, limited capacity, no redundancy
Network Attached Storage (NAS):
Description: File-level storage over network (NFS, SMB/CIFS)
Use Cases:
- File shares
- Home directories
- Backup target
Pros: Easy to share, centralized management
Cons: Network dependent, file-level only
Storage Area Network (SAN):
Description: Block-level storage over dedicated network (FC, iSCSI)
Use Cases:
- Databases
- Virtual machine storage
- High-performance applications
Pros: High performance, flexible, scalable
Cons: Expensive, complex
Object Storage:
Description: Object/blob storage with metadata (S3, Azure Blob)
Use Cases:
- Backups
- Archives
- Media files
- Static website content
Pros: Unlimited scale, durable, cost-effective
Cons: Higher latency, no POSIX filesystem
```
### RAID Configurations
```yaml
RAID 0 (Striping):
Configuration: Data split across drives
Minimum Drives: 2
Usable Capacity: 100%
Performance: Excellent (read & write)
Redundancy: None (any drive failure = data loss)
Use Case: Non-critical, high-performance (caching)
RAID 1 (Mirroring):
Configuration: Identical copies on each drive
Minimum Drives: 2
Usable Capacity: 50%
Performance: Good reads, moderate writes
Redundancy: Can lose 1 drive
Use Case: OS drives, critical data, small arrays
RAID 5 (Striping with Parity):
Configuration: Data + parity distributed across drives
Minimum Drives: 3
Usable Capacity: (N-1)/N (e.g., 3 drives = 67%)
Performance: Good reads, moderate writes
Redundancy: Can lose 1 drive
Use Case: File servers, general purpose
RAID 6 (Striping with Double Parity):
Configuration: Data + 2 parity blocks distributed
Minimum Drives: 4
Usable Capacity: (N-2)/N (e.g., 4 drives = 50%)
Performance: Good reads, slower writes
Redundancy: Can lose 2 drives
Use Case: Large arrays, critical data
RAID 10 (1+0, Mirrored Stripes):
Configuration: Striped set of mirrors
Minimum Drives: 4
Usable Capacity: 50%
Performance: Excellent (read & write)
Redundancy: Can lose 1 drive per mirror
Use Case: Databases, high-performance applications
Recommendation:
- OS drives: RAID 1 (or RAID 10 for performance)
- Database: RAID 10 (best performance + redundancy)
- File servers: RAID 5 or RAID 6 (capacity + redundancy)
- Backup: RAID 6 (large capacity, double redundancy)
```
### LVM (Logical Volume Management)
```bash
# LVM Setup (Linux)
# 1. Initialize physical volumes
pvcreate /dev/sdb
pvcreate /dev/sdc
pvdisplay
# 2. Create volume group
vgcreate data_vg /dev/sdb /dev/sdc
vgdisplay data_vg
# 3. Create logical volumes
lvcreate -L 500G -n database_lv data_vg
lvcreate -L 1T -n backups_lv data_vg
lvdisplay
# 4. Create filesystems
mkfs.ext4 /dev/data_vg/database_lv
mkfs.xfs /dev/data_vg/backups_lv
# 5. Mount filesystems
mkdir -p /data/database /data/backups
mount /dev/data_vg/database_lv /data/database
mount /dev/data_vg/backups_lv /data/backups
# 6. Add to /etc/fstab for persistence
echo "/dev/data_vg/database_lv /data/database ext4 defaults 0 2" >> /etc/fstab
echo "/dev/data_vg/backups_lv /data/backups xfs defaults 0 2" >> /etc/fstab
# Expand logical volume (online resize)
lvextend -L +200G /dev/data_vg/database_lv
resize2fs /dev/data_vg/database_lv # ext4
xfs_growfs /data/backups # xfs
# Create snapshot (for backups)
lvcreate -L 50G -s -n database_snap /dev/data_vg/database_lv
mount /dev/data_vg/database_snap /mnt/snapshot
# ... perform backup from /mnt/snapshot ...
umount /mnt/snapshot
lvremove /dev/data_vg/database_snap
```
## Virtualization
### Virtualization Platforms
```yaml
VMware vSphere/ESXi:
Type: Type-1 Hypervisor (bare metal)
Pros: Mature, feature-rich, excellent management (vCenter)
Cons: Expensive licensing
Use Case: Enterprise environments, large deployments
KVM (Kernel-based Virtual Machine):
Type: Type-1 Hypervisor (Linux kernel module)
Pros: Open source, high performance, flexible
Cons: Management tools less mature than VMware
Use Case: Linux-heavy environments, cost-conscious
Microsoft Hyper-V:
Type: Type-1 Hypervisor
Pros: Tight Windows integration, free with Windows Server
Cons: Linux guest support limited
Use Case: Windows-heavy environments
Proxmox VE:
Type: Type-1 Hypervisor (KVM + LXC)
Pros: Open source, web UI, container support
Cons: Smaller ecosystem than VMware
Use Case: Small to medium deployments, mixed VM/container
```
### VM Management with KVM/QEMU
```bash
# Install KVM on Ubuntu
apt-get install -y qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils virt-manager
# Start libvirt service
systemctl enable libvirtd
systemctl start libvirtd
# Create VM from command line
virt-install
--name web-server-vm
--ram 4096
--vcpus 2
--disk path=/var/lib/libvirt/images/web-server.qcow2,size=50
--os-type linux
--os-variant ubuntu20.04
--network bridge=br0
--graphics vnc,listen=0.0.0.0
--console pty,target_type=serial
--cdrom /var/lib/libvirt/images/ubuntu-20.04-server.iso
# List VMs
virsh list --all
# Start/stop VM
virsh start web-server-vm
virsh shutdown web-server-vm
virsh destroy web-server-vm # force stop
# Connect to VM console
virsh console web-server-vm
# Clone VM
virt-clone
--original web-server-vm
--name web-server-vm-clone
--file /var/lib/libvirt/images/web-server-clone.qcow2
# Take snapshot
virsh snapshot-create-as web-server-vm snapshot1 "Before upgrade"
# List snapshots
virsh snapshot-list web-server-vm
# Revert to snapshot
virsh snapshot-revert web-server-vm snapshot1
# Export VM (backup)
virsh dumpxml web-server-vm > web-server-vm.xml
cp /var/lib/libvirt/images/web-server.qcow2 /backups/
# Import VM (restore)
virsh define web-server-vm.xml
cp /backups/web-server.qcow2 /var/lib/libvirt/images/
```
## Cloud Infrastructure
### Cloud Provider Comparison
| Feature | AWS | Azure | GCP |
|---------|-----|-------|-----|
| **Market Share** | ~32% | ~23% | ~10% |
| **Compute** | EC2 | Virtual Machines | Compute Engine |
| **Containers** | ECS, EKS | AKS | GKE |
| **Serverless** | Lambda | Functions | Cloud Functions |
| **Storage (Object)** | S3 | Blob Storage | Cloud Storage |
| **Storage (Block)** | EBS | Managed Disks | Persistent Disks |
| **Database (SQL)** | RDS | SQL Database | Cloud SQL |
| **Database (NoSQL)** | DynamoDB | Cosmos DB | Firestore/Bigtable |
| **Networking** | VPC | Virtual Network | VPC |
| **Load Balancer** | ELB/ALB | Load Balancer | Cloud Load Balancing |
| **DNS** | Route 53 | DNS | Cloud DNS |
| **CDN** | CloudFront | CDN | Cloud CDN |
| **Pricing** | $$$ | $$$ | $$$ |
### AWS EC2 Management
```bash
# AWS CLI - EC2 Management
# List instances
aws ec2 describe-instances
--filters "Name=tag:Environment,Values=production"
--query 'Reservations[*].Instances[*].[InstanceId,InstanceType,State.Name,PrivateIpAddress]'
--output table
# Start instance
aws ec2 start-instances --instance-ids i-1234567890abcdef0
# Stop instance
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
# Create AMI (backup/template)
aws ec2 create-image
--instance-id i-1234567890abcdef0
--name "web-server-backup-$(date +%Y%m%d)"
--description "Backup before upgrade"
# Launch new instance from AMI
aws ec2 run-instances
--image-id ami-0abcdef1234567890
--count 1
--instance-type t3.medium
--key-name my-key-pair
--security-group-ids sg-0123456789abcdef0
--subnet-id subnet-0123456789abcdef0
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=web-server-03}]'
# Create snapshot of EBS volume
aws ec2 create-snapshot
--volume-id vol-1234567890abcdef0
--description "Daily backup"
# Modify instance type (requires stop)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0
aws ec2 modify-instance-attribute
--instance-id i-1234567890abcdef0
--instance-type "{"Value": "t3.large"}"
aws ec2 start-instances --instance-ids i-1234567890abcdef0
```
## Infrastructure as Code
### Terraform Example
```hcl
# main.tf - Web server infrastructure
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
bucket = "my-terraform-state"
key = "web-servers/terraform.tfstate"
region = "us-east-1"
}
}
provider "aws" {
region = var.aws_region
}
# Variables
variable "aws_region" {
default = "us-east-1"
}
variable "instance_count" {
default = 3
}
variable "instance_type" {
default = "t3.medium"
}
# Data source - Latest Ubuntu AMI
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*"]
}
}
# Security Group
resource "aws_security_group" "web" {
name = "web-servers-sg"
description = "Security group for web servers"
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["10.0.0.0/8"] # Internal only
}
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "web-servers-sg"
Environment = "production"
}
}
# EC2 Instances
resource "aws_instance" "web" {
count = var.instance_count
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
vpc_security_group_ids = [aws_security_group.web.id]
user_data = file("${path.module}/user-data.sh")
root_block_device {
volume_size = 50
volume_type = "gp3"
}
tags = {
Name = "web-server-${count.index + 1}"
Environment = "production"
Role = "web"
}
}
# Load Balancer
resource "aws_lb" "web" {
name = "web-lb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.web.id]
subnets = data.aws_subnets.default.ids
}
resource "aws_lb_target_group" "web" {
name = "web-tg"
port = 80
protocol = "HTTP"
vpc_id = data.aws_vpc.default.id
health_check {
path = "/health"
healthy_threshold = 2
unhealthy_threshold = 10
}
}
resource "aws_lb_target_group_attachment" "web" {
count = var.instance_count
target_group_arn = aws_lb_target_group.web.arn
target_id = aws_instance.web[count.index].id
port = 80
}
resource "aws_lb_listener" "web" {
load_balancer_arn = aws_lb.web.arn
port = "80"
protocol = "HTTP"
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.web.arn
}
}
# Outputs
output "instance_ips" {
value = aws_instance.web[*].private_ip
}
output "load_balancer_dns" {
value = aws_lb.web.dns_name
}
```
## Patching and Updates
### Patch Management Process
```yaml
Phase 1: Planning (Monthly)
Actions:
- Review vendor security bulletins
- Identify critical and high-priority patches
- Test patches in dev/staging environment
- Schedule maintenance window
- Get change approval
Prioritization:
Critical: Security vulnerabilities (CVSS 9-10) - Apply within 7 days
High: Security vulnerabilities (CVSS 7-8) - Apply within 30 days
Medium: Bugs, moderate vulnerabilities - Apply within 90 days
Low: Feature updates, minor fixes - Apply on regular schedule
Phase 2: Testing (1-2 weeks before production)
Actions:
- Deploy patches to non-production environment
- Run automated tests
- Perform manual smoke tests
- Monitor for unexpected issues
- Document any compatibility issues
Test Criteria:
- Application starts successfully
- All critical functionality works
- No performance degradation
- No new errors in logs
Phase 3: Deployment (Maintenance Window)
Actions:
- Communicate to stakeholders
- Take pre-patch snapshot/backup
- Deploy patches in stages (canary approach)
- Monitor system health
- Validate functionality
- Document results
Rollout Strategy:
- Non-production: 100% at once
- Production: 10% → 50% → 100% with monitoring
Phase 4: Validation (Post-deployment)
Actions:
- Run post-patch tests
- Monitor for 24-48 hours
- Check error rates, performance metrics
- Rollback if issues detected
- Document lessons learned
```
### Automated Patching Scripts
**Linux (Ubuntu/Debian)**:
```bash
#!/bin/bash
# Automated patch management script
set -e
LOG_FILE="/var/log/patch-management.log"
EMAIL_TO="[email protected]"
log() {
echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE
}
# Pre-patch checks
log "Starting pre-patch checks..."
df -h > /tmp/disk-before.txt
free -h > /tmp/memory-before.txt
systemctl list-units --state=failed > /tmp/failed-services-before.txt
# Create snapshot (if using LVM)
log "Creating LVM snapshot..."
lvcreate -L 10G -s -n root_snap /dev/vg0/root
# Update package list
log "Updating package list..."
apt-get update
# List available updates
log "Available updates:"
apt list --upgradable | tee -a $LOG_FILE
# Install security updates only
log "Installing security updates..."
unattended-upgrade -d
# Or install all updates:
# apt-get upgrade -y
# Check if reboot required
if [ -f /var/run/reboot-required ]; then
log "Reboot required after patching"
cat /var/run/reboot-required.pkgs >> $LOG_FILE
# Schedule reboot (or reboot immediately)
log "Scheduling reboot in 5 minutes..."
shutdown -r +5 "System reboot for patches"
fi
# Post-patch validation
log "Running post-patch validation..."
systemctl list-units --state=failed > /tmp/failed-services-after.txt
# Compare before/after
if diff /tmp/failed-services-before.txt /tmp/failed-services-after.txt > /dev/null; then
log "No new failed services after patching"
else
log "WARNING: New failed services detected!"
diff /tmp/failed-services-before.txt /tmp/failed-services-after.txt | tee -a $LOG_FILE
fi
# Email report
mail -s "Patch Report: $(hostname)" $EMAIL_TO < $LOG_FILE
log "Patching complete"
```
**Windows (PowerShell)**:
```powershell
# Automated Windows patching script
$LogFile = "C:Logspatch-management.log"
$EmailTo = "[email protected]"
function Write-Log {
param([string]$Message)
$timestamp = Get-Date -Format "yyyy-MM-dd HH:mm:ss"
$logMessage = "[$timestamp] $Message"
Write-Host $logMessage
Add-Content -Path $LogFile -Value $logMessage
}
# Install PSWindowsUpdate module if not present
if (-not (Get-Module -ListAvailable -Name PSWindowsUpdate)) {
Write-Log "Installing PSWindowsUpdate module..."
Install-Module PSWindowsUpdate -Force
}
Import-Module PSWindowsUpdate
# Pre-patch checks
Write-Log "Starting pre-patch checks..."
Get-Service | Where-Object {$_.Status -eq "Stopped"} | Out-File C:Tempstopped-services-before.txt
# Create system restore point
Write-Log "Creating system restore point..."
Checkpoint-Computer -Description "Before Windows Updates" -RestorePointType MODIFY_SETTINGS
# Get available updates
Write-Log "Checking for updates..."
$updates = Get-WindowsUpdate
Write-Log "Available updates: $($updates.Count)"
$updates | Format-Table Title, KB, Size | Out-String | Write-Log
# Install updates (excluding driver updates)
Write-Log "Installing updates..."
Install-WindowsUpdate -AcceptAll -IgnoreReboot -NotCategory "Drivers" | Out-String | Write-Log
# Check if reboot required
if (Get-WURebootStatus -Silent) {
Write-Log "Reboot required after updates"
# Schedule reboot (or reboot immediately)
Write-Log "Scheduling reboot in 5 minutes..."
shutdown /r /t 300 /c "System reboot for Windows Updates"
}
# Post-patch validation
Write-Log "Running post-patch validation..."
Get-Service | Where-Object {$_.Status -eq "Stopped"} | Out-File C:Tempstopped-services-after.txt
# Email report
Send-MailMessage `
-From "[email protected]" `
-To $EmailTo `
-Subject "Patch Report: $env:COMPUTERNAME" `
-Body (Get-Content $LogFile | Out-String) `
-SmtpServer "smtp.example.com"
Write-Log "Patching complete"
```
## Performance Optimization
### System Performance Tuning
**Linux Kernel Tuning**:
```bash
# /etc/sysctl.conf - Kernel parameter tuning
# Network tuning
net.core.somaxconn = 4096 # Max socket connections
net.core.netdev_max_backlog = 5000 # Network device queue
net.ipv4.tcp_max_syn_backlog = 8192 # SYN backlog queue
net.ipv4.tcp_fin_timeout = 15 # FIN timeout (default 60)
net.ipv4.tcp_keepalive_time = 300 # Keep-alive time
net.ipv4.tcp_tw_reuse = 1 # Reuse TIME_WAIT sockets
net.ipv4.ip_local_port_range = 10240 65535 # Ephemeral port range
# Memory tuning
vm.swappiness = 10 # Reduce swap usage (default 60)
vm.dirty_ratio = 15 # Max dirty pages before write
vm.dirty_background_ratio = 5 # Background write threshold
# File system tuning
fs.file-max = 500000 # Max open files system-wide
fs.inotify.max_user_watches = 524288 # Max inotify watches
# Apply changes
sysctl -p
```
**Application Tuning (Nginx Example)**:
```nginx
# /etc/nginx/nginx.conf - Performance tuning
user www-data;
worker_processes auto; # One per CPU core
worker_rlimit_nofile 65535;
events {
worker_connections 4096;
use epoll; # Efficient event model on Linux
multi_accept on;
}
http {
# Basic settings
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout 65;
types_hash_max_size 2048;
server_tokens off; # Security: hide version
# Buffer sizes
client_body_buffer_size 128k;
client_max_body_size 50m;
client_header_buffer_size 1k;
large_client_header_buffers 4 16k;
output_buffers 1 32k;
postpone_output 1460;
# Timeouts
client_body_timeout 12;
client_header_timeout 12;
send_timeout 10;
# Gzip compression
gzip on;
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml;
# Caching
open_file_cache max=200000 inactive=20s;
open_file_cache_valid 30s;
open_file_cache_min_uses 2;
open_file_cache_errors on;
# Rate limiting
limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=addr:10m;
server {
listen 80;
location / {
limit_req zone=one burst=20 nodelay;
limit_conn addr 10;
proxy_pass http://backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_buffering on;
proxy_buffer_size 4k;
proxy_buffers 24 4k;
proxy_busy_buffers_size 8k;
}
}
}
```
## Cost Management
### Cloud Cost Optimization Strategies
```yaml
1. Right-Sizing:
- Analyze resource utilization (CPU, memory)
- Downsize over-provisioned instances
- Upsize under-provisioned instances (to avoid performance issues)
Tools:
- AWS: AWS Compute Optimizer
- Azure: Azure Advisor
- GCP: Recommender
Expected Savings: 20-40%
2. Reserved Instances / Savings Plans:
- Commit to 1-year or 3-year usage
- Save up to 72% vs on-demand
- Analyze usage patterns first
Best For:
- Steady-state workloads (production databases, web servers)
- Don't use for: Dev/test, variable workloads
Expected Savings: 30-70%
3. Spot Instances:
- Use spare cloud capacity at discounted rates (up to 90% off)
- Can be interrupted with 2-minute notice
Best For:
- Batch processing, big data, CI/CD
- Stateless, fault-tolerant workloads
Expected Savings: 50-90%
4. Auto-Scaling:
- Scale down during off-hours
- Scale up during peak demand
Example Schedule:
- Business hours (8am-6pm): 10 instances
- Off-hours (6pm-8am): 3 instances
- Weekends: 2 instances
Expected Savings: 30-50%
5. Storage Optimization:
- Delete unused EBS volumes and snapshots
- Move infrequently accessed data to cheaper tiers
- S3 Standard → S3 Infrequent Access → S3 Glacier
- Enable S3 lifecycle policies
Expected Savings: 20-60% on storage
6. Serverless:
- Replace idle servers with Lambda/Functions
- Pay only for execution time
Best For:
- APIs with variable load
- Event-driven processing
- Scheduled tasks
Expected Savings: 50-80% for low-to-moderate traffic
```
### Cost Monitoring Dashboard
```yaml
Cloud Cost Dashboard (Monthly):
Top Spenders:
- Service breakdown (EC2, RDS, S3, etc.)
- Top 10 resources by cost
- Cost by team/project (using tags)
Trend Analysis:
- Month-over-month cost change
- Year-over-year comparison
- Forecast for next 3 months
Waste Identification:
- Unused resources (stopped instances, unattached volumes)
- Over-provisioned resources (< 30% utilization)
- Untagged resources
Savings Opportunities:
- RI/Savings Plan recommendations
- Right-sizing recommendations
- Storage tier recommendations
Budget Alerts:
- Warning at 80% of budget
- Critical at 100% of budget
- Forecast to exceed budget
```
This comprehensive infrastructure management guide provides all the necessary knowledge and tools for effective IT operations.
Source: claude-code-templates (MIT). See About Us for full credits.