Clustering Guide

This guide covers setting up and managing clustered environments for scalable, distributed systems using Lambda Softworks' automation scripts.

Clustering Basics

Core Concepts

  1. Node Management

    • Node roles
    • Node discovery
    • Health monitoring
    • Resource allocation
  2. Data Distribution

    • Sharding strategies
    • Replication methods
    • Consistency models
    • Data balancing
  3. Service Orchestration

    • Service discovery
    • Load balancing
    • Service scaling
    • Resource scheduling
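
Round-robin selection is the simplest form of the load distribution mentioned above. The sketch below is illustrative only (the node names are placeholders, and this is not part of the Lambda Softworks tooling): it cycles through a fixed node list, wrapping back to the first node after the last.

```shell
#!/usr/bin/env bash
set -u

# Placeholder node list; a real balancer would discover these dynamically.
NODES=(node1 node2 node3)
counter=0

# Return the next node in round-robin order and advance the counter.
next_node() {
  local idx=$(( counter % ${#NODES[@]} ))
  counter=$(( counter + 1 ))
  echo "${NODES[$idx]}"
}

# Four requests: the fourth wraps back to node1.
picks=$(next_node; next_node; next_node; next_node)
echo "$picks"
```

In practice the counter would live in shared state (or the balancer itself), but the wrap-around arithmetic is the whole idea.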

Cluster Setup

Basic Configuration

# Initialize cluster
./cluster-setup.sh --init \
  --nodes "node1,node2,node3" \
  --services "web,db,cache" \
  --network "10.0.0.0/24"

# Add node to cluster
./cluster-setup.sh --add-node \
  --name node4 \
  --ip 10.0.0.14 \
  --role worker

Advanced Configuration

# Configure advanced features
./cluster-setup.sh --advanced \
  --scheduler kubernetes \
  --storage-driver ceph \
  --network-plugin calico

# Set up monitoring
./cluster-setup.sh --monitoring \
  --prometheus \
  --grafana \
  --alert-manager

Configuration Files

Basic Cluster Configuration

# /etc/lambdasoftworks/cluster/config.yml
cluster:
  name: "production-cluster"
  version: "1.0"
  
  nodes:
    master:
      - name: "master1"
        ip: "10.0.0.11"
        role: "control-plane"
        labels:
          type: "master"
          zone: "us-east-1a"
          
      - name: "master2"
        ip: "10.0.0.12"
        role: "control-plane"
        labels:
          type: "master"
          zone: "us-east-1b"
          
    worker:
      - name: "worker1"
        ip: "10.0.0.21"
        role: "worker"
        labels:
          type: "app"
          zone: "us-east-1a"
          
      - name: "worker2"
        ip: "10.0.0.22"
        role: "worker"
        labels:
          type: "db"
          zone: "us-east-1b"
  
  networking:
    pod_cidr: "172.16.0.0/16"
    service_cidr: "172.17.0.0/16"
    dns_domain: "cluster.local"
    
  services:
    - name: "web"
      replicas: 3
      ports:
        - 80
        - 443
      
    - name: "db"
      replicas: 2
      ports:
        - 3306
      storage: "100Gi"
      
    - name: "cache"
      replicas: 3
      ports:
        - 6379
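
Before initializing a cluster from a config like the one above, it helps to sanity-check the node inventory. This sketch writes a trimmed copy of the config to a temp file and pulls out the `ip:` fields with a quick grep; a real validator should use a proper YAML parser instead.

```shell
#!/usr/bin/env bash
set -eu

# Trimmed copy of the config above, written to a temp file for the demo.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
cluster:
  nodes:
    master:
      - name: "master1"
        ip: "10.0.0.11"
    worker:
      - name: "worker1"
        ip: "10.0.0.21"
      - name: "worker2"
        ip: "10.0.0.22"
EOF

# Pull every IPv4 address; crude, but enough to spot duplicates or typos.
ips=$(grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' "$cfg")
node_count=$(printf '%s\n' "$ips" | wc -l | tr -d ' ')
echo "$ips"
echo "nodes: $node_count"
rm -f "$cfg"
```

Piping the extracted addresses through `sort | uniq -d` is a quick way to catch an IP accidentally assigned to two nodes.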

Advanced Cluster Configuration

# /etc/lambdasoftworks/cluster/advanced-config.yml
cluster:
  name: "enterprise-cluster"
  
  control_plane:
    api_server:
      bind_address: "0.0.0.0"
      secure_port: 6443
      max_requests: 1500
      
    controller_manager:
      concurrent_deployment_syncs: 10
      deployment_controller_sync_period: 30s
      
    scheduler:
      algorithm:
        type: "DefaultProvider"
        policy:
          name: "LeastRequestedPriority"
          weight: 1
  
  networking:
    plugin: "calico"
    config:
      mtu: 1440
      ipip_mode: "Always"
      nat_outgoing: true
      
    load_balancer:
      type: "metallb"
      address_pools:
        - name: "default"
          protocol: "layer2"
          addresses:
            - "192.168.1.240-192.168.1.250"
    
    ingress:
      controller: "nginx"
      config:
        use_proxy_protocol: true
        enable_ssl_passthrough: true
  
  storage:
    class: "ceph-rbd"
    provisioner: "rbd.csi.ceph.com"
    parameters:
      clusterID: "ceph-cluster"
      pool: "rbd-pool"
      imageFormat: "2"
      imageFeatures: "layering"
    
  monitoring:
    prometheus:
      retention: "30d"
      storage: "100Gi"
      scrape_interval: "15s"
      
    grafana:
      admin_user: "admin"
      plugins:
        - "grafana-piechart-panel"
        - "grafana-clock-panel"
      
    alert_manager:
      receivers:
        - name: "slack"
          slack_configs:
            - channel: "#alerts"
        - name: "email"
          email_configs:
            - to: "admin@example.com"
  
  logging:
    engine: "elasticsearch"
    retention: "14d"
    indices:
      - name: "application"
        pattern: "app-logs-*"
        retention: "7d"
      - name: "system"
        pattern: "system-logs-*"
        retention: "30d"
    
  backup:
    schedule: "0 2 * * *"
    retention: "7d"
    storage:
      type: "s3"
      bucket: "cluster-backups"
      region: "us-east-1"
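
The backup `schedule` uses standard five-field cron syntax, so `0 2 * * *` means daily at 02:00. A quick way to read the fields when reviewing a config:

```shell
#!/usr/bin/env bash
set -eu

schedule="0 2 * * *"   # minute hour day-of-month month day-of-week

# Split the five cron fields into named variables.
read -r minute hour dom month dow <<< "$schedule"
echo "runs at $hour:$(printf '%02d' "$minute") (dom=$dom month=$month dow=$dow)"
```

A `*` in the last three fields means "every", so only the minute and hour constrain this schedule.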

Service Configuration

Web Service Cluster

# /etc/lambdasoftworks/cluster/web-service.yml
apiVersion: v1
kind: Service
metadata:
  name: web-service
  namespace: production
spec:
  selector:
    app: web
  ports:
    - name: http
      port: 80
      targetPort: 80
  type: LoadBalancer

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:latest  # pin a specific tag in production
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 30
            periodSeconds: 10
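
With 3 replicas each requesting 500m CPU and 512Mi memory, this deployment reserves 1.5 CPU cores and 1.5Gi of memory at steady state. A quick arithmetic check like the following (plain shell math, not a kubectl feature) helps confirm the cluster has headroom before scaling up:

```shell
#!/usr/bin/env bash
set -eu

# Values from the deployment spec above.
replicas=3
cpu_request_m=500    # 500m CPU per pod
mem_request_mi=512   # 512Mi memory per pod

total_cpu_m=$(( replicas * cpu_request_m ))
total_mem_mi=$(( replicas * mem_request_mi ))

echo "total requests: ${total_cpu_m}m CPU, ${total_mem_mi}Mi memory"
```

The same arithmetic against the `limits` values gives the worst-case footprint (3 CPU, 3Gi here), which matters when nodes are densely packed.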

Database Cluster

# /etc/lambdasoftworks/cluster/database-cluster.yml
apiVersion: mysql.presslabs.org/v1alpha1
kind: MysqlCluster
metadata:
  name: mysql-cluster
  namespace: production
spec:
  replicas: 2
  secretName: mysql-secret
  
  volumeSpec:
    persistentVolumeClaim:
      storageClassName: ceph-rbd
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 100Gi
          
  mysqlConf:
    innodb-buffer-pool-size: 4G
    max_connections: 1000
    
  podSpec:
    resources:
      requests:
        cpu: "2000m"
        memory: "4Gi"
      limits:
        cpu: "4000m"
        memory: "8Gi"

Management Commands

Cluster Management

# Check cluster status
./cluster-manage.sh --status \
  --components all \
  --format detailed

# Scale service
./cluster-manage.sh --scale \
  --service web \
  --replicas 5 \
  --wait

Node Management

# Node maintenance
./cluster-manage.sh --maintenance \
  --node worker1 \
  --drain \
  --timeout 5m

# Add node
./cluster-manage.sh --add-node \
  --name worker3 \
  --role worker \
  --zone us-east-1c

Monitoring

Setup Monitoring

# Deploy monitoring stack
./cluster-monitor.sh --deploy \
  --components "prometheus,grafana,alertmanager" \
  --storage-size 100Gi

# Configure alerts
./cluster-monitor.sh --configure-alerts \
  --rules-file alerts.yml \
  --notification-channels "slack,email"

Example Alert Rules

# /etc/lambdasoftworks/cluster/alerts.yml
groups:
  - name: node
    rules:
      - alert: NodeDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"

      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          
  - name: pod
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
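
Before loading a rules file, validate it: if Prometheus tooling is installed, `promtool check rules alerts.yml` is the proper check. As a cruder structural sanity check, you can count the alert definitions; this sketch does so against a trimmed copy of the file above.

```shell
#!/usr/bin/env bash
set -eu

# Trimmed copy of the rules file above, written to a temp file for the demo.
rules=$(mktemp)
cat > "$rules" <<'EOF'
groups:
  - name: node
    rules:
      - alert: NodeDown
      - alert: HighCPUUsage
  - name: pod
    rules:
      - alert: PodCrashLooping
EOF

# Count alert definitions; this catches an accidentally deleted block,
# but only 'promtool check rules' validates the expressions themselves.
alert_count=$(grep -c -- '- alert:' "$rules")
echo "alerts defined: $alert_count"
rm -f "$rules"
```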

Best Practices

Design Principles

  1. Scalability

    • Horizontal scaling
    • Automated scaling
    • Resource optimization
    • Load distribution
  2. Reliability

    • Node redundancy
    • Service resilience
    • Data replication
    • Failure recovery
  3. Security

    • Network policies
    • Access control
    • Secret management
    • Audit logging

Operational Guidelines

  1. Deployment

    • Rolling updates
    • Canary deployments
    • Blue-green deployments
    • Rollback procedures
  2. Monitoring

    • Resource monitoring
    • Performance metrics
    • Log aggregation
    • Alert management
  3. Maintenance

    • Regular updates
    • Security patches
    • Backup verification
    • Capacity planning

Troubleshooting

Common Issues

  1. Node Problems

# Check node health
./cluster-manage.sh --diagnose-node \
  --node worker1 \
  --verbose

# Reset node
./cluster-manage.sh --reset-node \
  --node worker1 \
  --force

  2. Service Issues

# Debug service
./cluster-manage.sh --debug-service \
  --service web \
  --logs \
  --events

# Restart service
./cluster-manage.sh --restart-service \
  --service web \
  --rolling
