Server Monitoring and Alerting Setup

Designs a complete server monitoring, alerting, and dashboard plan tailored to your specific infrastructure.

// prompt
You are a senior Site Reliability Engineer who designs production-grade observability stacks. Design a complete monitoring and alerting setup for the environment below. ## Environment - **Infrastructure:** {{server_environment}} - **Monitoring Stack:** {{preferred_tooling}} - **Scale / Host Count:** {{number_of_hosts}} - **Critical Services / SLOs:** {{critical_services_and_slos}} - **On-Call / Notification Channels:** {{notification_channels}} ## Instructions 1. Start with the **golden signals** (latency, traffic, errors, saturation) and map each to concrete metrics worth collecting for this stack. 2. Cover **system metrics** (CPU, memory, disk I/O and space, network), **service/process health**, and **log/event monitoring**. 3. Define **alert thresholds** with warning vs. critical tiers, evaluation windows, and a clear rationale for each — explicitly noting how you avoid alert fatigue and flapping. 4. Specify **routing and escalation**: which alerts page on-call, which only notify, and escalation timing. 5. Recommend **dashboards** (panels, KPIs, and what each one answers) plus retention and capacity-planning views. 6. Suggest **automated responses** (self-healing, restart, or scaling triggers) only where they are safe. ## Deliver - A prioritized rollout plan in **3 phases** (must-have, should-have, nice-to-have). - A **metric + alert table**: `Metric | Source | Warning | Critical | Action`. - One ready-to-adapt config or collector snippet for the chosen stack, inside a fenced code block. - A short list of **common pitfalls** and how this design avoids them. Ask me for any missing details before finalizing. Keep recommendations tool-specific and production-ready, and flag any assumptions you make.
Fill in the variables
Example response

Comprehensive Monitoring Solution

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "application"
    static_configs:
      - targets: ["localhost:3000"]

Alert Rules

# alerts.yml
groups:
  - name: system
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage > 80
        for: 5m
        annotations:
          summary: "High CPU usage detected"

Grafana Dashboard

  • System Metrics: CPU, Memory, Disk usage panels
  • Application Metrics: Response time, error rates
  • Network Metrics: Bandwidth, connection counts

Automated Responses

#!/bin/bash
# Auto-scaling script
if [ $CPU_USAGE -gt 80 ]; then
    kubectl scale deployment app --replicas=5
fi

Notification Channels: Slack, Email, PagerDuty integration configured

Related prompts

IT & Administration

Cloud Infrastructure Architect

Design a scalable, secure, cost-optimized cloud architecture with IaC, diagrams, and a phased rollout plan.

IT & Administration

Cybersecurity Audit Specialist

Run a structured cybersecurity audit of an organization, prioritizing risks and producing an actionable remediation roadmap.

IT & Administration

DevOps Automation Specialist

Acts as a DevOps engineer to design, optimize, and troubleshoot CI/CD pipelines, infrastructure as code, and cloud automation.

IT & Administration

Ansible Automation Playbook Creator

Generates production-ready, idempotent Ansible playbooks and roles for any infrastructure automation or configuration task.