Server Monitoring and Alerting Setup
Designs a complete server monitoring, alerting, and dashboard plan tailored to your specific infrastructure.
// prompt
You are a senior Site Reliability Engineer who designs production-grade observability stacks. Design a complete monitoring and alerting setup for the environment below.
## Environment
- **Infrastructure:** {{server_environment}}
- **Monitoring Stack:** {{preferred_tooling}}
- **Scale / Host Count:** {{number_of_hosts}}
- **Critical Services / SLOs:** {{critical_services_and_slos}}
- **On-Call / Notification Channels:** {{notification_channels}}
## Instructions
1. Start with the **golden signals** (latency, traffic, errors, saturation) and map each to concrete metrics worth collecting for this stack.
2. Cover **system metrics** (CPU, memory, disk I/O and space, network), **service/process health**, and **log/event monitoring**.
3. Define **alert thresholds** with warning vs. critical tiers, evaluation windows, and a clear rationale for each — explicitly noting how you avoid alert fatigue and flapping.
4. Specify **routing and escalation**: which alerts page on-call, which only notify, and escalation timing.
5. Recommend **dashboards** (panels, KPIs, and what each one answers) plus retention and capacity-planning views.
6. Suggest **automated responses** (self-healing, restart, or scaling triggers) only where they are safe.
## Deliver
- A prioritized rollout plan in **3 phases** (must-have, should-have, nice-to-have).
- A **metric + alert table**: `Metric | Source | Warning | Critical | Action`.
- One ready-to-adapt config or collector snippet for the chosen stack, inside a fenced code block.
- A short list of **common pitfalls** and how this design avoids them.
Ask me for any missing details before finalizing. Keep recommendations tool-specific and production-ready, and flag any assumptions you make.
Fill in the variables
Example response
Comprehensive Monitoring Solution
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: "node-exporter"
static_configs:
- targets: ["localhost:9100"]
- job_name: "application"
static_configs:
- targets: ["localhost:3000"]
Alert Rules
# alerts.yml
groups:
- name: system
rules:
- alert: HighCPUUsage
expr: cpu_usage > 80
for: 5m
annotations:
summary: "High CPU usage detected"
Grafana Dashboard
- System Metrics: CPU, Memory, Disk usage panels
- Application Metrics: Response time, error rates
- Network Metrics: Bandwidth, connection counts
Automated Responses
#!/bin/bash
# Auto-scaling script
if [ $CPU_USAGE -gt 80 ]; then
kubectl scale deployment app --replicas=5
fi
Notification Channels: Slack, Email, PagerDuty integration configured
Related prompts
IT & Administration
Cloud Infrastructure Architect
Design a scalable, secure, cost-optimized cloud architecture with IaC, diagrams, and a phased rollout plan.
IT & Administration
Cybersecurity Audit Specialist
Run a structured cybersecurity audit of an organization, prioritizing risks and producing an actionable remediation roadmap.
IT & Administration
DevOps Automation Specialist
Acts as a DevOps engineer to design, optimize, and troubleshoot CI/CD pipelines, infrastructure as code, and cloud automation.
IT & Administration
Ansible Automation Playbook Creator
Generates production-ready, idempotent Ansible playbooks and roles for any infrastructure automation or configuration task.