System Monitoring and Auto-Recovery
1.1 Overview
1.2 Objectives of System Monitoring and Auto-Recovery
Track performance, detect anomalies, and capture metrics for real-time status visibility.
1.3 Tools for System Monitoring and Auto-Recovery
A monitoring and alerting toolkit used to collect metrics and generate alerts.
1.4 Monitoring Code (Prometheus and Grafana)
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'capsurelabs_app'
static_configs:
- targets: ['localhost:8080']
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
rule_files:
- "alert_rules.yml"
# alert_rules.yml
groups:
- name: cpu_alert
rules:
- alert: HighCPUUsage
expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
for: 5m
labels:
severity: "critical"
annotations:
summary: "High CPU Usage Detected"
description: "CPU usage is above 80% for the past 5 minutes."
1.5 Auto-Recovery Mechanisms
1.5.1 Kubernetes Liveness and Readiness Probes
# deployment.yaml
spec:
containers:
- name: capsurelabs-service
image: capsurelabs/service-image:latest
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
httpGet:
path: /readiness
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
1.5.2 Ansible for Automated Recovery
# ansible-playbook.yml
- name: Recover Database
hosts: database_servers
tasks:
- name: Check Database Health
command: "pg_isready"
register: db_status
ignore_errors: yes
- name: Restart Database
service:
name: postgresql
state: restarted
when: db_status.rc != 0
1.5.3 AWS Auto-Recovery for EC2 Instances
{
"AlarmName": "Instance-HealthCheck-Failure",
"AlarmDescription": "Triggers recovery if instance is unhealthy",
"AWSAccountId": "YOUR_AWS_ACCOUNT_ID",
"MetricName": "StatusCheckFailed",
"Namespace": "AWS/EC2",
"Statistic": "Average",
"Period": 300,
"EvaluationPeriods": 2,
"Threshold": 1,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"ActionsEnabled": true,
"AlarmActions": [
"arn:aws:automate:YOUR_REGION:ec2:recover"
],
"Dimensions": [
{
"Name": "InstanceId",
"Value": "YOUR_INSTANCE_ID"
}
]
}
Last updated