System Monitoring and Auto-Recovery
1.1 Overview
1.2 Objectives of System Monitoring and Auto-Recovery
Track performance, detect anomalies, and capture metrics for real-time status visibility.
Automatically respond to incidents through preconfigured rules, reducing downtime.
Send immediate alerts to relevant team members in the event of critical failures.
Implement auto-recovery actions for commonly encountered issues to maintain system continuity.
1.3 Tools for System Monitoring and Auto-Recovery
A monitoring and alerting toolkit used to collect metrics and generate alerts.
Provides visualizations and dashboards for Prometheus metrics.
Manages logs and offers search and visualization capabilities.
Provides cloud-specific monitoring for AWS or GCP environments, allowing infrastructure and application tracking.
Built-in readiness and liveness probes for automatic recovery of services.
Infrastructure as code tools to restore system state in the event of hardware or configuration failures.
1.4 Monitoring Code (Prometheus and Grafana)
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'capsurelabs_app'
static_configs:
- targets: ['localhost:8080']
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
rule_files:
- "alert_rules.yml"# alert_rules.yml
groups:
- name: cpu_alert
rules:
- alert: HighCPUUsage
expr: avg(rate(container_cpu_usage_seconds_total[5m])) > 0.8
for: 5m
labels:
severity: "critical"
annotations:
summary: "High CPU Usage Detected"
description: "CPU usage is above 80% for the past 5 minutes."1.5 Auto-Recovery Mechanisms
1.5.1 Kubernetes Liveness and Readiness Probes
# deployment.yaml
spec:
containers:
- name: capsurelabs-service
image: capsurelabs/service-image:latest
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
readinessProbe:
httpGet:
path: /readiness
port: 8080
initialDelaySeconds: 5
periodSeconds: 101.5.2 Ansible for Automated Recovery
# ansible-playbook.yml
- name: Recover Database
hosts: database_servers
tasks:
- name: Check Database Health
command: "pg_isready"
register: db_status
ignore_errors: yes
- name: Restart Database
service:
name: postgresql
state: restarted
when: db_status.rc != 01.5.3 AWS Auto-Recovery for EC2 Instances
{
"AlarmName": "Instance-HealthCheck-Failure",
"AlarmDescription": "Triggers recovery if instance is unhealthy",
"AWSAccountId": "YOUR_AWS_ACCOUNT_ID",
"MetricName": "StatusCheckFailed",
"Namespace": "AWS/EC2",
"Statistic": "Average",
"Period": 300,
"EvaluationPeriods": 2,
"Threshold": 1,
"ComparisonOperator": "GreaterThanOrEqualToThreshold",
"ActionsEnabled": true,
"AlarmActions": [
"arn:aws:automate:YOUR_REGION:ec2:recover"
],
"Dimensions": [
{
"Name": "InstanceId",
"Value": "YOUR_INSTANCE_ID"
}
]
}Last updated
