Monitoring
This page describes how to monitor Infracast deployments, including health checks, CloudWatch metrics, key indicators to watch, and alerting recommendations.
Health Check Endpoint
All Infracast API servers expose a health check endpoint:
GET /healthz
Response when healthy:
{
  "status": "ok",
  "db_connected": true
}
Response when degraded (HTTP 503):
{
  "status": "degraded",
  "db_connected": false
}
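A monitoring script can combine the status code and the JSON body shown above into a single verdict. The following is a minimal Python sketch, not part of Infracast: the function name `classify_health` and the extra `unknown` category (for responses that match neither documented shape) are assumptions made for illustration.

```python
import json


def classify_health(status_code: int, body: str) -> str:
    """Map a /healthz response to 'healthy', 'degraded', or 'unknown'.

    'unknown' is a hypothetical category for anything that does not match
    the two documented response shapes (e.g. a proxy error page).
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"

    if (
        status_code == 200
        and payload.get("status") == "ok"
        and payload.get("db_connected") is True
    ):
        return "healthy"
    if status_code == 503 or payload.get("status") == "degraded":
        return "degraded"
    return "unknown"


print(classify_health(200, '{"status": "ok", "db_connected": true}'))
print(classify_health(503, '{"status": "degraded", "db_connected": false}'))
```

Treating unexpected responses as `unknown` rather than `degraded` keeps load balancer or proxy failures distinguishable from a database outage reported by the API itself.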
Use this endpoint for:
- ALB target group health checks (configured automatically by Terraform)
- Docker healthcheck in Compose
- Uptime monitoring (Pingdom, UptimeRobot, etc.)
- Kubernetes liveness/readiness probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
CloudWatch (AWS / Terraform Deployments)
The Terraform module creates CloudWatch log groups and metric alarms automatically.
Log Groups
| Log Group | Content | Retention |
|---|---|---|
| /ecs/vulcan-{env} | API server application logs | 30 days |
| /ecs/vulcan-{env}/plugins | Plugin subprocess output | 7 days |
| RDSOSMetrics | RDS Enhanced Monitoring | 30 days |
Viewing Logs
# Tail live logs
aws logs tail /ecs/vulcan-prod --follow
# Filter for errors in the last hour (log levels are lowercase JSON fields)
# GNU date syntax; on macOS/BSD use $(date -v-1H +%s000) instead
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.level = "error" }' \
  --start-time $(date -d "1 hour ago" +%s000)
# Filter for a specific job ID
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.job_id = "abc123" }'
CloudWatch Metrics — ECS
| Metric | Namespace | Recommended Alarm |
|---|---|---|
| CPUUtilization | AWS/ECS | Alert at >85% for 5 min |
| MemoryUtilization | AWS/ECS | Alert at >90% for 5 min |
| RunningTaskCount | AWS/ECS | Alert if < desired count |
CloudWatch Metrics — ALB
| Metric | Recommended Alarm |
|---|---|
| TargetResponseTime | Alert at P95 > 2 seconds |
| HTTPCode_Target_5XX_Count | Alert if > 10 in 5 min |
| HealthyHostCount | Alert if < 1 |
| UnHealthyHostCount | Alert if > 0 for 2 min |
CloudWatch Metrics — RDS
| Metric | Recommended Alarm |
|---|---|
| CPUUtilization | Alert at >80% for 10 min |
| DatabaseConnections | Alert at >80% of max |
| FreeStorageSpace | Alert if < 10 GB |
| ReadLatency / WriteLatency | Alert if P99 > 100ms |
| ReplicaLag | Alert if > 30 seconds (Multi-AZ) |
Creating Alarms with Terraform
resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "vulcan-${var.environment}-ecs-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  # One 300-second period: alarm when average CPU stays above 85% for 5 minutes,
  # matching the ECS recommendation in the table above.
  evaluation_periods  = 1
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300
  statistic           = "Average"
  threshold           = 85

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.vulcan.name
  }

  # Notify the alerts SNS topic when the alarm fires
  alarm_actions = [aws_sns_topic.alerts.arn]
}
Docker / Self-Hosted Monitoring
Docker Health Check
services:
  vulcan-api:
    image: ghcr.io/azgardtek/vulcan:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s
Container Metrics with Prometheus
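Infracast API logs are structured JSON, compatible with most log shippers. For container-level metrics (CPU, memory, network), run cAdvisor alongside Prometheus. The Compose services below expose cAdvisor on host port 8085 and Prometheus on port 9090: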
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8085:8080"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
Log Format
Infracast outputs structured JSON logs. Each line is a self-contained JSON object:
{
  "level": "info",
  "time": "2026-04-16T12:00:00Z",
  "caller": "api/server.go:142",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/v1/tenants/abc/nodes",
  "status": 200,
  "duration_ms": 42,
  "tenant_id": "abc",
  "user_id": "user-123",
  "trace_id": "0abc123def"
}
Key Log Fields
| Field | Description |
|---|---|
| level | debug, info, warn, error |
| tenant_id | Tenant context (filter by this for tenant-specific issues) |
| job_id | Discovery job ID |
| plugin | Plugin name (for plugin-related logs) |
| duration_ms | Request or operation duration |
| error | Error message (only on error/warn) |
| trace_id | Distributed trace ID |
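Because each log line is a self-contained JSON object, the fields above can be filtered with a few lines of code. The following Python sketch is illustrative only: the helper names `parse_log_lines` and `filter_records`, the severity ordering, and the sample lines are assumptions for this example, not part of Infracast.

```python
import json
from typing import Iterable, Iterator, Optional

# Assumed severity order, based on the documented level values.
LEVEL_ORDER = {"debug": 0, "info": 1, "warn": 2, "error": 3}


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSON object line; skip blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def filter_records(
    records: Iterable[dict],
    *,
    tenant_id: Optional[str] = None,
    min_level: str = "debug",
) -> Iterator[dict]:
    """Keep records for one tenant (if given) at or above min_level."""
    threshold = LEVEL_ORDER[min_level]
    for record in records:
        if tenant_id is not None and record.get("tenant_id") != tenant_id:
            continue
        level = record.get("level")
        rank = LEVEL_ORDER.get(level, -1) if isinstance(level, str) else -1
        if rank >= threshold:
            yield record


# Hypothetical sample lines in the documented format.
SAMPLE = [
    '{"level": "info", "msg": "request completed", "tenant_id": "abc", "duration_ms": 42}',
    '{"level": "error", "msg": "plugin failed", "tenant_id": "abc", "error": "timeout"}',
    '{"level": "error", "msg": "plugin failed", "tenant_id": "xyz", "error": "timeout"}',
    "not json",
]

matches = list(filter_records(parse_log_lines(SAMPLE), tenant_id="abc", min_level="warn"))
print(matches)
```

In practice the lines would come from a file or from `docker logs` output piped to standard input rather than from a hard-coded list.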
Common Log Queries (CloudWatch Insights)
# Error count by path
fields @timestamp, @message
| filter level = "error"
| stats count(*) as errors by path
| sort errors desc
| limit 20
# Slow requests (>1s)
fields @timestamp, method, path, duration_ms, tenant_id
| filter duration_ms > 1000
| sort duration_ms desc
# Plugin failures
fields @timestamp, plugin, job_id, @message
| filter level = "error" and ispresent(plugin)
| sort @timestamp desc
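For self-hosted deployments without CloudWatch, roughly the same questions can be answered from parsed log records. The sketch below mirrors the first two queries in plain Python; the function names and sample records are assumptions for illustration, and it expects records that were already decoded from the JSON log lines.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def error_count_by_path(records: Iterable[dict], limit: int = 20) -> List[Tuple[str, int]]:
    """Count error-level records per path, most frequent first."""
    counts = Counter(
        str(r.get("path", "(no path)")) for r in records if r.get("level") == "error"
    )
    return counts.most_common(limit)


def slow_requests(records: Iterable[dict], threshold_ms: float = 1000) -> List[dict]:
    """Return records slower than threshold_ms, slowest first."""
    slow = [
        r
        for r in records
        if isinstance(r.get("duration_ms"), (int, float)) and r["duration_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda r: r["duration_ms"], reverse=True)


# Hypothetical records for demonstration.
RECORDS = [
    {"level": "error", "path": "/api/v1/a", "duration_ms": 1500},
    {"level": "error", "path": "/api/v1/a", "duration_ms": 20},
    {"level": "error", "path": "/api/v1/b", "duration_ms": 3000},
    {"level": "info", "path": "/api/v1/a", "duration_ms": 900},
]

print(error_count_by_path(RECORDS))
print([r["duration_ms"] for r in slow_requests(RECORDS)])
```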
Key Metrics to Watch
Application Health
| Indicator | Healthy | Warning | Critical |
|---|---|---|---|
| /healthz | status: ok | n/a | db_connected: false |
| API P95 latency | < 500ms | 500ms–2s | > 2s |
| 5xx error rate | < 0.1% | 0.1–1% | > 1% |
| ECS running tasks | = desired | < desired | 0 |
Database
| Indicator | Healthy | Warning | Critical |
|---|---|---|---|
| DB connections | < 70% max | 70–85% max | > 85% max |
| Free storage | > 20 GB | 10–20 GB | < 10 GB |
| CPU utilization | < 50% | 50–80% | > 80% |
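The thresholds in the Application Health and Database tables can be encoded as data so that a script or dashboard applies them consistently. This is a minimal Python sketch under stated assumptions: the `Threshold` class and the metric keys are hypothetical names, and values that sit exactly on a boundary (for example 500 ms) are treated as warnings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as free storage

    def classify(self, value: float) -> str:
        """Return 'healthy', 'warning', or 'critical' for a measured value."""
        if self.higher_is_worse:
            if value > self.critical:
                return "critical"
            return "warning" if value >= self.warning else "healthy"
        if value < self.critical:
            return "critical"
        return "warning" if value <= self.warning else "healthy"


# Values copied from the tables above; the key names are illustrative.
THRESHOLDS = {
    "api_p95_latency_ms": Threshold(warning=500, critical=2000),
    "error_5xx_rate_pct": Threshold(warning=0.1, critical=1.0),
    "db_connections_pct_of_max": Threshold(warning=70, critical=85),
    "db_free_storage_gb": Threshold(warning=20, critical=10, higher_is_worse=False),
    "db_cpu_pct": Threshold(warning=50, critical=80),
}

observed = {"api_p95_latency_ms": 640, "db_free_storage_gb": 8, "db_cpu_pct": 35}
for name, value in observed.items():
    print(name, THRESHOLDS[name].classify(value))
```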
Discovery Jobs
Watch for discovery jobs that stop completing, which often indicates credential rotation issues or network changes. The following request lists jobs that have been running for more than two hours:
# Check for stuck jobs via API
curl -H "Authorization: Bearer $TOKEN" \
"https://api.infracast.io/api/v1/tenants/$TENANT/jobs?status=running&older_than=2h"
Uptime Monitoring
Configure an external uptime monitor for the health endpoint:
URL:      https://api.infracast.io/healthz (SaaS)
          https://your-domain.com/healthz (self-hosted)
Method:   GET
Interval: 60 seconds
Expected: HTTP 200, body contains "ok"
Alert:    After 2 consecutive failures
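The "alert after 2 consecutive failures" rule avoids paging on a single dropped request. Hosted uptime monitors implement this internally; for a custom checker, the logic can be sketched as a small state machine. The class name `ConsecutiveFailureAlert` and the `alert` / `recovered` event names below are assumptions for illustration.

```python
from typing import Optional


class ConsecutiveFailureAlert:
    """Emit 'alert' after N consecutive failed checks and 'recovered' afterwards."""

    def __init__(self, threshold: int = 2) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def record(self, status_code: int, body: str) -> Optional[str]:
        """Record one check result; return an event name or None."""
        # A check passes when it matches the expectation above:
        # HTTP 200 and a body containing "ok".
        if status_code == 200 and "ok" in body:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return None

        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "alert"
        return None


monitor = ConsecutiveFailureAlert(threshold=2)
checks = [
    (200, '{"status": "ok"}'),
    (503, '{"status": "degraded"}'),
    (503, '{"status": "degraded"}'),
    (503, '{"status": "degraded"}'),
    (200, '{"status": "ok"}'),
]
events = [monitor.record(code, body) for code, body in checks]
print(events)  # [None, None, 'alert', None, 'recovered']
```

With the 60-second interval above, a threshold of 2 means an outage is reported roughly one to two minutes after it begins.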
Alerting Recommendations
| Alert | Trigger | Channel | Priority |
|---|---|---|---|
| API unhealthy | /healthz != 200 for 2 min | PagerDuty/OpsGenie | P1 |
| ECS task count < desired | 2+ minutes | Slack/email | P2 |
| RDS storage < 10 GB | Immediate | PagerDuty | P2 |
| 5xx rate > 1% | 5 minutes | Slack | P2 |
| Database connections > 85% | 10 minutes | Slack | P3 |
| API P95 > 2 seconds | 5 minutes | Slack | P3 |
Configure Infracast's built-in notification system (Settings → Notifications) to alert on discovery job failures. This catches credential and network issues faster than infrastructure-level monitoring.