Monitoring

This page describes how to monitor Infracast deployments, including health checks, CloudWatch metrics, key indicators to watch, and alerting recommendations.

Health Check Endpoint

All Infracast API servers expose a health check endpoint:

GET /healthz

Response when healthy:

{
  "status": "ok",
  "db_connected": true
}

Response when degraded (HTTP 503):

{
  "status": "degraded",
  "db_connected": false
}
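
A client can map the two responses above onto a simple state. The following is a minimal Python sketch using only the standard library; `interpret_health` and `fetch_health` are hypothetical helper names, and it assumes the body always carries the two fields shown above.

```python
import json
import urllib.error
import urllib.request


def interpret_health(status_code: int, body: str) -> str:
    """Map a /healthz response to 'ok', 'degraded', or 'unknown'."""
    try:
        payload = json.loads(body)
    except ValueError:
        return "unknown"
    if not isinstance(payload, dict):
        return "unknown"
    if (
        status_code == 200
        and payload.get("status") == "ok"
        and payload.get("db_connected") is True
    ):
        return "ok"
    if status_code == 503 or payload.get("status") == "degraded":
        return "degraded"
    return "unknown"


def fetch_health(base_url: str, timeout: float = 5.0) -> str:
    """GET <base_url>/healthz and classify the result."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return interpret_health(resp.status, body)
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the 503 response still carries the JSON body.
        body = err.read().decode("utf-8", errors="replace")
        return interpret_health(err.code, body)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return "unknown"
```

Calling `fetch_health("http://localhost:8080")` against a running API server would return one of the three strings; treating `"unknown"` the same as `"degraded"` is a reasonable default for alerting.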

Use this endpoint for:

  • ALB target group health checks (configured automatically by Terraform)
  • Docker healthcheck in Compose
  • Uptime monitoring (Pingdom, UptimeRobot, etc.)
  • Kubernetes liveness/readiness probes

Kubernetes probe example

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

CloudWatch (AWS / Terraform Deployments)

The Terraform module creates the CloudWatch log groups and metric alarms automatically.

Log Groups

Log Group                    Content                        Retention
/ecs/vulcan-{env}            API server application logs    30 days
/ecs/vulcan-{env}/plugins    Plugin subprocess output       7 days
RDSOSMetrics                 RDS Enhanced Monitoring        30 days

Viewing Logs

# Tail live logs
aws logs tail /ecs/vulcan-prod --follow

# Filter for errors in the last hour
# (logs are JSON with a lowercase level field, so match on that field)
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.level = "error" }' \
  --start-time "$(date -d '1 hour ago' +%s000)"

# Filter for a specific job ID
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.job_id = "abc123" }'
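
The `--start-time` option expects milliseconds since the Unix epoch, and `date -d` is GNU-specific syntax. Where GNU `date` is not available, the same value can be computed in Python. This is a small sketch; `start_time_ms` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def start_time_ms(hours_ago: float, now: Optional[datetime] = None) -> int:
    """Epoch milliseconds for the instant `hours_ago` hours before `now` (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    earlier = now - timedelta(hours=hours_ago)
    return int(earlier.timestamp() * 1000)
```

Printing `start_time_ms(1)` from a script and passing the result as `--start-time` gives the same one-hour window as the shell example.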

CloudWatch Metrics — ECS

Metric               Namespace                                            Recommended Alarm
CPUUtilization       AWS/ECS                                              Alert at >85% for 5 min
MemoryUtilization    AWS/ECS                                              Alert at >90% for 5 min
RunningTaskCount     ECS/ContainerInsights (requires Container Insights)  Alert if < desired count

CloudWatch Metrics — ALB

Metric                       Recommended Alarm
TargetResponseTime           Alert at P95 > 2 seconds
HTTPCode_Target_5XX_Count    Alert if > 10 in 5 min
HealthyHostCount             Alert if < 1
UnHealthyHostCount           Alert if > 0 for 2 min

CloudWatch Metrics — RDS

Metric                        Recommended Alarm
CPUUtilization                Alert at >80% for 10 min
DatabaseConnections           Alert at >80% of max
FreeStorageSpace              Alert if < 10 GB
ReadLatency / WriteLatency    Alert if P99 > 100 ms
ReplicaLag                    Alert if > 30 seconds (Multi-AZ)

Creating Alarms with Terraform

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "vulcan-${var.environment}-ecs-cpu-high"
  comparison_operator = "GreaterThanThreshold"
  # 2 periods x 300 s: fires after 10 minutes above the threshold.
  # Set evaluation_periods = 1 for the 5-minute window in the ECS table.
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300
  statistic           = "Average"
  threshold           = 85

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.vulcan.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
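
To see how `period`, `evaluation_periods`, and `threshold` interact in the alarm above, the following Python sketch models the comparison over a list of per-period averages. It is a simplified illustration, not CloudWatch's implementation: it ignores missing-data handling and the datapoints-to-alarm setting, and `alarm_state` is a hypothetical name.

```python
from typing import Sequence


def alarm_state(
    period_averages: Sequence[float],
    threshold: float,
    evaluation_periods: int,
) -> str:
    """Model a GreaterThanThreshold alarm over the most recent periods."""
    if evaluation_periods < 1:
        raise ValueError("evaluation_periods must be at least 1")
    if len(period_averages) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = period_averages[-evaluation_periods:]
    # Every one of the most recent periods must be strictly above the threshold.
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

With a 300-second period, each list element stands for one 5-minute average, so `evaluation_periods` multiplied by the period gives the time the metric must stay above the threshold before the alarm fires.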

Docker / Self-Hosted Monitoring

Docker Health Check

# docker-compose.yml
services:
  vulcan-api:
    image: ghcr.io/azgardtek/vulcan:latest
    healthcheck:
      # Requires curl to be present in the image
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s

Container Metrics with Prometheus

Infracast API logs are structured JSON, compatible with most log shippers. For container-level metrics (CPU, memory, network), run cAdvisor alongside Prometheus and have Prometheus scrape it:

# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8085:8080"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

Log Format

Infracast outputs structured JSON logs. Each line is a self-contained JSON object (pretty-printed here for readability):

{
  "level": "info",
  "time": "2026-04-16T12:00:00Z",
  "caller": "api/server.go:142",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/v1/tenants/abc/nodes",
  "status": 200,
  "duration_ms": 42,
  "tenant_id": "abc",
  "user_id": "user-123",
  "trace_id": "0abc123def"
}

Key Log Fields

Field          Description
level          debug, info, warn, error
tenant_id      Tenant context (filter by this for tenant-specific issues)
job_id         Discovery job ID
plugin         Plugin name (for plugin-related logs)
duration_ms    Request or operation duration
error          Error message (only on error/warn)
trace_id       Distributed trace ID
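
Because every log line is a standalone JSON object, the fields above can be filtered with a few lines of code. The sketch below is one way to do that in Python; `parse_log_line` and `filter_records` are hypothetical helper names, and lines that are not valid JSON (for example, startup banners) are skipped rather than treated as errors.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def parse_log_line(line: str) -> Optional[Dict[str, Any]]:
    """Parse one log line; return None for blank or non-JSON lines."""
    line = line.strip()
    if not line:
        return None
    try:
        record = json.loads(line)
    except ValueError:
        return None
    return record if isinstance(record, dict) else None


def filter_records(lines: Iterable[str], **fields: Any) -> Iterator[Dict[str, Any]]:
    """Yield parsed records whose fields equal all of the given values."""
    for line in lines:
        record = parse_log_line(line)
        if record is None:
            continue
        if all(record.get(key) == value for key, value in fields.items()):
            yield record
```

For example, `filter_records(sys.stdin, tenant_id="abc", level="error")` would yield only the error records for tenant `abc` from piped log output.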

Common Log Queries (CloudWatch Insights)

# Error rate by path
fields @timestamp, @message
| filter level = "error"
| stats count(*) as errors by path
| sort errors desc
| limit 20

# Slow requests (>1s)
fields @timestamp, method, path, duration_ms, tenant_id
| filter duration_ms > 1000
| sort duration_ms desc

# Plugin failures
fields @timestamp, plugin, job_id, @message
| filter level = "error" and ispresent(plugin)
| sort @timestamp desc
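
For self-hosted deployments where CloudWatch Logs Insights is not available, the first two queries can be approximated over already-parsed log records. This is a rough Python equivalent, not a replacement for Insights; the function names are hypothetical, and records without a `path` are grouped under `"(none)"` as an assumption.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def errors_by_path(records: Iterable[Dict[str, Any]], limit: int = 20) -> List[Tuple[str, int]]:
    """Count error-level records per path, most frequent first."""
    counts = Counter(
        record.get("path", "(none)")
        for record in records
        if record.get("level") == "error"
    )
    return counts.most_common(limit)


def slow_requests(records: Iterable[Dict[str, Any]], threshold_ms: float = 1000) -> List[Dict[str, Any]]:
    """Return records slower than threshold_ms, slowest first."""
    slow = [
        record
        for record in records
        if isinstance(record.get("duration_ms"), (int, float))
        and record["duration_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda record: record["duration_ms"], reverse=True)
```

Both functions take an iterable of dictionaries, one per log line, so they can be fed from any JSON log source.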

Key Metrics to Watch

Application Health

Indicator            Healthy       Warning      Critical
/healthz             status: ok    -            db_connected: false
API P95 latency      < 500ms       500ms–2s     > 2s
5xx error rate       < 0.1%        0.1–1%       > 1%
ECS running tasks    = desired     < desired    0

Database

Indicator          Healthy      Warning       Critical
DB connections     < 70% max    70–85% max    > 85% max
Free storage       > 20 GB      10–20 GB      < 10 GB
CPU utilization    < 50%        50–80%        > 80%
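
The numeric bands in the tables above translate directly into a lookup. The following Python sketch classifies a measurement as healthy, warning, or critical; the indicator keys and the `classify` function are hypothetical names chosen for illustration, while the boundary values come from the tables.

```python
from typing import Dict, Tuple

# (warning_bound, critical_bound, higher_is_worse), taken from the tables above.
# The dictionary keys are illustrative names, not Infracast identifiers.
THRESHOLDS: Dict[str, Tuple[float, float, bool]] = {
    "api_p95_latency_ms": (500, 2000, True),
    "error_rate_5xx_pct": (0.1, 1.0, True),
    "db_connections_pct_of_max": (70, 85, True),
    "db_cpu_pct": (50, 80, True),
    "db_free_storage_gb": (20, 10, False),
}


def classify(indicator: str, value: float) -> str:
    """Return 'healthy', 'warning', or 'critical' for one measurement."""
    warning, critical, higher_is_worse = THRESHOLDS[indicator]
    if not higher_is_worse:
        # Negate so the same comparisons work when lower values are worse.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "healthy"
```

Values that fall exactly on a boundary are treated as warnings, which matches the inclusive ranges in the Warning column.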

Discovery Jobs

Watch for discovery jobs that stop completing, which often indicates a credential rotation issue or a network change. To list jobs that have been running for more than two hours:

# Check for stuck jobs via API
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.infracast.io/api/v1/tenants/$TENANT/jobs?status=running&older_than=2h"

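The same check can be scripted for a scheduled job. The sketch below only builds the request shown in the curl example, using Python's standard library; `stuck_jobs_request` is a hypothetical helper name, and because the response schema is not described on this page, the sketch stops short of parsing the result.

```python
import urllib.parse
import urllib.request


def stuck_jobs_request(
    base_url: str,
    tenant: str,
    token: str,
    older_than: str = "2h",
) -> urllib.request.Request:
    """Build the GET request for running jobs older than `older_than`."""
    query = urllib.parse.urlencode({"status": "running", "older_than": older_than})
    # Percent-encode the tenant so unusual IDs cannot alter the URL path.
    tenant_segment = urllib.parse.quote(tenant, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/tenants/{tenant_segment}/jobs?{query}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Passing the returned object to `urllib.request.urlopen(request, timeout=10)` sends the request; what counts as "stuck" in the response is left to the caller.
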
Uptime Monitoring

Configure an external uptime monitor for the health endpoint:

URL:       https://api.infracast.io/healthz (SaaS)
           https://your-domain.com/healthz (self-hosted)
Method:    GET
Interval:  60 seconds
Expected:  HTTP 200, body contains "ok"
Alert:     After 2 consecutive failures
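
For a home-grown monitor, the "alert after 2 consecutive failures" rule can be sketched as a small state machine. This is an illustrative Python example, not part of Infracast; `ConsecutiveFailureAlert` and `is_healthy` are hypothetical names. It fires once when the threshold is reached and stays quiet until the endpoint recovers.

```python
class ConsecutiveFailureAlert:
    """Fire one alert after `threshold` consecutive failed checks."""

    def __init__(self, threshold: int = 2) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0
        self.alerting = False

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True only when an alert should fire."""
        if healthy:
            # Any success resets the streak and re-arms the alert.
            self.failures = 0
            self.alerting = False
            return False
        self.failures += 1
        if self.failures >= self.threshold and not self.alerting:
            self.alerting = True
            return True
        return False


def is_healthy(status_code: int, body: str) -> bool:
    """Apply the expectation above: HTTP 200 and a body containing "ok"."""
    return status_code == 200 and "ok" in body
```

Requiring two failures in a row filters out single dropped requests at the cost of one extra check interval (60 seconds here) before the alert fires.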

Alerting Recommendations

Alert                         Trigger                      Channel               Priority
API unhealthy                 /healthz != 200 for 2 min    PagerDuty/OpsGenie    P1
ECS task count < desired      2+ minutes                   Slack/email           P2
RDS storage < 10 GB           Immediate                    PagerDuty             P2
5xx rate > 1%                 5 minutes                    Slack                 P2
Database connections > 85%    10 minutes                   Slack                 P3
API P95 > 2 seconds           5 minutes                    Slack                 P3

Discovery Job Alerting

Configure Infracast's built-in notification system (Settings → Notifications) to alert on discovery job failures. This catches credential and network issues faster than infrastructure-level monitoring.