Monitoring

This page describes how to monitor Infracast deployments, including health checks, CloudWatch metrics, key indicators to watch, and alerting recommendations.

Health Check Endpoint

All Infracast API servers expose a health check endpoint:

GET /healthz

Response when healthy (HTTP 200):

{
  "status": "ok",
  "db_connected": true
}

Response when degraded (HTTP 503):

{
  "status": "degraded",
  "db_connected": false
}
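
A monitoring script can fold the HTTP status code and the JSON body into a single verdict. The following is a minimal Python sketch that assumes only the two fields shown above; `HealthResult` and `interpret_health` are illustrative names, not part of Infracast.

```python
import json
from dataclasses import dataclass


@dataclass
class HealthResult:
    healthy: bool
    status: str
    db_connected: bool


def interpret_health(status_code: int, body: str) -> HealthResult:
    """Combine the HTTP status code and the /healthz JSON body into one verdict."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    if not isinstance(payload, dict):
        # An unparseable or non-object body is treated as unhealthy.
        return HealthResult(False, "unknown", False)
    status = str(payload.get("status", "unknown"))
    db_connected = payload.get("db_connected") is True
    # Healthy only when the status code, status field, and DB flag all agree.
    healthy = status_code == 200 and status == "ok" and db_connected
    return HealthResult(healthy, status, db_connected)
```

Requiring all three signals to agree is a deliberately strict choice; a monitor that only cares about availability could check the status code alone.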

Use this endpoint for:

ALB target group health checks (configured automatically by Terraform)
Docker healthcheck in Compose
Uptime monitoring (Pingdom, UptimeRobot, etc.)
Kubernetes liveness/readiness probes

Kubernetes probe example
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 30

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

CloudWatch (AWS / Terraform Deployments)

The Terraform module creates a CloudWatch log group and metric alarms automatically.

Log Groups

Log Group	Content	Retention
`/ecs/vulcan-{env}`	API server application logs	30 days
`/ecs/vulcan-{env}/plugins`	Plugin subprocess output	7 days
`RDSOSMetrics`	RDS Enhanced Monitoring	30 days

Viewing Logs

# Tail live logs
aws logs tail /ecs/vulcan-prod --follow

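# Filter for errors in the last hour
# (filter patterns are case-sensitive and `level` values are lowercase, so match the JSON field)
# `date -d` is GNU date syntax; on macOS/BSD use: date -v-1H +%s000
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.level = "error" }' \
  --start-time $(date -d "1 hour ago" +%s000)

# Filter for a specific job ID (JSON field match)
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.job_id = "abc123" }'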
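
The `--start-time` option expects milliseconds since the Unix epoch. Where GNU `date` is not available, the value can be computed portably. A minimal Python sketch follows; `epoch_ms_ago` is an illustrative helper name.

```python
import time
from typing import Optional


def epoch_ms_ago(hours: float, now: Optional[float] = None) -> int:
    """Return the Unix time in milliseconds for `hours` before `now`."""
    if now is None:
        now = time.time()
    return int((now - hours * 3600) * 1000)


if __name__ == "__main__":
    # Print a value suitable for --start-time (one hour ago).
    print(epoch_ms_ago(1))
```

Saved under a name of your choosing (for example `start_time.py`, a hypothetical filename), the script can replace the `date` call: `--start-time $(python3 start_time.py)`.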

CloudWatch Metrics — ECS

Metric	Namespace	Recommended Alarm
`CPUUtilization`	`AWS/ECS`	Alert at >85% for 5 min
`MemoryUtilization`	`AWS/ECS`	Alert at >90% for 5 min
`RunningTaskCount`	`AWS/ECS`	Alert if < desired count

CloudWatch Metrics — ALB

Metric	Recommended Alarm
`TargetResponseTime`	Alert at P95 > 2 seconds
`HTTPCode_Target_5XX_Count`	Alert if > 10 in 5 min
`HealthyHostCount`	Alert if < 1
`UnHealthyHostCount`	Alert if > 0 for 2 min

CloudWatch Metrics — RDS

Metric	Recommended Alarm
`CPUUtilization`	Alert at >80% for 10 min
`DatabaseConnections`	Alert at >80% of max
`FreeStorageSpace`	Alert if < 10 GB
`ReadLatency` / `WriteLatency`	Alert if P99 > 100ms
`ReplicaLag`	Alert if > 30 seconds (read replicas and Multi-AZ DB clusters)

Creating Alarms with Terraform

resource "aws_cloudwatch_metric_alarm" "ecs_cpu_high" {
  alarm_name          = "vulcan-${var.environment}-ecs-cpu-high"
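  comparison_operator = "GreaterThanThreshold"
  # 2 periods x 300 s: the alarm fires once average CPU has exceeded the
  # threshold for 10 minutes. Use evaluation_periods = 1 for a 5-minute window.
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/ECS"
  period              = 300 # seconds
  statistic           = "Average"
  threshold           = 85 # percent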

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.vulcan.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
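
To reason about how `period`, `evaluation_periods`, and `threshold` interact, it can help to model the evaluation locally. The sketch below is a simplified Python model, not CloudWatch's actual implementation: it ignores missing-data handling and the optional `datapoints_to_alarm` setting, and `alarm_fires` is an illustrative name.

```python
from typing import Sequence


def alarm_fires(datapoints: Sequence[float], threshold: float,
                evaluation_periods: int) -> bool:
    """Return True if the most recent `evaluation_periods` datapoints all
    exceed `threshold` (GreaterThanThreshold semantics, strictly greater)."""
    if evaluation_periods < 1 or len(datapoints) < evaluation_periods:
        return False
    recent = datapoints[-evaluation_periods:]
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    # Each datapoint stands for one 300-second average of CPUUtilization.
    print(alarm_fires([70, 90, 88], threshold=85, evaluation_periods=2))
```

With the settings above, a single 5-minute spike does not trigger the alarm; two consecutive breaching periods do.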

Docker / Self-Hosted Monitoring

Docker Health Check

docker-compose.yml
services:
  vulcan-api:
    image: ghcr.io/azgardtek/vulcan:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s

Container Metrics with Prometheus

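Infracast API logs are structured JSON, compatible with most log shippers. For container metrics, the example below runs cAdvisor, which reads per-container CPU, memory, and network statistics from the Docker host, alongside Prometheus to scrape and store them. The mounted prometheus.yml must define a scrape job that targets cadvisor:8080: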

docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8085:8080"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

Log Format

Infracast outputs structured JSON logs. Each line is a self-contained JSON object:

{
  "level": "info",
  "time": "2026-04-16T12:00:00Z",
  "caller": "api/server.go:142",
  "msg": "request completed",
  "method": "GET",
  "path": "/api/v1/tenants/abc/nodes",
  "status": 200,
  "duration_ms": 42,
  "tenant_id": "abc",
  "user_id": "user-123",
  "trace_id": "0abc123def"
}

Key Log Fields

Field	Description
`level`	`debug`, `info`, `warn`, `error`
`tenant_id`	Tenant context (filter by this for tenant-specific issues)
`job_id`	Discovery job ID
`plugin`	Plugin name (for plugin-related logs)
`duration_ms`	Request or operation duration
`error`	Error message (only on error/warn)
`trace_id`	Distributed trace ID
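
For self-hosted deployments without CloudWatch, the same fields can be queried locally from a log file or a `docker logs` stream. The following Python sketch assumes one JSON object per line, as described above; the function names are illustrative.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_log_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per valid JSON object line; skip anything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def errors_by_path(records: Iterable[dict]) -> Counter:
    """Count error-level records per request path."""
    return Counter(
        r.get("path", "(no path)") for r in records if r.get("level") == "error"
    )


def slow_requests(records: Iterable[dict], threshold_ms: float = 1000) -> list:
    """Return records whose duration_ms exceeds threshold_ms, slowest first."""
    slow = [
        r for r in records
        if isinstance(r.get("duration_ms"), (int, float))
        and r["duration_ms"] > threshold_ms
    ]
    return sorted(slow, key=lambda r: r["duration_ms"], reverse=True)
```

Non-JSON lines (for example, a stack trace printed by a plugin subprocess) are skipped rather than treated as errors, so a single malformed line does not abort the scan.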

Common Log Queries (CloudWatch Insights)

# Error rate by path
fields @timestamp, @message
| filter level = "error"
| stats count(*) as errors by path
| sort errors desc
| limit 20

# Slow requests (>1s)
fields @timestamp, method, path, duration_ms, tenant_id
| filter duration_ms > 1000
| sort duration_ms desc

# Plugin failures
fields @timestamp, plugin, job_id, @message
| filter level = "error" and ispresent(plugin)
| sort @timestamp desc

Key Metrics to Watch

Application Health

Indicator	Healthy	Warning	Critical
`/healthz`	`status: ok`	—	`db_connected: false`
API P95 latency	< 500ms	500ms–2s	> 2s
5xx error rate	< 0.1%	0.1–1%	> 1%
ECS running tasks	= desired	< desired	0
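
The latency and error-rate thresholds in this table translate directly into code. The Python sketch below is one possible encoding; the function names are illustrative, and treating the boundary values (500 ms, 2 s, 0.1%, 1%) as "warning" is an assumption, since the table does not say which side they fall on.

```python
SEVERITY = ("healthy", "warning", "critical")


def classify_p95_latency(p95_ms: float) -> str:
    """Healthy below 500 ms, warning from 500 ms to 2 s, critical above 2 s."""
    if p95_ms < 500:
        return "healthy"
    if p95_ms <= 2000:
        return "warning"
    return "critical"


def error_rate_percent(errors_5xx: int, total_requests: int) -> float:
    """Return the 5xx rate as a percentage; 0.0 when there was no traffic."""
    if total_requests <= 0:
        return 0.0
    return 100.0 * errors_5xx / total_requests


def classify_error_rate(rate_percent: float) -> str:
    """Healthy below 0.1%, warning from 0.1% to 1%, critical above 1%."""
    if rate_percent < 0.1:
        return "healthy"
    if rate_percent <= 1.0:
        return "warning"
    return "critical"


def overall(*levels: str) -> str:
    """Return the most severe of the given levels."""
    return max(levels, key=SEVERITY.index)
```

For example, 5 failed requests out of 1,000 is a 0.5% error rate, which falls in the warning band even if latency is healthy.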

Database

Indicator	Healthy	Warning	Critical
DB connections	< 70% max	70–85% max	> 85% max
Free storage	> 20 GB	10–20 GB	< 10 GB
CPU utilization	< 50%	50–80%	> 80%

Discovery Jobs

Watch for discovery jobs that stop completing. Jobs that stay in the running state for hours often indicate credential rotation issues or network changes:

# Check for stuck jobs via API
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.infracast.io/api/v1/tenants/$TENANT/jobs?status=running&older_than=2h"

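The same check can be scripted for a scheduled task. The Python sketch below only builds the request, reusing the path and query parameters from the curl command above; the response format is not documented here, so parsing is left out. `build_stuck_jobs_request` is an illustrative name.

```python
import urllib.parse
import urllib.request


def build_stuck_jobs_request(base_url: str, tenant: str, token: str,
                             older_than: str = "2h") -> urllib.request.Request:
    """Build a GET request for jobs that have been running longer than `older_than`."""
    query = urllib.parse.urlencode({"status": "running", "older_than": older_than})
    tenant_path = urllib.parse.quote(tenant, safe="")
    url = f"{base_url}/api/v1/tenants/{tenant_path}/jobs?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    req = build_stuck_jobs_request("https://api.infracast.io", "abc", "example-token")
    print(req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=10)
```
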
Uptime Monitoring

Configure an external uptime monitor for the health endpoint:

URL: https://api.infracast.io/healthz  (SaaS)
     https://your-domain.com/healthz   (self-hosted)
Method: GET
Interval: 60 seconds
Expected: HTTP 200, body contains "ok"
Alert: After 2 consecutive failures
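
For a home-grown monitor, the "alert after 2 consecutive failures" rule can be sketched in Python as follows. The class and function names are illustrative, and firing exactly once per failure streak is a design assumption rather than something the settings above prescribe.

```python
def probe_ok(status_code: int, body: str) -> bool:
    """Apply the expectation above: HTTP 200 and a body containing "ok"."""
    return status_code == 200 and "ok" in body


class ConsecutiveFailureAlert:
    """Track probe results and signal when failures reach a threshold in a row."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when an alert should fire."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        # Fire once, exactly when the streak reaches the threshold.
        return self.failures == self.threshold
```

Requiring two failures in a row filters out single dropped probes, at the cost of delaying the alert by one interval (60 seconds with the settings above).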

Alerting Recommendations

Alert	Trigger	Channel	Priority
API unhealthy	`/healthz` != 200 for 2 min	PagerDuty/OpsGenie	P1
ECS task count < desired	2+ minutes	Slack/email	P2
RDS storage < 10 GB	Immediate	PagerDuty	P2
5xx rate > 1%	5 minutes	Slack	P2
Database connections > 85%	10 minutes	Slack	P3
API P95 > 2 seconds	5 minutes	Slack	P3

Discovery Job Alerting

Configure Infracast's built-in notification system (Settings → Notifications) to alert on discovery job failures. This catches credential and network issues faster than infrastructure-level monitoring.
