Troubleshooting

Common issues and solutions for Infracast deployments.

Quick Diagnostics

Start here before diving into specific issues:

# 1. Check API health
curl https://your-domain.com/healthz

# 2. Check container/service status (Docker)
docker compose ps
docker compose logs vulcan-api --tail=50

# 3. Check ECS service (AWS)
aws ecs describe-services \
--cluster vulcan-prod-cluster \
--services vulcan-prod-service \
--query "services[0].{Status:status,Running:runningCount,Desired:desiredCount}"

# 4. Tail live logs (ECS)
aws logs tail /ecs/vulcan-prod --follow

# 5. Database connectivity
docker compose exec postgres psql -U vulcan -c "SELECT 1"
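
The ECS check in step 3 can be reduced to a one-word verdict. The sketch below is only an illustration: the function name `ecs_service_state` and the state labels are hypothetical, and it assumes the counts arrive as a single "RUNNING DESIRED" line, which is what `--output text` produces for a two-element list query.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify an ECS service from its task counts.
# Reads one line "RUNNING DESIRED" (space or tab separated) on stdin.
ecs_service_state() {
  local running desired
  read -r running desired || true

  # Reject anything that is not a pair of non-negative integers.
  case "$running" in ''|*[!0-9]*) echo "unknown"; return 2 ;; esac
  case "$desired" in ''|*[!0-9]*) echo "unknown"; return 2 ;; esac

  if [ "$desired" -eq 0 ]; then
    echo "stopped"      # service is scaled to zero on purpose
    return 0
  elif [ "$running" -ge "$desired" ]; then
    echo "healthy"
    return 0
  elif [ "$running" -gt 0 ]; then
    echo "degraded"     # some tasks are up, but fewer than desired
    return 1
  else
    echo "down"
    return 1
  fi
}
```

One possible way to feed it, reusing the cluster and service names from step 3:

    aws ecs describe-services \
    --cluster vulcan-prod-cluster \
    --services vulcan-prod-service \
    --query "services[0].[runningCount,desiredCount]" \
    --output text | ecs_service_state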

API / Service Issues

API Not Starting

Symptom: Container exits immediately or ECS task fails to start.

Check the logs first:

docker compose logs vulcan-api
# or
aws logs tail /ecs/vulcan-prod --since 5m

Common causes:

| Log Message | Cause | Fix |
| --- | --- | --- |
| failed to connect to database | Wrong DATABASE_URL or DB not running | Check DATABASE_URL format and DB status |
| JWT_SECRET not set | Missing required env var | Set JWT_SECRET in .env or Secrets Manager |
| listen tcp :8080: bind: address already in use | Port conflict | Change API_PORT or stop conflicting process |
| plugin not found: aws | Plugin binary missing | Check PLUGIN_DIR, ensure image includes plugins |
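
When the startup log is long, the lookup against the messages above can be automated. This is a minimal sketch, not part of Infracast: the function name `classify_startup_error` is hypothetical, and it simply matches the log messages listed above as substrings and prints the corresponding fix for the first one it finds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: scan log lines on stdin for the known startup
# errors listed in this guide and print the suggested fix for the first hit.
classify_startup_error() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"failed to connect to database"*)
        echo "database unreachable: check DATABASE_URL format and DB status"
        return 0 ;;
      *"JWT_SECRET not set"*)
        echo "JWT_SECRET missing: set JWT_SECRET in .env or Secrets Manager"
        return 0 ;;
      *"bind: address already in use"*)
        echo "port conflict: change API_PORT or stop the conflicting process"
        return 0 ;;
      *"plugin not found"*)
        echo "plugin missing: check PLUGIN_DIR and that the image includes plugins"
        return 0 ;;
    esac
  done
  # Nothing matched; the caller should read the log manually.
  echo "no known startup error found"
  return 1
}
```

Usage, for example: `docker compose logs vulcan-api 2>&1 | classify_startup_error`.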

API Returns 500 Errors

Check recent error logs:

# Docker
docker compose logs vulcan-api | grep '"level":"error"'

# ECS / CloudWatch (JSON filter pattern; assumes structured JSON log lines)
aws logs filter-log-events \
--log-group-name /ecs/vulcan-prod \
--filter-pattern '{ $.level = "error" }'

Check database:

# Verify DB is healthy
curl https://your-domain.com/healthz
# If db_connected is false, investigate database connectivity
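
The db_connected check can be scripted so it fits into a monitoring loop. The sketch below assumes the /healthz body is JSON with a boolean `db_connected` field; the field name comes from this guide, but the exact response shape is an assumption, and the function name `db_connected` is hypothetical. It avoids a jq dependency by stripping whitespace and extracting the value with sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a /healthz response body on stdin and report
# the db_connected flag. Assumes a JSON body such as {"db_connected":true}.
db_connected() {
  local value
  # Remove whitespace so "db_connected" : true and "db_connected":true
  # are treated the same, then capture the bare word after the colon.
  value=$(tr -d ' \t\r\n' | sed -n 's/.*"db_connected":\([a-z]*\).*/\1/p')
  case "$value" in
    true)  echo "true";    return 0 ;;
    false) echo "false";   return 1 ;;
    *)     echo "unknown"; return 2 ;;  # field absent or not a boolean
  esac
}
```

Usage, for example: `curl -fsS https://your-domain.com/healthz | db_connected || echo "investigate database connectivity"`.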

API Returns 503

The health check is failing. Common causes:

  • Database is down or unreachable
  • ECS task out of memory (check CloudWatch MemoryUtilization)
  • Startup still in progress (wait 30 seconds and retry)
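
To rule out the "startup still in progress" case without guessing, the health check can be retried for a bounded period. This is a generic retry sketch; `retry_until_ok` is a hypothetical helper name, and the attempt count and delay in the usage line are illustrative values chosen to cover roughly the 30 seconds mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on the first success, 1 if all fail.
# Usage: retry_until_ok ATTEMPTS DELAY COMMAND [ARGS...]
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Usage, for example: `retry_until_ok 7 5 curl -fsS -o /dev/null https://your-domain.com/healthz`. If this still fails after the last attempt, startup is unlikely to be the cause, and the database and memory checks are the next place to look.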

Discovery Job Failures

Job Fails Immediately

Symptom: Discovery job goes from pending to failed in seconds.

  1. Open the job in the UI and click View Logs
  2. Look for the first error message

Common causes:

| Error | Cause | Fix |
| --- | --- | --- |
| credential not found | Credential ID doesn't exist | Re-link credential to the job |
| invalid credentials | Wrong key/password | Test credentials in Settings → Credentials → Test |
| plugin not registered: vmware | Plugin binary missing | Check plugin is installed; contact support |
| context deadline exceeded | Timeout before first connection | Increase job timeout; check firewall |

Discovery Times Out

Symptom: Job runs for the full timeout duration, then fails with a timeout error.

  1. Check connectivity and access to the target:

    • For relay-based jobs: verify relay is online, test target connectivity via Relay → Test Connection
    • For direct SaaS: ensure your cloud credentials have necessary permissions
  2. Reduce scan scope for initial troubleshooting:

    • AWS: specify a single region instead of scanning all regions
    • Network: scan a single device before scanning a whole subnet
  3. Increase timeout in job configuration (default: 5 minutes, max: 60 minutes)

Partial Discovery Results

Symptom: Some resources are discovered but not all.

  • AWS (missing regions): check that the regions config includes all target regions
  • AWS (IAM permission gaps): use the IAM policy generator to confirm that all required permissions are granted
  • VMware (missing VMs): verify that the vCenter credentials have read access to all datacenters
  • Network (missing devices): check SNMP community strings and access control lists

"Invalid credentials" on a Job That Worked Before

Credentials may have been rotated. Update the credential in Settings → Credentials and re-run the job.


Relay Issues

Relay Shows "Offline"

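# Check relay container is running, then inspect its recent logs
docker ps -a --filter name=vulcan-relay
docker logs vulcan-relay --tail=50

# Test outbound connectivity from relay host
# (older curl builds reject wss://; https:// exercises the same host, port 443, and TLS handshake)
curl -v https://api.infracast.io/ws/relay 2>&1 | head -20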

Common causes:

| Issue | Fix |
| --- | --- |
| Container stopped | docker start vulcan-relay |
| Token revoked | Create new token in UI, redeploy relay |
| Outbound port 443 blocked | Check firewall rules; WSS requires port 443 |
| Proxy intercepting WebSocket | Configure proxy to allow WebSocket upgrade |

Relay Connected but Discovery Fails

The relay is online but tasks dispatched through it fail:

  1. Verify relay has network access to the target (not just to the internet)
  2. Use Relay → Test Connection in the UI to test connectivity to target host/port
  3. Check credentials are correct
  4. Review relay logs: docker logs vulcan-relay --tail=100

UI Issues

UI Not Loading / Blank Screen

Step 1: Hard refresh — Browser may have cached old assets:

  • Chrome/Firefox: Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (macOS)
  • Safari: Cmd+Option+R

Step 2: Check browser console for JavaScript errors (F12 → Console)

Step 3: Verify API is accessible:

curl https://api.your-domain.com/healthz

Step 4: Clear CloudFront cache (AWS self-hosted):

aws cloudfront create-invalidation \
--distribution-id YOUR_DIST_ID \
--paths "/*"

UI Shows "Network Error" / API Unreachable

  • Verify VITE_API_URL environment variable is set correctly in the UI container
  • Check CORS configuration — if UI and API are on different origins, ensure ALLOWED_ORIGINS includes the UI origin
  • For self-hosted: verify the nginx/ALB routing configuration forwards /api/* to the API server
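
A quick way to sanity-check the CORS setting is to test the UI origin against the configured value. The sketch below assumes ALLOWED_ORIGINS is a comma-separated list of origins and that a lone `*` means "allow all"; neither is confirmed by this guide, so check the format your deployment actually uses. The function name `origin_allowed` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if ORIGIN appears in a comma-separated
# ALLOWED_ORIGINS value. Whitespace around entries and a trailing slash
# are ignored; a lone "*" entry is treated as "allow all".
# Usage: origin_allowed ORIGIN ALLOWED_ORIGINS
origin_allowed() {
  local origin="${1%/}" rest="$2," entry
  while [ -n "$rest" ]; do
    entry="${rest%%,*}"   # text before the first comma
    rest="${rest#*,}"     # remainder after the first comma
    # Trim leading and trailing whitespace from the entry.
    entry="${entry#"${entry%%[![:space:]]*}"}"
    entry="${entry%"${entry##*[![:space:]]}"}"
    entry="${entry%/}"
    if [ -z "$entry" ]; then
      continue
    fi
    if [ "$entry" = "*" ] || [ "$entry" = "$origin" ]; then
      return 0
    fi
  done
  return 1
}
```

Usage, for example: `origin_allowed "https://app.your-domain.com" "$ALLOWED_ORIGINS" || echo "UI origin missing from ALLOWED_ORIGINS"`. The comparison is exact, so `http://` and `https://` variants of the same host count as different origins, as they do for browsers.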

Login Fails with Valid Credentials

  1. Verify the admin user was bootstrapped correctly
  2. Check API logs for authentication errors
  3. If JWT_SECRET was rotated, all sessions are invalidated — users must log in again
  4. Check for clock skew between client and server (JWT validation is time-sensitive)
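
Clock skew can be estimated by comparing the server's HTTP Date header with the local clock. The sketch below is an illustration under two assumptions: GNU `date` is available (it relies on `date -d`), and the server sends a standard Date header. The function name `clock_skew_seconds` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print local time minus server time, in seconds.
# Usage: clock_skew_seconds "HTTP-DATE" [LOCAL_EPOCH]
# LOCAL_EPOCH defaults to the current time. Requires GNU date for -d.
clock_skew_seconds() {
  local server_date="$1" local_epoch="${2:-$(date -u +%s)}" server_epoch
  # Convert the HTTP date (e.g. "Wed, 21 Oct 2015 07:28:00 GMT") to epoch seconds.
  server_epoch=$(date -u -d "$server_date" +%s) || return 2
  echo $((local_epoch - server_epoch))
}
```

Usage, for example:

    server_date=$(curl -sI https://your-domain.com/healthz | tr -d '\r' | sed -n 's/^[Dd]ate: //p')
    clock_skew_seconds "$server_date"

A positive result means the local clock is ahead of the server. The Date header has one-second resolution and the request takes time, so a difference of a second or two is noise; a skew of minutes is worth fixing with NTP.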

Database Issues

"Too Many Connections" Error

PostgreSQL has a maximum connection limit. Infracast uses a connection pool internally.

Immediate fix:

# Check current connections
psql -U vulcan -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Terminate connections idle for more than 10 minutes, excluding this session (use with caution)
psql -U vulcan -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes' AND pid <> pg_backend_pid();"

Long-term fix:

  • Increase max_connections in PostgreSQL config (requires restart)
  • Add PgBouncer connection pooler in front of PostgreSQL
  • Reduce the number of ECS tasks if their combined connection pools exceed the limit (each task opens its own pool)
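
Whether the pools fit under max_connections is simple arithmetic: tasks times pool size must not exceed max_connections minus the reserved slots. The sketch below assumes every ECS task opens one pool of a fixed size; the pool size is a hypothetical input, since this guide does not state what Infracast uses. The default of 3 reserved slots matches PostgreSQL's default superuser_reserved_connections. The function name `pool_budget` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check whether TASKS pools of POOL_SIZE connections
# fit within MAX_CONNECTIONS, leaving RESERVED slots free (default 3).
# Usage: pool_budget TASKS POOL_SIZE MAX_CONNECTIONS [RESERVED]
pool_budget() {
  local tasks="$1" pool_size="$2" max_connections="$3" reserved="${4:-3}"
  local needed=$((tasks * pool_size))
  local available=$((max_connections - reserved))
  echo "needed=$needed available=$available"
  # Exit status reports the verdict: 0 if the pools fit, 1 otherwise.
  [ "$needed" -le "$available" ]
}
```

For example, `pool_budget 4 25 100` reports 100 connections needed against 97 available and exits non-zero, while three tasks with the same pool size fit. Current values can be read with `psql -U vulcan -c "SHOW max_connections;"`.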

Database Disk Full

Symptom: API returns 500 errors, logs show no space left on device.

Immediate relief:

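# Check database size
psql -U vulcan -c "SELECT pg_size_pretty(pg_database_size('vulcan'));"

# Vacuum to reclaim space
# (VACUUM FULL locks each table exclusively and needs free space for the rewritten copy)
psql -U vulcan -d vulcan -c "VACUUM FULL;"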

In AWS (RDS):

# Increase storage (online, no restart needed for gp2/gp3)
aws rds modify-db-instance \
--db-instance-identifier vulcan-prod \
--allocated-storage 200 \
--apply-immediately

Migration Fails on Startup

Symptom: The API logs migration failed and then exits.

  1. Check the error message for specific SQL failure
  2. Verify the database user has DDL permissions
  3. If upgrading from a very old version, review the upgrade guide

Agent Issues

Agent Not Appearing in UI

  1. Verify the agent service is running:

    systemctl status infracast-agent   # Linux
    sc query InfracastAgent # Windows
  2. Check agent logs:

    journalctl -u infracast-agent -n 50 --no-pager   # Linux
    # Windows: Event Viewer → Application log, source = InfracastAgent
  3. Verify outbound HTTPS connectivity from the host:

    curl -v https://api.infracast.io/healthz
  4. Confirm the enrollment token was not already used (tokens are single-use)

Agent Shows "Stale" or "Offline"

The agent stopped sending heartbeats. Common causes:

  • Host is powered off or network unreachable
  • Agent service crashed — check system logs
  • API server unreachable from the host (firewall change)

Restart the agent:

systemctl restart infracast-agent   # Linux
Restart-Service InfracastAgent # Windows

Log Locations

| Deployment | Log Location | Command |
| --- | --- | --- |
| Docker | Container stdout | docker compose logs vulcan-api |
| ECS / AWS | CloudWatch | aws logs tail /ecs/vulcan-{env} --follow |
| Agent (Linux) | journald | journalctl -u infracast-agent -f |
| Agent (Windows) | Event Viewer | Application log, source InfracastAgent |
| Relay (Docker) | Container stdout | docker logs vulcan-relay -f |

Getting Support

If you can't resolve an issue with this guide:

  1. Collect diagnostics:

    # API version and health
    curl https://your-domain.com/api/v1/version
    curl https://your-domain.com/healthz

    # Last 100 error log lines
    docker compose logs vulcan-api 2>&1 | grep '"level":"error"' | tail -100
  2. Open a support ticket at support.infracast.io with:

    • Deployment type (SaaS / Docker / ECS)
    • Version (curl /api/v1/version)
    • Relevant log lines
    • Steps to reproduce
  3. Community Slack: infracast.io/slack for community help