Troubleshooting

Common issues and solutions for Infracast deployments.

Quick Diagnostics

Start here before diving into specific issues:

# 1. Check API health
curl https://your-domain.com/healthz

# 2. Check container/service status (Docker)
docker compose ps
docker compose logs vulcan-api --tail=50

# 3. Check ECS service (AWS)
aws ecs describe-services \
  --cluster vulcan-prod-cluster \
  --services vulcan-prod-service \
  --query "services[0].{Status:status,Running:runningCount,Desired:desiredCount}"

# 4. Tail live logs (ECS)
aws logs tail /ecs/vulcan-prod --follow

# 5. Database connectivity
docker compose exec postgres psql -U vulcan -c "SELECT 1"
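The checks above can be wrapped in a small helper so that a single run reports which ones pass. This is only a sketch: `run_check` is a hypothetical helper name, not something Infracast ships, and the commented invocations reuse the commands listed above.

```shell
# run_check NAME CMD [ARGS...]: run a command quietly and print PASS or FAIL.
# Hypothetical helper for illustration; not part of Infracast.
run_check() {
  name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example wiring, using commands from the list above:
# run_check "api health" curl -fsS https://your-domain.com/healthz
# run_check "database"   docker compose exec -T postgres psql -U vulcan -c "SELECT 1"
```

The first `FAIL` line usually tells you which section of this guide to read next.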

API / Service Issues

API Not Starting

Symptom: Container exits immediately or ECS task fails to start.

Check the logs first:

docker compose logs vulcan-api
# or
aws logs tail /ecs/vulcan-prod --since 5m

Common causes:

Log Message	Cause	Fix
`failed to connect to database`	Wrong DATABASE_URL or DB not running	Check `DATABASE_URL` format and DB status
`JWT_SECRET not set`	Missing required env var	Set `JWT_SECRET` in .env or Secrets Manager
`listen tcp :8080: bind: address already in use`	Port conflict	Change `API_PORT` or stop conflicting process
`plugin not found: aws`	Plugin binary missing	Check `PLUGIN_DIR`, ensure image includes plugins
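When the startup log is long, the table above can be applied mechanically. The sketch below maps a log line to the likely cause; `classify_startup_error` is a hypothetical helper, and the match strings are the log messages from the table, not an exhaustive list.

```shell
# classify_startup_error LINE: map a startup log line to the likely cause,
# following the table above. Hypothetical helper, not part of Infracast.
classify_startup_error() {
  case $1 in
    *"failed to connect to database"*) echo "database: check DATABASE_URL and DB status" ;;
    *"JWT_SECRET not set"*)            echo "config: set JWT_SECRET" ;;
    *"address already in use"*)        echo "port: change API_PORT or stop the conflicting process" ;;
    *"plugin not found"*)              echo "plugin: check PLUGIN_DIR and image contents" ;;
    *)                                 echo "unknown" ;;
  esac
}

# Usage: summarize the causes seen in the container log.
# docker compose logs vulcan-api 2>&1 | while IFS= read -r line; do
#   classify_startup_error "$line"
# done | grep -v '^unknown$' | sort | uniq -c
```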

API Returns 500 Errors

Check recent error logs:

# Docker
docker compose logs vulcan-api | grep '"level":"error"'

# ECS / CloudWatch (JSON filter syntax; assumes each log event is a JSON object)
aws logs filter-log-events \
  --log-group-name /ecs/vulcan-prod \
  --filter-pattern '{ $.level = "error" }'

Check database:

# Verify DB is healthy
curl https://your-domain.com/healthz
# If db_connected is false, investigate database connectivity
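To script the `db_connected` check, a small filter over the health response is enough. This sketch assumes `/healthz` returns a JSON body containing a boolean `db_connected` field; the exact response shape is an assumption, and `db_is_connected` is a hypothetical helper name.

```shell
# db_is_connected: read a /healthz response body on stdin and succeed only
# if it contains "db_connected": true. The JSON shape is an assumption.
db_is_connected() {
  grep -Eq '"db_connected"[[:space:]]*:[[:space:]]*true'
}

# Usage:
# curl -fsS https://your-domain.com/healthz | db_is_connected \
#   || echo "database unreachable: see Database Issues"
```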

API Returns 503

The health check is failing. Common causes:

Database is down or unreachable
ECS task out of memory (check CloudWatch MemoryUtilization)
Startup still in progress (wait 30 seconds and retry)
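If startup may still be in progress, polling the health endpoint avoids guessing when to retry. The sketch below is a generic retry loop; `retry` is a hypothetical helper, and the attempt count and delay are illustrative values rather than Infracast defaults.

```shell
# retry ATTEMPTS DELAY CMD [ARGS...]: run CMD until it succeeds, sleeping
# DELAY seconds between tries. Returns 1 if every attempt fails.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt follows.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Usage: poll the health endpoint for roughly 30 seconds.
# retry 6 5 curl -fsS -o /dev/null https://your-domain.com/healthz
```

If the loop still fails after the startup window, move on to the database and memory checks listed above.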

Discovery Job Failures

Job Fails Immediately

Symptom: Discovery job goes from pending to failed in seconds.

Open the job in the UI and click View Logs
Look for the first error message

Common causes:

Error	Cause	Fix
`credential not found`	Credential ID doesn't exist	Re-link credential to the job
`invalid credentials`	Wrong key/password	Test credentials in Settings → Credentials → Test
`plugin not registered: vmware`	Plugin binary missing	Check plugin is installed; contact support
`context deadline exceeded`	Timeout before first connection	Increase job timeout; check firewall

Discovery Times Out

Symptom: Job runs for the timeout duration then fails with timeout.

Check network connectivity to the target:
- For relay-based jobs: verify the relay is online, then test target connectivity via Relay → Test Connection
- For direct SaaS: ensure your cloud credentials have the necessary permissions
Reduce scan scope for initial troubleshooting:
- AWS: specify a single region instead of scanning all regions
- Network: scan a single device before scanning a whole subnet
Increase timeout in job configuration (default: 5 minutes, max: 60 minutes)

Partial Discovery Results

Symptom: Some resources are discovered but not all.

AWS (missing regions): check that the regions config includes all target regions
AWS (IAM permission gaps): use the IAM policy generator to confirm that all required permissions are granted
VMware (missing VMs): verify that the vCenter credentials have read access to all datacenters
Network (missing devices): check SNMP community strings and access control lists
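For the missing-regions case, comparing the regions you expect against the regions in the job configuration is a simple set difference. The sketch below takes both lists as space-separated strings; `missing_regions` is a hypothetical helper, and how you export the configured list from Infracast is left open.

```shell
# missing_regions "EXPECTED..." "CONFIGURED...": print each expected region
# that is absent from the configured list, one per line.
# Both arguments are space-separated lists of region names.
missing_regions() {
  for region in $1; do
    case " $2 " in
      *" $region "*) ;;          # present in the job config
      *) echo "$region" ;;       # expected but not configured
    esac
  done
}

# Usage:
# missing_regions "us-east-1 us-west-2 eu-west-1" "us-east-1 eu-west-1"
```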

"Invalid credentials" on a Job That Worked Before

Credentials may have been rotated. Update the credential in Settings → Credentials and re-run the job.

Relay Issues

Relay Shows "Offline"

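# Check relay container is running
docker ps --all --filter name=vulcan-relay
docker logs --tail=50 vulcan-relay

# Test outbound TLS connectivity from relay host
# (curl builds without WebSocket support reject wss:// URLs, so probe the same host over https://)
curl -v https://api.infracast.io/ws/relay 2>&1 | head -20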

Common causes:

Issue	Fix
Container stopped	`docker start vulcan-relay`
Token revoked	Create new token in UI, redeploy relay
Outbound port 443 blocked	Check firewall rules; WSS requires port 443
Proxy intercepting WebSocket	Configure proxy to allow WebSocket upgrade

Relay Connected but Discovery Fails

The relay is online but tasks dispatched through it fail:

Verify relay has network access to the target (not just to the internet)
Use Relay → Test Connection in the UI to test connectivity to target host/port
Check credentials are correct
Review relay logs: docker logs vulcan-relay --tail=100

UI Issues

UI Not Loading / Blank Screen

Step 1: Hard refresh — Browser may have cached old assets:

Chrome/Firefox: Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (macOS)
Safari: Cmd+Option+R

Step 2: Check browser console for JavaScript errors (F12 → Console)

Step 3: Verify API is accessible:

curl https://api.your-domain.com/healthz

Step 4: Clear CloudFront cache (AWS self-hosted):

aws cloudfront create-invalidation \
  --distribution-id YOUR_DIST_ID \
  --paths "/*"

UI Shows "Network Error" / API Unreachable

Verify VITE_API_URL environment variable is set correctly in the UI container
Check CORS configuration — if UI and API are on different origins, ensure ALLOWED_ORIGINS includes the UI origin
For self-hosted: verify the nginx/ALB routing configuration forwards /api/* to the API server

Login Fails with Valid Credentials

Verify the admin user was bootstrapped correctly
Check API logs for authentication errors
If JWT_SECRET was rotated, all sessions are invalidated — users must log in again
Check for clock skew between client and server (JWT validation is time-sensitive)
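Clock skew can be estimated by comparing the local clock with the `Date` header the server returns. The sketch below computes the absolute difference in seconds; `clock_skew` is a hypothetical helper, the commented usage assumes GNU `date` for the `-d` option, and the amount of skew Infracast tolerates is not stated here.

```shell
# clock_skew LOCAL_EPOCH SERVER_EPOCH: print the absolute difference in seconds.
clock_skew() {
  diff=$(( $1 - $2 ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  echo "$diff"
}

# Usage (run on the affected client; GNU date assumed for -d):
# server=$(curl -sI https://your-domain.com/healthz | tr -d '\r' | sed -n 's/^[Dd]ate: //p')
# clock_skew "$(date +%s)" "$(date -d "$server" +%s)"
```

A result of more than a few seconds is worth correcting with NTP before investigating further.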

Database Issues

"Too Many Connections" Error

PostgreSQL has a maximum connection limit. Infracast uses a connection pool internally.

Immediate fix:

# Check current connections
psql -U vulcan -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"

# Terminate connections idle for more than 10 minutes (use with caution)
psql -U vulcan -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND state_change < NOW() - INTERVAL '10 minutes' AND pid <> pg_backend_pid();"

Long-term fix:

Increase max_connections in PostgreSQL config (requires restart)
Add PgBouncer connection pooler in front of PostgreSQL
Reduce the number of ECS tasks if their combined connection pools exceed max_connections
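Whether the pools can exhaust the server is a matter of arithmetic: tasks times pool size, compared against `max_connections` minus the reserved superuser slots. The sketch below assumes every task opens its own pool of the same size; the pool size is a value you must look up in your own configuration, and `pool_headroom` is a hypothetical helper.

```shell
# pool_headroom MAX_CONNECTIONS TASKS POOL_SIZE [RESERVED]: print how many
# connections remain once every task has filled its pool.
# A negative result means the pools can exhaust max_connections.
pool_headroom() {
  reserved=${4:-3}   # PostgreSQL's default superuser_reserved_connections
  echo $(( $1 - reserved - $2 * $3 ))
}

# Usage: 4 tasks with a pool of 25 each against max_connections = 100.
# pool_headroom 100 4 25
```

A negative number suggests fewer tasks, smaller pools, a higher `max_connections`, or a pooler such as PgBouncer.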

Database Disk Full

Symptom: API returns 500 errors, logs show no space left on device.

Immediate relief:

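# Check database size
psql -U vulcan -c "SELECT pg_size_pretty(pg_database_size('vulcan'));"

# Vacuum to reclaim space. VACUUM FULL rewrites each table, so it takes an
# exclusive lock and temporarily needs free space for the copy.
psql -U vulcan -d vulcan -c "VACUUM FULL;"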

In AWS (RDS):

# Increase storage (online, no restart needed for gp2/gp3)
aws rds modify-db-instance \
  --db-instance-identifier vulcan-prod \
  --allocated-storage 200 \
  --apply-immediately

Migration Fails on Startup

Symptom: API logs show migration failed and exits.

Check the error message for specific SQL failure
Verify the database user has DDL permissions
If upgrading from a very old version, review the upgrade guide

Agent Issues

Agent Not Appearing in UI

Verify the agent service is running:

systemctl status infracast-agent   # Linux
sc query InfracastAgent            # Windows

Check agent logs:

journalctl -u infracast-agent -n 50 --no-pager   # Linux
# Windows: Event Viewer → Application log, source = InfracastAgent

Verify outbound HTTPS connectivity from the host:
```
curl -v https://api.infracast.io/healthz
```
Confirm the enrollment token was not already used (tokens are single-use)

Agent Shows "Stale" or "Offline"

The agent stopped sending heartbeats. Common causes:

Host is powered off or network unreachable
Agent service crashed — check system logs
API server unreachable from the host (firewall change)

Restart the agent:

systemctl restart infracast-agent   # Linux
Restart-Service InfracastAgent       # Windows

Log Locations

Deployment	Log Location	Command
Docker	Container stdout	`docker compose logs vulcan-api`
ECS / AWS	CloudWatch	`aws logs tail /ecs/vulcan-{env} --follow`
Agent (Linux)	journald	`journalctl -u infracast-agent -f`
Agent (Windows)	Event Viewer	Application log, source InfracastAgent
Relay (Docker)	Container stdout	`docker logs vulcan-relay -f`

Getting Support

If you can't resolve an issue with this guide:

Collect diagnostics:

# API version and health
curl https://your-domain.com/api/v1/version
curl https://your-domain.com/healthz

# Last 100 error log lines
docker compose logs vulcan-api 2>&1 | grep '"level":"error"' | tail -100

Open a support ticket at support.infracast.io with:
- Deployment type (SaaS / Docker / ECS)
- Version (curl /api/v1/version)
- Relevant log lines
- Steps to reproduce
Community Slack — infracast.io/slack for community help
