Troubleshooting
Common issues and solutions for Infracast deployments.
Quick Diagnostics
Start here before diving into specific issues:
# 1. Check API health
curl https://your-domain.com/healthz
# 2. Check container/service status (Docker)
docker compose ps
docker compose logs vulcan-api --tail=50
# 3. Check ECS service (AWS)
aws ecs describe-services \
--cluster vulcan-prod-cluster \
--services vulcan-prod-service \
--query "services[0].{Status:status,Running:runningCount,Desired:desiredCount}"
# 4. Tail live logs (ECS)
aws logs tail /ecs/vulcan-prod --follow
# 5. Database connectivity
docker compose exec postgres psql -U vulcan -c "SELECT 1"
API / Service Issues
API Not Starting
Symptom: Container exits immediately or ECS task fails to start.
Check the logs first:
docker compose logs vulcan-api
# or
aws logs tail /ecs/vulcan-prod --since 5m
Common causes:
| Log Message | Cause | Fix |
|---|---|---|
| failed to connect to database | Wrong DATABASE_URL or DB not running | Check DATABASE_URL format and DB status |
| JWT_SECRET not set | Missing required env var | Set JWT_SECRET in .env or Secrets Manager |
| listen tcp :8080: bind: address already in use | Port conflict | Change API_PORT or stop conflicting process |
| plugin not found: aws | Plugin binary missing | Check PLUGIN_DIR, ensure image includes plugins |
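When the log is long, matching lines against the table by eye is slow. The following is a minimal sketch that maps a log line to the likely fix. The function name classify_startup_error is hypothetical and not part of Infracast, and the patterns are only the four messages listed in the table.

```shell
#!/usr/bin/env bash
# classify_startup_error LINE
# Prints the likely fix for a startup log line, based on the table of
# common causes. Hypothetical helper; the patterns mirror the table only.
classify_startup_error() {
  local line="$1"
  case "$line" in
    *"failed to connect to database"*) echo "database: check DATABASE_URL format and DB status" ;;
    *"JWT_SECRET not set"*)            echo "config: set JWT_SECRET in .env or Secrets Manager" ;;
    *"address already in use"*)        echo "port: change API_PORT or stop the conflicting process" ;;
    *"plugin not found"*)              echo "plugin: check PLUGIN_DIR and that the image includes plugins" ;;
    *)                                 echo "unknown" ;;
  esac
}

# Usage (left as a comment so that sourcing this file has no side effects):
#   docker compose logs vulcan-api --tail=50 2>&1 |
#     while IFS= read -r l; do classify_startup_error "$l"; done | sort | uniq -c
```

Lines reported as unknown still need to be read manually; the sketch only recognizes the messages documented here.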
API Returns 500 Errors
Check recent error logs:
# Docker
docker compose logs vulcan-api | grep '"level":"error"'
# ECS / CloudWatch
aws logs filter-log-events \
--log-group-name /ecs/vulcan-prod \
--filter-pattern '"level":"error"'
Check database:
# Verify DB is healthy
curl https://your-domain.com/healthz
# If db_connected is false, investigate database connectivity
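The db_connected check can be scripted. This sketch assumes that /healthz returns a JSON body containing a boolean db_connected field; the exact response shape is not documented here, so treat the pattern as an assumption. The function name db_status is hypothetical.

```shell
#!/usr/bin/env bash
# db_status
# Reads a /healthz response body on stdin and prints "up", "down", or
# "unknown" depending on the db_connected field.
# Assumption: the body is JSON with a boolean "db_connected" key.
db_status() {
  local body
  body="$(cat)"
  if printf '%s' "$body" | grep -Eq '"db_connected"[[:space:]]*:[[:space:]]*true'; then
    echo up
  elif printf '%s' "$body" | grep -Eq '"db_connected"[[:space:]]*:[[:space:]]*false'; then
    echo down
  else
    echo unknown
  fi
}

# Usage:
#   curl -sS https://your-domain.com/healthz | db_status
```

An "unknown" result means the field was not found, which usually points at a proxy or load balancer answering instead of the API.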
API Returns 503
The health check is failing. Common causes:
- Database is down or unreachable
- ECS task out of memory (check CloudWatch MemoryUtilization)
- Startup still in progress (wait 30 seconds and retry)
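To tell a slow startup apart from a persistent failure, the wait-and-retry step can be automated. The sketch below is a generic retry helper (the name retry is hypothetical); the attempt count and delay in the usage line are illustrative values chosen to cover roughly 30 seconds.

```shell
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage: curl -f exits non-zero on HTTP 4xx/5xx, so a 503 counts as a failure.
#   retry 7 5 curl -fsS -o /dev/null https://your-domain.com/healthz \
#     && echo "healthy" || echo "still failing, check database and memory"
```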
Discovery Job Failures
Job Fails Immediately
Symptom: Discovery job goes from pending to failed in seconds.
- Open the job in the UI and click View Logs
- Look for the first error message
Common causes:
| Error | Cause | Fix |
|---|---|---|
| credential not found | Credential ID doesn't exist | Re-link credential to the job |
| invalid credentials | Wrong key/password | Test credentials in Settings → Credentials → Test |
| plugin not registered: vmware | Plugin binary missing | Check plugin is installed; contact support |
| context deadline exceeded | Timeout before first connection | Increase job timeout; check firewall |
Discovery Times Out
Symptom: Job runs for the timeout duration then fails with timeout.
- Check network connectivity from relay to target:
  - For relay-based jobs: verify the relay is online, then test target connectivity via Relay → Test Connection
  - For direct SaaS: ensure your cloud credentials have the necessary permissions
- Reduce scan scope for initial troubleshooting:
  - AWS: specify a single region instead of scanning all regions
  - Network: scan a single device before scanning a whole subnet
- Increase the timeout in the job configuration (default: 5 minutes, max: 60 minutes)
Partial Discovery Results
Symptom: Some resources are discovered but not all.
- AWS, missing regions: check that the regions config includes all target regions
- AWS, IAM permission gaps: use the IAM policy generator to confirm all required permissions are granted
- VMware, missing VMs: verify vCenter credentials have read access to all datacenters
- Network, missing devices: check SNMP community strings and access control lists
"Invalid credentials" on a Job That Worked Before
Credentials may have been rotated. Update the credential in Settings → Credentials and re-run the job.
Relay Issues
Relay Shows "Offline"
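# Check relay container is running, then read its recent logs
docker ps -a --filter name=vulcan-relay
docker logs vulcan-relay --tail=50
# Test outbound connectivity from relay host.
# curl builds without WebSocket support reject wss:// URLs; an HTTPS request
# to the same endpoint exercises the same host, port 443, and TLS path.
curl -v https://api.infracast.io/ws/relay 2>&1 | head -20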
Common causes:
| Issue | Fix |
|---|---|
| Container stopped | docker start vulcan-relay |
| Token revoked | Create new token in UI, redeploy relay |
| Outbound port 443 blocked | Check firewall rules; WSS requires port 443 |
| Proxy intercepting WebSocket | Configure proxy to allow WebSocket upgrade |
Relay Connected but Discovery Fails
The relay is online but tasks dispatched through it fail:
- Verify relay has network access to the target (not just to the internet)
- Use Relay → Test Connection in the UI to test connectivity to target host/port
- Check credentials are correct
- Review relay logs:
docker logs vulcan-relay --tail=100
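If the UI test is not available, target reachability can also be checked from a shell on the relay host. This is a rough sketch that assumes bash and the GNU coreutils timeout command are present on that host; the function name tcp_reachable and the example address are hypothetical.

```shell
#!/usr/bin/env bash
# tcp_reachable HOST PORT [TIMEOUT_SECONDS]
# Succeeds if a TCP connection to HOST:PORT opens within the timeout
# (default 5 seconds). Uses bash's built-in /dev/tcp redirection, so no
# extra tools such as nc are required.
tcp_reachable() {
  local host="$1" port="$2" secs="${3:-5}"
  # Inside the child shell, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage (the address below is a placeholder for your target):
#   tcp_reachable 10.0.0.15 443 && echo "reachable" || echo "blocked or down"
```

A successful connection only shows that the port is open from the relay host; it does not validate credentials or protocol-level access.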
UI Issues
UI Not Loading / Blank Screen
Step 1: Hard refresh. The browser may have cached old assets:
- Chrome/Firefox: Ctrl+Shift+R (Windows/Linux) or Cmd+Shift+R (macOS)
- Safari: Cmd+Option+R
Step 2: Check browser console for JavaScript errors (F12 → Console)
Step 3: Verify API is accessible:
curl https://api.your-domain.com/healthz
Step 4: Clear CloudFront cache (AWS self-hosted):
aws cloudfront create-invalidation \
--distribution-id YOUR_DIST_ID \
--paths "/*"
UI Shows "Network Error" / API Unreachable
- Verify the VITE_API_URL environment variable is set correctly in the UI container
- Check CORS configuration: if the UI and API are on different origins, ensure ALLOWED_ORIGINS includes the UI origin
- For self-hosted: verify the nginx/ALB routing configuration forwards /api/* to the API server
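The CORS check can be done from the command line by sending a preflight request and inspecting the response headers. The sketch below parses headers from stdin; the function name cors_allows and the UI origin in the usage line are hypothetical, and it is an assumption that the API answers OPTIONS preflight requests on /api/v1/version.

```shell
#!/usr/bin/env bash
# cors_allows ORIGIN
# Reads HTTP response headers on stdin and succeeds if the
# Access-Control-Allow-Origin header is "*" or exactly ORIGIN.
cors_allows() {
  local origin="$1" value
  value="$(tr -d '\r' \
    | grep -i '^access-control-allow-origin:' \
    | head -n 1 \
    | cut -d: -f2- \
    | sed 's/^[[:space:]]*//')"
  [ "$value" = "*" ] || [ "$value" = "$origin" ]
}

# Usage: -D - prints response headers to stdout, -o /dev/null drops the body.
#   curl -s -o /dev/null -D - -X OPTIONS \
#     -H "Origin: https://ui.your-domain.com" \
#     -H "Access-Control-Request-Method: GET" \
#     https://api.your-domain.com/api/v1/version \
#     | cors_allows "https://ui.your-domain.com" \
#     && echo "origin allowed" || echo "origin not allowed, check ALLOWED_ORIGINS"
```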
Login Fails with Valid Credentials
- Verify the admin user was bootstrapped correctly
- Check API logs for authentication errors
- If JWT_SECRET was rotated, all sessions are invalidated — users must log in again
- Check for clock skew between client and server (JWT validation is time-sensitive)
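Clock skew can be estimated by comparing the local clock with the Date header that the server returns. This sketch assumes GNU date (for the -d option) and that the server or its load balancer sends a standard HTTP Date header; the function name clock_skew_seconds is hypothetical. Run it on the machine whose clock you suspect.

```shell
#!/usr/bin/env bash
# clock_skew_seconds HTTP_DATE [LOCAL_EPOCH]
# Prints local time minus server time, in seconds. A positive value means
# the local clock is ahead. LOCAL_EPOCH defaults to the current time.
# Assumption: GNU date, which can parse HTTP dates via -d.
clock_skew_seconds() {
  local server_epoch local_epoch
  server_epoch="$(date -u -d "$1" +%s)" || return 1
  local_epoch="${2:-$(date -u +%s)}"
  echo $((local_epoch - server_epoch))
}

# Usage:
#   server_date="$(curl -s -o /dev/null -D - https://your-domain.com/healthz \
#     | tr -d '\r' | grep -i '^date:' | head -n 1 | cut -d' ' -f2-)"
#   clock_skew_seconds "$server_date"
```

A result of a few seconds is normal network delay. Skew of minutes suggests checking NTP on the affected machine.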
Database Issues
"Too Many Connections" Error
PostgreSQL has a maximum connection limit. Infracast uses a connection pool internally.
Immediate fix:
# Check current connections
psql -U vulcan -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Terminate idle connections (use with caution)
psql -U vulcan -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < NOW() - INTERVAL '10 minutes';"
Long-term fix:
- Increase max_connections in the PostgreSQL config (requires restart)
- Add a PgBouncer connection pooler in front of PostgreSQL
- Scale down the number of ECS tasks if they open too many pools
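Before changing settings, it helps to estimate how many connections the deployment can open at peak. The sketch below is simple arithmetic under stated assumptions: each API task holds one pool of a fixed size (the pool size is deployment-specific and not given here), and PostgreSQL reserves 3 connections for superusers, which is its default for superuser_reserved_connections. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# connections_needed TASKS POOL_SIZE [RESERVED]
# Peak connection demand: one pool per task plus reserved superuser slots.
# RESERVED defaults to 3 (PostgreSQL's default superuser_reserved_connections).
connections_needed() {
  local tasks="$1" pool_size="$2" reserved="${3:-3}"
  echo $((tasks * pool_size + reserved))
}

# fits_max_connections MAX TASKS POOL_SIZE [RESERVED]
# Succeeds if the peak demand fits within max_connections.
fits_max_connections() {
  local max="$1"
  shift
  [ "$(connections_needed "$@")" -le "$max" ]
}

# Usage (4 tasks with an assumed pool size of 25):
#   max="$(psql -U vulcan -Atc 'SHOW max_connections;')"
#   fits_max_connections "$max" 4 25 || echo "pools can exceed max_connections"
```

With the PostgreSQL default of 100 for max_connections, four tasks at 25 connections each need 103 slots and do not fit, while three tasks need 78 and do.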
Database Disk Full
Symptom: API returns 500 errors, logs show no space left on device.
Immediate relief:
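# Check database size (human-readable)
psql -U vulcan -c "SELECT pg_size_pretty(pg_database_size('vulcan'));"
# Vacuum to reclaim space. VACUUM FULL rewrites each table, so it needs
# temporary free disk space and holds an exclusive lock on the table it is processing.
psql -U vulcan -d vulcan -c "VACUUM FULL;"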
In AWS (RDS):
# Increase storage (online, no restart needed for gp2/gp3)
aws rds modify-db-instance \
--db-instance-identifier vulcan-prod \
--allocated-storage 200 \
--apply-immediately
Migration Fails on Startup
Symptom: API logs show migration failed and exits.
- Check the error message for specific SQL failure
- Verify the database user has DDL permissions
- If upgrading from a very old version, review the upgrade guide
Agent Issues
Agent Not Appearing in UI
- Verify the agent service is running:
  systemctl status infracast-agent # Linux
  sc query InfracastAgent # Windows
- Check agent logs:
  journalctl -u infracast-agent -n 50 --no-pager # Linux
  # Windows: Event Viewer → Application log, source = InfracastAgent
- Verify outbound HTTPS connectivity from the host:
  curl -v https://api.infracast.io/healthz
- Confirm the enrollment token was not already used (tokens are single-use)
Agent Shows "Stale" or "Offline"
The agent stopped sending heartbeats. Common causes:
- Host is powered off or network unreachable
- Agent service crashed — check system logs
- API server unreachable from the host (firewall change)
Restart the agent:
systemctl restart infracast-agent # Linux
Restart-Service InfracastAgent # Windows
Log Locations
| Deployment | Log Location | Command |
|---|---|---|
| Docker | Container stdout | docker compose logs vulcan-api |
| ECS / AWS | CloudWatch | aws logs tail /ecs/vulcan-{env} --follow |
| Agent (Linux) | journald | journalctl -u infracast-agent -f |
| Agent (Windows) | Event Viewer | Application log, source InfracastAgent |
| Relay (Docker) | Container stdout | docker logs vulcan-relay -f |
Getting Support
If you can't resolve an issue with this guide:
- Collect diagnostics:
  # API version and health
  curl https://your-domain.com/api/v1/version
  curl https://your-domain.com/healthz
  # Last 100 error log lines
  docker compose logs vulcan-api 2>&1 | grep '"level":"error"' | tail -100
- Open a support ticket at support.infracast.io with:
  - Deployment type (SaaS / Docker / ECS)
  - Version (curl /api/v1/version)
  - Relevant log lines
  - Steps to reproduce
- Community Slack: infracast.io/slack for community help
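Log lines collected for a ticket can contain connection strings. The sketch below filters the error lines as in the diagnostics step and masks passwords embedded in URLs before you share them. Both function names are hypothetical, and the redaction covers only the user:password@ URL form, so review the output before attaching it.

```shell
#!/usr/bin/env bash
# last_errors [N]
# Reads log lines on stdin and prints the last N (default 100) lines
# that carry the "level":"error" marker used in the commands above.
last_errors() {
  grep '"level":"error"' | tail -n "${1:-100}"
}

# redact_url_passwords
# Masks the password part of scheme://user:password@host URLs on stdin.
# Other secret formats are not handled by this sketch.
redact_url_passwords() {
  sed -E 's#(://[^:/@]+:)[^@/]+@#\1***@#g'
}

# Usage:
#   docker compose logs vulcan-api 2>&1 | last_errors 100 | redact_url_passwords > errors.txt
```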