Worker Troubleshooting
Diagnose and resolve common worker issues including connection failures, status problems, test execution errors, resource exhaustion, and token issues.
This guide covers every common worker issue, organized by symptom. Start with the symptom you are observing, follow the diagnostic steps, and apply the resolution.
Worker Shows OFFLINE
The worker is not connected to the CallMeter Worker Gateway.
Symptoms
- Worker status shows OFFLINE in the CallMeter dashboard
- Health endpoint returns "connected": false or is unreachable
- No heartbeats received by the platform
Diagnostic Steps
1. Check if the container is running:
docker ps | grep callmeter-worker
If the container is not listed, it has stopped or crashed:
# Check stopped containers
docker ps -a | grep callmeter-worker
# View exit reason
docker inspect callmeter-worker --format='{{.State.ExitCode}} {{.State.Error}}'
# Check logs for crash reason
docker logs --tail 100 callmeter-worker
2. Check the worker token:
Verify the WORKER_TOKEN environment variable is set correctly:
docker exec callmeter-worker env | grep WORKER_TOKEN
The token should:
- Start with cmw_
- Be exactly 68 characters long
- Match the token generated (or last regenerated) in the CallMeter UI
3. Check network connectivity to the gateway:
# DNS resolution
docker exec callmeter-worker nslookup wg.callmeter.io
# TCP connectivity
docker exec callmeter-worker nc -zv wg.callmeter.io 443
If DNS fails, check your DNS resolver configuration. If the TCP connection is refused or times out, check your firewall rules for outbound TCP 443.
4. Check container logs for connection errors:
docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|fail\|refused\|timeout"
Common Causes and Resolutions
| Cause | Resolution |
|---|---|
| Container not running | docker compose up -d or investigate crash logs |
| Wrong or missing token | Verify token in UI, update WORKER_TOKEN, restart container |
| Firewall blocking outbound TCP 443 | Add firewall rule allowing outbound to wg.callmeter.io:443 |
| DNS resolution failure | Check DNS resolver, add exception for callmeter.io |
| TLS inspection breaking connection | Bypass TLS inspection for wg.callmeter.io |
| Token regenerated but container not updated | Update WORKER_TOKEN with new token, restart container |
| Worker disabled by admin | Re-enable the worker in the CallMeter UI |
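The container and network checks from the diagnostic steps above can be chained into one first-pass triage. The following is a minimal sketch, not a CallMeter tool: the function name offline_triage and its result strings are hypothetical, and it assumes the container is named callmeter-worker and that nslookup and nc are available inside the image, as the commands above imply.

```shell
# Hypothetical helper: runs the OFFLINE checks in order and prints the
# first one that fails. Result strings are illustrative, not CallMeter output.
offline_triage() {
  local name="${1:-callmeter-worker}"
  local gateway="wg.callmeter.io"

  # Step 1: is the container running? (exact name match)
  if ! docker ps --format '{{.Names}}' | grep -qx "$name"; then
    echo "container-not-running"
    return 1
  fi

  # Step 3a: DNS resolution from inside the container
  if ! docker exec "$name" nslookup "$gateway" >/dev/null 2>&1; then
    echo "dns-failure"
    return 1
  fi

  # Step 3b: outbound TCP 443 from inside the container
  if ! docker exec "$name" nc -zv "$gateway" 443 >/dev/null 2>&1; then
    echo "tcp-443-unreachable"
    return 1
  fi

  # Network looks fine; the token and the logs (steps 2 and 4) remain.
  echo "network-ok"
}
```

A result of network-ok narrows the problem to the token, the license, or the worker's enabled state, which still have to be checked by hand.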
Connection Keeps Dropping
The worker alternates between ONLINE and OFFLINE or ERROR status.
Symptoms
- Worker status flaps between ONLINE and OFFLINE or ERROR
- Logs show repeated "connected" and "disconnected" messages
- Tests fail intermittently due to worker unavailability
Diagnostic Steps
1. Check connection stability:
# Watch logs for connection events
docker logs -f callmeter-worker 2>&1 | grep -i "connect\|disconnect\|reconnect"
Note the pattern: is it regular (suggesting a timeout) or random (suggesting network instability)?
2. Check for OOM kills:
docker inspect callmeter-worker --format='{{.State.OOMKilled}}'
If true, the container is being killed due to memory exhaustion and restarting.
3. Check host resource pressure:
docker stats callmeter-worker --no-stream
Look for CPU or memory near limits.
4. Test network path stability:
# Send 60 pings to the gateway (from host)
ping -c 60 wg.callmeter.io
# Check for packet loss along the route
mtr --report wg.callmeter.io
Common Causes and Resolutions
| Cause | Resolution |
|---|---|
| Aggressive firewall idle timeout | Ensure firewall idle timeout exceeds 60 seconds for TCP to wg.callmeter.io |
| NAT device dropping long-lived connections | Configure NAT keepalive or add TCP keepalive rules |
| Container OOM-killed | Increase Docker memory limit (memory: 4G in compose) |
| Unstable network path | Check for packet loss between worker host and gateway; consider a more stable network path |
| TLS inspection intermittently resetting connections | Bypass wg.callmeter.io from TLS inspection |
| Docker daemon restarts | Check journalctl -u docker for daemon restart events |
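To judge whether drops are regular or random, it can help to bucket disconnect events by minute. The sketch below assumes log lines carry the timestamp prefix that `docker logs -t` adds, and that a drop is logged with the text "Connection lost" as listed in the log message table of this guide. The function name disconnects_per_minute is hypothetical.

```shell
# Hypothetical helper: reads timestamped log lines on stdin and prints
# "<minute> <count>" for each minute that contains a disconnect message.
# Assumes each line starts with an RFC 3339 timestamp, as produced by
# `docker logs -t`, so the first 16 characters identify the minute.
disconnects_per_minute() {
  awk 'tolower($0) ~ /connection lost/ { count[substr($0, 1, 16)]++ }
       END { for (minute in count) print minute, count[minute] }' | sort
}

# Usage (not run here):
#   docker logs -t callmeter-worker 2>&1 | disconnects_per_minute
```

Evenly spaced entries point toward an idle timeout on a firewall or NAT device, while irregular bursts point toward an unstable network path.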
Status Stuck in ERROR
The worker shows ERROR status and does not recover.
Symptoms
- Worker shows ERROR in the dashboard
- Worker does not transition back to ONLINE
- Tests cannot be assigned to this worker
Diagnostic Steps
1. Check the error reason in the dashboard:
Open the worker details in CallMeter. The ERROR status may include a reason such as "heartbeat timeout" or "authentication failure."
2. Check container logs:
docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|heartbeat\|auth\|license"
3. Check if the worker license is active:
Navigate to your organization's Billing page in CallMeter. Verify that your worker license subscription is active and not expired or past due.
4. Check if the worker is enabled:
In the worker details page, verify the worker is not disabled. A disabled worker shows an explicit "Disabled" indicator.
Common Causes and Resolutions
| Cause | Resolution |
|---|---|
| Heartbeat timeout (60+ seconds without heartbeat) | Restart the container; investigate network stability if recurring |
| License expired | Renew your worker license subscription |
| Subscription past due | Update payment method; worker resumes after payment processes |
| Worker disabled by admin | Re-enable the worker in the UI |
| Token invalidated (regenerated elsewhere) | Update container with current token and restart |
| Gateway detected anomalous behavior | Contact support with worker ID and timestamp |
ERROR Status Recovery
In most cases, restarting the worker container clears the ERROR state. If the underlying cause persists (expired license, invalid token), the worker will return to ERROR shortly after reconnecting.
Test Execution Failures
Tests fail to start or endpoints report errors during execution.
Symptoms
- Test shows "No available capacity" error
- Endpoints fail with SIP registration errors
- Endpoints fail with media errors
- Test starts but some endpoints never establish calls
No Available Capacity
Cause: The selected workers do not have enough free endpoint slots.
Resolution:
- Check each worker's capacity usage on the Workers page
- Wait for in-progress tests to complete, or
- Reduce the number of endpoints in your test, or
- Deploy additional workers
SIP Registration Failures
Cause: The worker cannot register with the SIP server.
Diagnostic steps:
# Check if SIP server is reachable from the worker
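docker exec callmeter-worker nc -zuv <sip-server-ip> 5060
Because UDP is connectionless, a success report from nc only shows that no ICMP rejection came back; it does not prove that the SIP server answered.
Common causes: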
- SIP server not reachable from worker's network
- Incorrect SIP credentials in registrar configuration
- SIP server rejecting registrations (check SIP response codes in test results)
- Firewall blocking SIP traffic between worker and SIP server
- Wrong transport protocol (e.g., test configured for TCP but server only accepts UDP)
RTP Media Failures
Cause: SIP calls establish but media streams fail.
Common causes:
- Firewall blocking UDP in the RTP port range between worker and SIP/media server
- NAT misconfiguration (wrong SDP_IP or EXTERNAL_IP)
- RTP port range too narrow for the number of concurrent calls
- SIP server and worker on different networks without proper routing
Resolution:
- Check firewall rules for bidirectional UDP in the RTP port range
- If behind NAT, configure SDP_IP and EXTERNAL_IP (see Configuration)
- Ensure the RTP port range is large enough (default 10000-65535 is sufficient for any deployment)
- Try switching to Docker host networking mode (network_mode: host)
Endpoints Timing Out
Cause: Endpoints start but never transition to active call state.
Common causes:
- SIP server slow to respond (overloaded or misconfigured)
- Buildup period too short (all endpoints registering simultaneously overwhelms the SIP server)
- Callee group not available or misconfigured
- DNS resolution delays for SIP server hostname
High Resource Usage
The worker consumes more CPU or memory than expected.
Symptoms
- Container CPU at or near limit
- Container memory growing continuously
- Docker reports OOMKilled events
- Test results show degraded metrics (high jitter, packet loss)
Diagnostic Steps
# Real-time resource monitoring
docker stats callmeter-worker
# Check for OOM events
docker inspect callmeter-worker --format='{{.State.OOMKilled}}'
# Check container resource limits
docker inspect callmeter-worker --format='{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}'
Resource Usage Guidelines
| Endpoints (audio) | Expected CPU | Expected Memory |
|---|---|---|
| 50 | 1-2 cores | 500 MB - 1 GB |
| 100 | 2-3 cores | 1 - 2 GB |
| 200 | 3-5 cores | 2 - 4 GB |
| 500 | 6-10 cores | 4 - 8 GB |
Video calls multiply CPU usage by 5-10x and memory by 3-4x.
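The resource limit check in the diagnostic steps prints two raw numbers that are hard to compare with the table: Docker reports the memory limit in bytes and the CPU limit in billionths of a core, with 0 meaning no limit. A small conversion sketch follows; the function name limits_human is hypothetical.

```shell
# Hypothetical helper: converts the two raw numbers printed by
#   docker inspect --format='{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}'
# into readable units. 0 means no limit is configured.
limits_human() {
  awk -v mem="$1" -v cpu="$2" 'BEGIN {
    if (mem == 0) m = "unlimited"
    else          m = sprintf("%.1f GiB", mem / (1024 * 1024 * 1024))
    if (cpu == 0) c = "unlimited"
    else          c = sprintf("%.1f cores", cpu / 1000000000)
    print "memory: " m ", cpu: " c
  }'
}

# Usage (not run here):
#   limits_human $(docker inspect callmeter-worker \
#     --format='{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}')
```

Compare the converted limits with the row of the table that matches the worker's configured capacity.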
Resolutions
| Issue | Resolution |
|---|---|
| CPU at limit | Reduce configured capacity or increase CPU limit |
| Memory growing without bound | Possible memory leak; restart container, check for latest image |
| OOMKilled | Increase memory limit (memory: 8G) |
| Degraded metrics at high load | Capacity exceeds hardware capability; reduce endpoints or upgrade hardware |
| Video tests exhausting CPU | Use lower video resolution/FPS, or upgrade to more cores |
Degraded Results at High Load
If a worker is resource-constrained during a test, the metrics it reports will be inaccurate. High CPU usage causes scheduling delays that show up as artificial jitter and packet loss. Always ensure your worker has headroom before trusting quality metrics.
Token Issues
Problems related to the worker authentication token.
Lost Token
Symptom: You need to deploy or redeploy a worker but do not have the token.
Resolution: Tokens are shown once at creation time. If lost, you must regenerate:
- Open the worker in CallMeter
- Click Regenerate Token
- Copy the new token immediately
- Update the WORKER_TOKEN environment variable
- Restart the container
Regeneration Is Immediate
Regenerating a token immediately invalidates the old one. If the worker is currently connected with the old token, it will be disconnected and cannot reconnect until updated.
Invalid Token
Symptom: Logs show authentication failed: invalid token. Worker stays OFFLINE.
Possible causes:
- Token was copied incorrectly (missing characters, extra whitespace)
- Token was regenerated but the container was not updated
- Token belongs to a different worker
- Token includes surrounding quotes that are being passed as part of the value
Resolution:
- Verify the token is exactly 68 characters: printf '%s' "$WORKER_TOKEN" | wc -c
- Verify it starts with cmw_
- Check for surrounding quotes or whitespace
- If in doubt, regenerate and copy carefully
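The format checks above can be scripted so that a bad copy and paste is caught before the container restarts. This is a sketch under the two format rules stated in this guide (prefix cmw_, exactly 68 characters). The function name check_token_format and its messages are hypothetical, and the function cannot tell whether the token is the one currently registered in the CallMeter UI.

```shell
# Hypothetical helper: validates the token format only, not its validity
# on the platform. Prints the first problem found.
check_token_format() {
  local token="$1"

  # Quotes passed through as part of the value
  case "$token" in
    \"*|*\"|\'*|*\') echo "surrounding quotes"; return 1 ;;
  esac

  # Any whitespace (spaces, tabs, newlines) inside or around the token
  if [ "$token" != "$(printf '%s' "$token" | tr -d '[:space:]')" ]; then
    echo "contains whitespace"
    return 1
  fi

  case "$token" in
    cmw_*) ;;
    *) echo "missing cmw_ prefix"; return 1 ;;
  esac

  if [ "${#token}" -ne 68 ]; then
    echo "length is ${#token}, expected 68"
    return 1
  fi

  echo "format ok"
}
```

For example, `check_token_format "$WORKER_TOKEN"` can be run on the host before the value is written into the compose file.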
Duplicate Connection
Symptom: Logs show authentication failed: duplicate connection. Second worker cannot connect.
Cause: Each token can only be used by one worker simultaneously. If two containers try to connect with the same token, the second is rejected.
Resolution:
- Ensure only one container is running with each token
- If a previous container crashed and the connection is stale, wait up to 90 seconds for the platform to release the lock
- Each worker in a multi-worker deployment must have its own unique token
Health Check Failures
The health endpoint reports problems or is unreachable.
Health Endpoint Unreachable
Symptom: curl http://localhost:3030/health returns "Connection refused."
Possible causes:
- Container is not running
- Health port is mapped differently (check HEALTH_PORT and Docker port mapping)
- Container networking issue
Resolution:
- Verify the container is running: docker ps | grep callmeter
- Check the port mapping: docker port callmeter-worker
- Try the container's internal address: docker exec callmeter-worker curl http://localhost:3030/health
Status: unhealthy
Symptom: Health endpoint returns "status": "unhealthy".
Possible causes:
- Worker process encountered an internal error
- Resource exhaustion (CPU, memory)
- Connection to gateway lost
Resolution:
- Check container logs for error messages
- Check resource usage with docker stats
- If no obvious cause, restart the container
Connected: false
Symptom: Health endpoint returns "connected": false but "status": "healthy".
Cause: The worker process is running but has no active connection to the gateway. It may be in the process of reconnecting.
Resolution:
- Wait 30-60 seconds; the worker reconnects automatically with exponential backoff
- If it does not reconnect, check network connectivity to wg.callmeter.io:443
- Check logs for authentication or connection errors
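The three health check cases in this section can be told apart from the endpoint's response alone. The sketch below assumes the response is JSON containing the "status" and "connected" fields quoted above; the function name classify_health and its result strings are hypothetical. It uses grep so that it works on hosts without jq.

```shell
# Hypothetical helper: reads the health endpoint's body on stdin and maps
# it to the cases described in this section.
classify_health() {
  local body
  body="$(cat)"
  if [ -z "$body" ]; then
    # curl -s prints nothing when the connection is refused
    echo "unreachable"
  elif printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"unhealthy"'; then
    echo "unhealthy"
  elif printf '%s' "$body" | grep -Eq '"connected"[[:space:]]*:[[:space:]]*false'; then
    echo "healthy-but-disconnected"
  else
    echo "ok"
  fi
}

# Usage (not run here):
#   curl -s http://localhost:3030/health | classify_health
```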
Log Analysis Guide
Worker logs contain structured information for diagnosing issues. Here is how to read them.
Log Levels in Order of Severity
| Level | Meaning | Action |
|---|---|---|
| TRACE | Packet-level detail | Informational only; not present unless LOG_LEVEL=trace |
| DEBUG | Detailed operational events | Useful for diagnosing specific issues |
| INFO | Normal operation | Expected during healthy operation |
| WARN | Potential issue, non-fatal | Investigate if recurring |
| ERROR | Operation failed | Requires attention |
| FATAL | Worker cannot continue | Worker will exit; check cause immediately |
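A quick count of lines per level shows whether a log is dominated by warnings or errors before you read it in detail. The sketch below assumes the level appears as an uppercase word somewhere on each line, possibly wrapped in brackets or followed by a colon; the exact worker log layout is not specified in this guide, so treat the matching rule as an assumption. The function name count_log_levels is hypothetical.

```shell
# Hypothetical helper: counts log lines per level, in order of severity.
# Only the first level word found on each line is counted.
count_log_levels() {
  awk '{
    for (i = 1; i <= NF; i++) {
      word = $i
      gsub(/[^A-Z]/, "", word)   # strip brackets, colons, digits, lowercase
      if (word ~ /^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)$/) { n[word]++; break }
    }
  }
  END {
    split("TRACE DEBUG INFO WARN ERROR FATAL", order, " ")
    for (i = 1; i <= 6; i++) if (order[i] in n) print order[i], n[order[i]]
  }'
}

# Usage (not run here):
#   docker logs --since 30m callmeter-worker 2>&1 | count_log_levels
```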
Common Log Messages and Meanings
| Log Message | Level | Meaning | Action |
|---|---|---|---|
| Connecting to Worker Gateway | INFO | Initial connection attempt | Normal startup |
| Authentication successful | INFO | Token validated, worker registered | Normal |
| Status: ONLINE | INFO | Worker ready for test assignments | Normal |
| Heartbeat sent | DEBUG | Regular keepalive | Normal |
| Connection lost, reconnecting | WARN | Gateway connection dropped | Check network; worker will retry |
| Reconnecting in Xs | WARN | Backoff before retry | Normal reconnection behavior |
| Authentication failed: invalid token | ERROR | Token rejected | Verify or regenerate token |
| Authentication failed: duplicate connection | ERROR | Token already in use | Stop other instance using this token |
| Authentication failed: worker disabled | ERROR | Worker administratively disabled | Re-enable in UI |
| No capacity available | WARN | Cannot accept more endpoints | Reduce test load or add workers |
| Endpoint registration failed | WARN | SIP REGISTER rejected | Check SIP credentials and server reachability |
| Out of memory | FATAL | Process killed by OS | Increase container memory limit |
Extracting Relevant Logs
# Errors only
docker logs callmeter-worker 2>&1 | grep -i "error\|fatal"
# Connection events
docker logs callmeter-worker 2>&1 | grep -i "connect\|auth\|handshake"
# SIP events
docker logs callmeter-worker 2>&1 | grep -i "register\|invite\|sip"
# Specific time range (last 5 minutes)
docker logs --since 5m callmeter-worker
# Follow logs in real time (filtered)
docker logs -f callmeter-worker 2>&1 | grep -i "error\|warn"
When to Contact Support
Contact support if:
- The worker repeatedly enters ERROR status with no clear cause in logs
- Authentication succeeds but the worker never receives test assignments
- Metrics are missing or clearly incorrect despite normal resource usage
- You encounter a log message you cannot interpret
- The container crashes repeatedly with FATAL errors
- You need help with a complex network topology (multi-NAT, proxy chains, VPN tunnels)
When contacting support, include:
- Worker ID (visible on the Workers page in CallMeter)
- Container logs from the time of the issue: docker logs --since 30m callmeter-worker > worker-logs.txt 2>&1
- Health endpoint output: curl http://localhost:3030/health
- Docker version: docker --version
- Host OS: uname -a
- Network test results: nc -zv wg.callmeter.io 443 output
- Docker resource usage: docker stats --no-stream callmeter-worker
Email: support@callmeter.io
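The items in the list above can be gathered in one pass. The following is a sketch, not an official collection tool: the function name collect_support_bundle, the output directory, and the file names are all hypothetical. The Worker ID is not captured because it is only visible in the CallMeter UI.

```shell
# Hypothetical helper: writes each item from the support checklist into
# its own file. Every command may fail without stopping the rest, so a
# broken worker still yields a partial bundle with the error text inside.
collect_support_bundle() {
  local out="${1:-callmeter-support}"
  local name="${2:-callmeter-worker}"
  mkdir -p "$out" || return 1

  docker logs --since 30m "$name"      > "$out/worker-logs.txt"    2>&1 || true
  curl -s http://localhost:3030/health > "$out/health.json"        2>&1 || true
  docker --version                     > "$out/docker-version.txt" 2>&1 || true
  uname -a                             > "$out/host-os.txt"        2>&1 || true
  nc -zv wg.callmeter.io 443           > "$out/gateway-check.txt"  2>&1 || true
  docker stats --no-stream "$name"     > "$out/docker-stats.txt"   2>&1 || true

  # Print the directory so the caller can archive or attach it
  echo "$out"
}
```

Review the log file for secrets before sending it, and add the Worker ID to the email by hand.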
Quick Reference: Status-to-Action Table
| Status | First Check | Second Check | Escalation |
|---|---|---|---|
| OFFLINE | Container running? | Network to wg.callmeter.io:443? | Token valid? |
| ERROR | Container logs | License active? | Worker enabled? |
| DRAINING | Wait for active tests to finish | If stuck, check for hung endpoints | Restart container |
| ONLINE but tests fail | Capacity available? | SIP server reachable? | Check SIP credentials |
Next Steps
- Configuration: Review environment variable settings
- Networking: Detailed firewall and NAT guidance
- Common Test Failures: Test-level troubleshooting (not worker-specific)
- SIP Registration Errors: SIP-specific error diagnosis