Worker Troubleshooting

Diagnose and resolve common worker issues including connection failures, status problems, test execution errors, resource exhaustion, and token issues.

This guide covers every common worker issue, organized by symptom. Start with the symptom you are observing, follow the diagnostic steps, and apply the resolution.

Worker Shows OFFLINE

The worker is not connected to the CallMeter Worker Gateway.

Symptoms

Worker status shows OFFLINE in the CallMeter dashboard
Health endpoint returns "connected": false or is unreachable
No heartbeats received by the platform

Diagnostic Steps

1. Check if the container is running:

docker ps | grep callmeter-worker

If the container is not listed, it has stopped or crashed:

# Check stopped containers
docker ps -a | grep callmeter-worker

# View exit reason
docker inspect callmeter-worker --format='{{.State.ExitCode}} {{.State.Error}}'

# Check logs for crash reason
docker logs --tail 100 callmeter-worker

2. Check the worker token:

Verify the WORKER_TOKEN environment variable is set correctly:

docker exec callmeter-worker env | grep WORKER_TOKEN

The token should:

Start with cmw_
Be exactly 68 characters long
Match the token generated (or last regenerated) in the CallMeter UI
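
The first two checks in this list can be scripted. The following is a minimal sketch, not part of the CallMeter tooling: the function name `check_token_format` is hypothetical, and it tests only the prefix and length stated above. It cannot confirm that the token matches the one shown in the CallMeter UI.

```shell
# Hypothetical helper: checks only the documented token shape
# (prefix cmw_, total length 68). It does not contact CallMeter.
check_token_format() {
  token="$1"
  case "$token" in
    cmw_*) ;;
    *) echo "bad prefix"; return 1 ;;
  esac
  if [ "${#token}" -ne 68 ]; then
    echo "bad length: ${#token}"
    return 1
  fi
  echo "ok"
}
```

Call it as `check_token_format "$WORKER_TOKEN"` in the shell where the variable is set. A token that was stored with surrounding quotes fails the prefix check, which is one of the common mistakes this catches.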

3. Check network connectivity to the gateway:

# DNS resolution
docker exec callmeter-worker nslookup wg.callmeter.io

# TCP connectivity
docker exec callmeter-worker nc -zv wg.callmeter.io 443

If DNS fails, check your DNS resolver configuration. If TCP connection is refused or times out, check your firewall rules for outbound TCP 443.

4. Check container logs for connection errors:

docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|fail\|refused\|timeout"

Common Causes and Resolutions

Cause	Resolution
Container not running	`docker compose up -d` or investigate crash logs
Wrong or missing token	Verify token in UI, update `WORKER_TOKEN`, restart container
Firewall blocking outbound TCP 443	Add firewall rule allowing outbound to `wg.callmeter.io:443`
DNS resolution failure	Check DNS resolver, add exception for `callmeter.io`
TLS inspection breaking connection	Bypass TLS inspection for `wg.callmeter.io`
Token regenerated but container not updated	Update `WORKER_TOKEN` with new token, restart container
Worker disabled by admin	Re-enable the worker in the CallMeter UI

Connection Keeps Dropping

The worker alternates between ONLINE and OFFLINE or ERROR status.

Symptoms

Worker status flaps between ONLINE and OFFLINE or ERROR
Logs show repeated "connected" and "disconnected" messages
Tests fail intermittently due to worker unavailability

Diagnostic Steps

1. Check connection stability:

# Watch logs for connection events
docker logs -f callmeter-worker 2>&1 | grep -i "connect\|disconnect\|reconnect"

Note the pattern: Is it regular (suggesting a timeout) or random (suggesting network instability)?
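
One way to answer that question is to compare the gaps between successive disconnects. The sketch below assumes you have already extracted the disconnect times as epoch seconds, one per line. The function name `classify_gaps` and the 5-second default tolerance are illustrative choices, not CallMeter values.

```shell
# Hypothetical helper: reads disconnect times (epoch seconds, one per
# line, ascending) on stdin and reports whether the gaps between them
# are roughly constant. The optional argument is the tolerance in seconds.
classify_gaps() {
  awk -v tol="${1:-5}" '
    NR > 1 {
      gap = $1 - prev
      if (n == 0 || gap < min) min = gap
      if (n == 0 || gap > max) max = gap
      n++
    }
    { prev = $1 }
    END {
      if (n < 2) { print "not enough data"; exit }
      if (max - min <= tol) print "regular (~" min "-" max "s apart)"
      else print "irregular (" min "-" max "s apart)"
    }'
}
```

Timestamps can be added to the log output with `docker logs -t`. Converting them to epoch seconds depends on the tools available on your host. A regular result points toward an idle timeout somewhere on the path, while an irregular one points toward network instability.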

2. Check for OOM kills:

docker inspect callmeter-worker --format='{{.State.OOMKilled}}'

If true, the container is being killed due to memory exhaustion and restarting.

3. Check host resource pressure:

docker stats callmeter-worker --no-stream

Look for CPU or memory near limits.

4. Test network path stability:

# Send 60 pings to the gateway (from host) and check the loss summary
ping -c 60 wg.callmeter.io

# Check for packet loss
mtr --report wg.callmeter.io

Common Causes and Resolutions

Cause	Resolution
Aggressive firewall idle timeout	Ensure firewall idle timeout exceeds 60 seconds for TCP to `wg.callmeter.io`
NAT device dropping long-lived connections	Configure NAT keepalive or add TCP keepalive rules
Container OOM-killed	Increase Docker memory limit (`memory: 4G` in compose)
Unstable network path	Check for packet loss between worker host and gateway; consider a more stable network path
TLS inspection intermittently resetting connections	Bypass `wg.callmeter.io` from TLS inspection
Docker daemon restarts	Check `journalctl -u docker` for daemon restart events

Status Stuck in ERROR

The worker shows ERROR status and does not recover.

Symptoms

Worker shows ERROR in the dashboard
Worker does not transition back to ONLINE
Tests cannot be assigned to this worker

Diagnostic Steps

1. Check the error reason in the dashboard:

Open the worker details in CallMeter. The ERROR status may include a reason such as "heartbeat timeout" or "authentication failure."

2. Check container logs:

docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|heartbeat\|auth\|license"

3. Check if the worker license is active:

Navigate to your organization's Billing page in CallMeter. Verify that your worker license subscription is active and not expired or past due.

4. Check if the worker is enabled:

In the worker details page, verify the worker is not disabled. A disabled worker shows an explicit "Disabled" indicator.

Common Causes and Resolutions

Cause	Resolution
Heartbeat timeout (60+ seconds without heartbeat)	Restart the container; investigate network stability if recurring
License expired	Renew your worker license subscription
Subscription past due	Update payment method; worker resumes after payment processes
Worker disabled by admin	Re-enable the worker in the UI
Token invalidated (regenerated elsewhere)	Update container with current token and restart
Gateway detected anomalous behavior	Contact support with worker ID and timestamp

ERROR Status Recovery

In most cases, restarting the worker container clears the ERROR state. If the underlying cause persists (expired license, invalid token), the worker will return to ERROR shortly after reconnecting.

Test Execution Failures

Tests fail to start or endpoints report errors during execution.

Symptoms

Test shows "No available capacity" error
Endpoints fail with SIP registration errors
Endpoints fail with media errors
Test starts but some endpoints never establish calls

No Available Capacity

Cause: The selected workers do not have enough free endpoint slots.

Resolution:

Check each worker's capacity usage on the Workers page
Wait for in-progress tests to complete, or
Reduce the number of endpoints in your test, or
Deploy additional workers

SIP Registration Failures

Cause: The worker cannot register with the SIP server.

Diagnostic steps:

# Check if SIP server is reachable from the worker over UDP
# (UDP is connectionless, so a "succeeded" result only means no ICMP rejection came back)
docker exec callmeter-worker nc -zuv <sip-server-ip> 5060

Common causes:

SIP server not reachable from worker's network
Incorrect SIP credentials in registrar configuration
SIP server rejecting registrations (check SIP response codes in test results)
Firewall blocking SIP traffic between worker and SIP server
Wrong transport protocol (e.g., test configured for TCP but server only accepts UDP)

RTP Media Failures

Cause: SIP calls establish but media streams fail.

Common causes:

Firewall blocking UDP in the RTP port range between worker and SIP/media server
NAT misconfiguration (wrong SDP_IP or EXTERNAL_IP)
RTP port range too narrow for the number of concurrent calls
SIP server and worker on different networks without proper routing

Resolution:

Check firewall rules for bidirectional UDP in the RTP port range
If behind NAT, configure SDP_IP and EXTERNAL_IP (see Configuration)
Ensure the RTP port range is large enough (default 10000-65535 is sufficient for any deployment)
Try switching to Docker host networking mode (network_mode: host)

Endpoints Timing Out

Cause: Endpoints start but never transition to active call state.

Common causes:

SIP server slow to respond (overloaded or misconfigured)
Buildup period too short (all endpoints registering simultaneously overwhelms the SIP server)
Callee group not available or misconfigured
DNS resolution delays for SIP server hostname

High Resource Usage

Resource Usage Guidelines

Endpoints (audio)	Expected CPU	Expected Memory
50	1-2 cores	500 MB - 1 GB
100	2-3 cores	1 - 2 GB
200	3-5 cores	2 - 4 GB
500	6-10 cores	4 - 8 GB

Resolutions

Issue	Resolution
CPU at limit	Reduce configured capacity or increase CPU limit
Memory growing without bound	Possible memory leak; restart container, check for latest image
OOMKilled	Increase memory limit (`memory: 8G`)
Degraded metrics at high load	Capacity exceeds hardware capability; reduce endpoints or upgrade hardware
Video tests exhausting CPU	Use lower video resolution/FPS, or upgrade to more cores

If a worker is resource-constrained during a test, the metrics it reports will be inaccurate. High CPU usage causes scheduling delays that show up as artificial jitter and packet loss. Always ensure your worker has headroom before trusting quality metrics.
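
A headroom check before a test run might look like the following sketch. It relies on the `{{.MemPerc}}` placeholder of `docker stats --format`, which reports memory use as a percentage of the container limit. The function names and the 80 percent threshold are assumptions made for illustration. CPU is left out on purpose: `docker stats` reports CPU relative to a single core, so the figure would need scaling by the number of cores available to the container.

```shell
# Hypothetical helper: succeeds when a percentage string such as "41.25%"
# (the shape docker stats prints) is strictly below the given threshold.
below_threshold() {
  awk -v v="${1%"%"}" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

# Example wiring; needs a running container named callmeter-worker.
# The 80 percent threshold is an illustrative choice, not a CallMeter value.
check_memory_headroom() {
  mem=$(docker stats --no-stream --format '{{.MemPerc}}' callmeter-worker) || return 2
  if below_threshold "$mem" 80; then
    echo "memory ok: $mem"
  else
    echo "memory high: $mem"
    return 1
  fi
}
```

Running `check_memory_headroom` while the worker is idle and again during a test shows how much of the memory limit the test consumes.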

Token Issues

Problems related to the worker authentication token.

Lost Token

Symptom: You need to deploy or redeploy a worker but do not have the token.

Resolution: Tokens are shown once at creation time. If lost, you must regenerate:

Open the worker in CallMeter
Click Regenerate Token
Copy the new token immediately
Update the WORKER_TOKEN environment variable
Restart the container

Regeneration Is Immediate

Regenerating a token immediately invalidates the old one. If the worker is currently connected with the old token, it will be disconnected and cannot reconnect until updated.

Invalid Token

Symptom: Logs show `Authentication failed: invalid token`. Worker stays OFFLINE.

Possible causes:

Token was copied incorrectly (missing characters, extra whitespace)
Token was regenerated but the container was not updated
Token belongs to a different worker
Token includes surrounding quotes that are being passed as part of the value

Resolution:

Verify the token is exactly 68 characters: `echo -n "$WORKER_TOKEN" | wc -c`
Verify it starts with cmw_
Check for surrounding quotes or whitespace
If in doubt, regenerate and copy carefully

Duplicate Connection

Symptom: Logs show `Authentication failed: duplicate connection`. Second worker cannot connect.

Cause: Each token can only be used by one worker simultaneously. If two containers try to connect with the same token, the second is rejected.

Resolution:

Ensure only one container is running with each token
If a previous container crashed and the connection is stale, wait up to 90 seconds for the platform to release the lock
Each worker in a multi-worker deployment must have its own unique token

Health Check Failures

The health endpoint reports problems or is unreachable.

Health Endpoint Unreachable

Symptom: `curl http://localhost:3030/health` returns "Connection refused."

Possible causes:

Container is not running
Health port is mapped differently (check HEALTH_PORT and Docker port mapping)
Container networking issue

Resolution:

Verify the container is running: `docker ps | grep callmeter`
Check the port mapping: `docker port callmeter-worker`
Try the container's internal address: `docker exec callmeter-worker curl http://localhost:3030/health`

Status: unhealthy

Symptom: Health endpoint returns "status": "unhealthy".

Possible causes:

Worker process encountered an internal error
Resource exhaustion (CPU, memory)
Connection to gateway lost

Resolution:

Check container logs for error messages
Check resource usage with docker stats
If no obvious cause, restart the container

Connected: false

Symptom: Health endpoint returns "connected": false but "status": "healthy".

Cause: The worker process is running but has no active connection to the gateway. It may be in the process of reconnecting.

Resolution:

Wait 30-60 seconds; the worker reconnects automatically with exponential backoff
If it does not reconnect, check network connectivity to wg.callmeter.io:443
Check logs for authentication or connection errors
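
The waiting step can be automated with a small polling loop. This is a sketch under stated assumptions: the function names are hypothetical, the health output is assumed to contain the literal field `"connected": true` once the worker is back, and the defaults of 12 checks at 5-second intervals are chosen only to cover the 60-second window mentioned above.

```shell
# fetch_health is a thin wrapper so the polling logic can be exercised
# without a live worker. Port 3030 and /health are taken from this guide.
fetch_health() {
  curl -fsS --max-time 5 http://localhost:3030/health
}

# Hypothetical helper: polls until the health output contains
# "connected": true, or gives up after the given number of checks.
# Arguments: number of checks (default 12), delay in seconds (default 5).
wait_for_connected() {
  attempts="${1:-12}"
  delay="${2:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if fetch_health 2>/dev/null | grep -Eq '"connected":[[:space:]]*true'; then
      echo "connected after $i check(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "still disconnected after $attempts checks"
  return 1
}
```

If `wait_for_connected` returns non-zero, continue with the network and log checks listed above.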

Log Analysis Guide

Worker logs contain structured information for diagnosing issues. Here is how to read them.

Log Levels in Order of Severity

Level	Meaning	Action
`TRACE`	Packet-level detail	Informational only; not present unless `LOG_LEVEL=trace`
`DEBUG`	Detailed operational events	Useful for diagnosing specific issues
`INFO`	Normal operation	Expected during healthy operation
`WARN`	Potential issue, non-fatal	Investigate if recurring
`ERROR`	Operation failed	Requires attention
`FATAL`	Worker cannot continue	Worker will exit; check cause immediately

Common Log Messages and Meanings

Log Message	Level	Meaning	Action
`Connecting to Worker Gateway`	INFO	Initial connection attempt	Normal startup
`Authentication successful`	INFO	Token validated, worker registered	Normal
`Status: ONLINE`	INFO	Worker ready for test assignments	Normal
`Heartbeat sent`	DEBUG	Regular keepalive	Normal
`Connection lost, reconnecting`	WARN	Gateway connection dropped	Check network; worker will retry
`Reconnecting in Xs`	WARN	Backoff before retry	Normal reconnection behavior
`Authentication failed: invalid token`	ERROR	Token rejected	Verify or regenerate token
`Authentication failed: duplicate connection`	ERROR	Token already in use	Stop other instance using this token
`Authentication failed: worker disabled`	ERROR	Worker administratively disabled	Re-enable in UI
`No capacity available`	WARN	Cannot accept more endpoints	Reduce test load or add workers
`Endpoint registration failed`	WARN	SIP REGISTER rejected	Check SIP credentials and server reachability
`Out of memory`	FATAL	Process killed by OS	Increase container memory limit

Extracting Relevant Logs

# Errors only
docker logs callmeter-worker 2>&1 | grep -i "error\|fatal"

# Connection events
docker logs callmeter-worker 2>&1 | grep -i "connect\|auth\|handshake"

# SIP events
docker logs callmeter-worker 2>&1 | grep -i "register\|invite\|sip"

# Specific time range (last 5 minutes)
docker logs --since 5m callmeter-worker

# Follow logs in real time (filtered)
docker logs -f callmeter-worker 2>&1 | grep -i "error\|warn"

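A per-level count gives a quick overview before reading individual lines. The sketch below is a coarse illustration: the function name `count_levels` is hypothetical, and it assumes each log line carries its level as an upper-case word such as `WARN`, which may not match the worker's actual log layout. A line is attributed to the most severe level name it contains.

```shell
# Hypothetical helper: counts log lines per level, reading from stdin.
# Assumes the level appears as a stand-alone upper-case word on each line.
count_levels() {
  awk '
    BEGIN { nlevels = split("FATAL ERROR WARN INFO DEBUG TRACE", levels, " ") }
    {
      for (i = 1; i <= nlevels; i++)
        if ($0 ~ ("(^|[^A-Z])" levels[i] "([^A-Z]|$)")) {
          count[levels[i]]++
          break
        }
    }
    END {
      for (i = 1; i <= nlevels; i++)
        printf "%s %d\n", levels[i], count[levels[i]]
    }'
}
```

For example, `docker logs --since 30m callmeter-worker 2>&1 | count_levels` summarizes the last half hour.
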
When to Contact Support

Contact support if:

The worker repeatedly enters ERROR status with no clear cause in logs
Authentication succeeds but the worker never receives test assignments
Metrics are missing or clearly incorrect despite normal resource usage
You encounter a log message you cannot interpret
The container crashes repeatedly with FATAL errors
You need help with a complex network topology (multi-NAT, proxy chains, VPN tunnels)

When contacting support, include:

Worker ID (visible on the Workers page in CallMeter)
Container logs from the time of the issue: `docker logs --since 30m callmeter-worker > worker-logs.txt`
Health endpoint output: `curl http://localhost:3030/health`
Docker version: `docker --version`
Host OS: `uname -a`
Network test results: output of `nc -zv wg.callmeter.io 443`
Docker resource usage: `docker stats --no-stream callmeter-worker`

Email: support@callmeter.io
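
The diagnostic items listed above can be gathered in one pass. The following sketch is not an official CallMeter tool: the function name, the output file names, and the 5-second timeouts are assumptions. The worker ID still has to be added by hand, since it is only visible in the CallMeter UI.

```shell
# Hypothetical helper: collects the requested diagnostics into one directory.
# Each command is allowed to fail so that a partial bundle is still
# produced (for example when the container is not running).
collect_support_bundle() {
  out_dir="${1:-callmeter-support-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out_dir" || return 1
  docker logs --since 30m callmeter-worker > "$out_dir/worker-logs.txt" 2>&1 || true
  curl -sS --max-time 5 http://localhost:3030/health > "$out_dir/health.txt" 2>&1 || true
  docker --version > "$out_dir/docker-version.txt" 2>&1 || true
  uname -a > "$out_dir/host-os.txt" 2>&1 || true
  nc -zv -w 5 wg.callmeter.io 443 > "$out_dir/network-test.txt" 2>&1 || true
  docker stats --no-stream callmeter-worker > "$out_dir/docker-stats.txt" 2>&1 || true
  echo "$out_dir"
}
```

Running `collect_support_bundle` prints the directory it created. Review the files before sending them, in case the logs contain anything you do not want to share.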

Quick Reference: Status-to-Action Table

Status	First Check	Second Check	Escalation
OFFLINE	Container running?	Network to `wg.callmeter.io:443`?	Token valid?
ERROR	Container logs	License active?	Worker enabled?
DRAINING	Wait for active tests to finish	If stuck, check for hung endpoints	Restart container
ONLINE but tests fail	Capacity available?	SIP server reachable?	Check SIP credentials

Next Steps

Configuration: Review environment variable settings
Networking: Detailed firewall and NAT guidance
Common Test Failures: Test-level troubleshooting (not worker-specific)
SIP Registration Errors: SIP-specific error diagnosis
