CallMeter Docs

Worker Troubleshooting

Diagnose and resolve common worker issues including connection failures, status problems, test execution errors, resource exhaustion, and token issues.

This guide covers every common worker issue, organized by symptom. Start with the symptom you are observing, follow the diagnostic steps, and apply the resolution.

Worker Shows OFFLINE

The worker is not connected to the CallMeter Worker Gateway.

Symptoms

  • Worker status shows OFFLINE in the CallMeter dashboard
  • Health endpoint returns "connected": false or is unreachable
  • No heartbeats received by the platform

Diagnostic Steps

1. Check if the container is running:

docker ps | grep callmeter-worker

If the container is not listed, it has stopped or crashed:

# Check stopped containers
docker ps -a | grep callmeter-worker

# View exit reason
docker inspect callmeter-worker --format='{{.State.ExitCode}} {{.State.Error}}'

# Check logs for crash reason
docker logs --tail 100 callmeter-worker
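The exit code printed by the inspect command can be read by convention: values above 128 usually mean the process was ended by a signal (the code minus 128). The following is a minimal sketch of that mapping; `explain_exit_code` is a hypothetical helper name, not part of the worker or Docker.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
# Codes above 128 follow the common convention 128 + signal number.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly (stopped, not crashed)" ;;
    137) echo "killed by SIGKILL (often an OOM kill or 'docker kill')" ;;
    143) echo "terminated by SIGTERM (for example 'docker stop')" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application error (exit code $code); check the logs"
      fi
      ;;
  esac
}

# Usage (requires Docker and a container named callmeter-worker):
# explain_exit_code "$(docker inspect callmeter-worker --format='{{.State.ExitCode}}')"
```

An exit code of 137 is worth cross-checking against the OOMKilled flag, since a manual kill produces the same code.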

2. Check the worker token:

Verify the WORKER_TOKEN environment variable is set correctly:

docker exec callmeter-worker env | grep WORKER_TOKEN

The token should:

  • Start with cmw_
  • Be exactly 68 characters long
  • Match the token generated (or last regenerated) in the CallMeter UI

3. Check network connectivity to the gateway:

# DNS resolution
docker exec callmeter-worker nslookup wg.callmeter.io

# TCP connectivity
docker exec callmeter-worker nc -zv wg.callmeter.io 443

If DNS fails, check your DNS resolver configuration. If TCP connection is refused or times out, check your firewall rules for outbound TCP 443.
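The two checks can be chained so that the first failure is reported with its matching next step. This is a sketch only: `check_gateway` is a hypothetical helper, and it assumes `nslookup` and `nc` are available inside the image, as the commands above already do.

```shell
# Hypothetical helper: run the DNS and TCP checks in order and report
# the first one that fails. ASSUMPTION: nslookup and nc exist in the image.
check_gateway() {
  host=${1:-wg.callmeter.io}
  port=${2:-443}
  if ! docker exec callmeter-worker nslookup "$host" >/dev/null 2>&1; then
    echo "dns-failed: check the DNS resolver configuration"
    return 1
  fi
  # -w 5 gives up after five seconds instead of waiting on a silent firewall
  if ! docker exec callmeter-worker nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
    echo "tcp-failed: check firewall rules for outbound TCP $port"
    return 2
  fi
  echo "ok: $host:$port is reachable from the container"
}

# Usage (requires Docker): check_gateway
```

The return code distinguishes the two failure modes (1 for DNS, 2 for TCP), which is convenient if the check is run from a monitoring script.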

4. Check container logs for connection errors:

docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|fail\|refused\|timeout"

Common Causes and Resolutions

| Cause | Resolution |
| --- | --- |
| Container not running | docker compose up -d or investigate crash logs |
| Wrong or missing token | Verify token in UI, update WORKER_TOKEN, restart container |
| Firewall blocking outbound TCP 443 | Add firewall rule allowing outbound to wg.callmeter.io:443 |
| DNS resolution failure | Check DNS resolver, add exception for callmeter.io |
| TLS inspection breaking connection | Bypass TLS inspection for wg.callmeter.io |
| Token regenerated but container not updated | Update WORKER_TOKEN with new token, restart container |
| Worker disabled by admin | Re-enable the worker in the CallMeter UI |

Connection Keeps Dropping

The worker alternates between ONLINE and OFFLINE or ERROR status.

Symptoms

  • Worker status flaps between ONLINE and OFFLINE or ERROR
  • Logs show repeated "connected" and "disconnected" messages
  • Tests fail intermittently due to worker unavailability

Diagnostic Steps

1. Check connection stability:

# Watch logs for connection events
docker logs -f callmeter-worker 2>&1 | grep -i "connect\|disconnect\|reconnect"

Note the pattern: Is it regular (suggesting a timeout) or random (suggesting network instability)?
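One way to judge the pattern is to print the number of seconds between consecutive disconnect events. The sketch below relies on `docker logs -t`, which prefixes each line with an RFC 3339 timestamp. `disconnect_gaps` is a hypothetical helper, and it assumes disconnect events contain the word "disconnect", as the grep pattern above does.

```shell
# Hypothetical helper: read log lines that start with an RFC 3339
# timestamp (as added by `docker logs -t`) and print the number of
# seconds between consecutive lines. Only the time of day is used,
# so a gap that crosses midnight is corrected by adding 24 hours.
disconnect_gaps() {
  awk '{
    split($1, dt, "T")            # dt[2] is "HH:MM:SS.fraction"
    split(dt[2], t, ":")
    now = t[1] * 3600 + t[2] * 60 + int(t[3])
    if (NR > 1) {
      gap = now - prev
      if (gap < 0) gap += 86400
      print gap
    }
    prev = now
  }'
}

# Usage (requires Docker):
# docker logs -t callmeter-worker 2>&1 | grep -i "disconnect" | disconnect_gaps
```

Gaps that cluster around one value point toward an idle timeout on a firewall or NAT device; widely scattered gaps point toward network instability.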

2. Check for OOM kills:

docker inspect callmeter-worker --format='{{.State.OOMKilled}}'

If true, the container is being killed due to memory exhaustion and restarting.

3. Check host resource pressure:

docker stats callmeter-worker --no-stream

Look for CPU or memory near limits.

4. Test network path stability:

# Send 60 pings to the gateway (from the host) and review the loss summary
ping -c 60 wg.callmeter.io

# Check for packet loss
mtr --report wg.callmeter.io

Common Causes and Resolutions

| Cause | Resolution |
| --- | --- |
| Aggressive firewall idle timeout | Ensure firewall idle timeout exceeds 60 seconds for TCP to wg.callmeter.io |
| NAT device dropping long-lived connections | Configure NAT keepalive or add TCP keepalive rules |
| Container OOM-killed | Increase Docker memory limit (memory: 4G in compose) |
| Unstable network path | Check for packet loss between worker host and gateway; consider a more stable network path |
| TLS inspection intermittently resetting connections | Bypass wg.callmeter.io from TLS inspection |
| Docker daemon restarts | Check journalctl -u docker for daemon restart events |

Status Stuck in ERROR

The worker shows ERROR status and does not recover.

Symptoms

  • Worker shows ERROR in the dashboard
  • Worker does not transition back to ONLINE
  • Tests cannot be assigned to this worker

Diagnostic Steps

1. Check the error reason in the dashboard:

Open the worker details in CallMeter. The ERROR status may include a reason such as "heartbeat timeout" or "authentication failure."

2. Check container logs:

docker logs --tail 200 callmeter-worker 2>&1 | grep -i "error\|heartbeat\|auth\|license"

3. Check if the worker license is active:

Navigate to your organization's Billing page in CallMeter. Verify that your worker license subscription is active and not expired or past due.

4. Check if the worker is enabled:

In the worker details page, verify the worker is not disabled. A disabled worker shows an explicit "Disabled" indicator.

Common Causes and Resolutions

| Cause | Resolution |
| --- | --- |
| Heartbeat timeout (60+ seconds without heartbeat) | Restart the container; investigate network stability if recurring |
| License expired | Renew your worker license subscription |
| Subscription past due | Update payment method; worker resumes after payment processes |
| Worker disabled by admin | Re-enable the worker in the UI |
| Token invalidated (regenerated elsewhere) | Update container with current token and restart |
| Gateway detected anomalous behavior | Contact support with worker ID and timestamp |

ERROR Status Recovery

In most cases, restarting the worker container clears the ERROR state. If the underlying cause persists (expired license, invalid token), the worker will return to ERROR shortly after reconnecting.

Test Execution Failures

Tests fail to start or endpoints report errors during execution.

Symptoms

  • Test shows "No available capacity" error
  • Endpoints fail with SIP registration errors
  • Endpoints fail with media errors
  • Test starts but some endpoints never establish calls

No Available Capacity

Cause: The selected workers do not have enough free endpoint slots.

Resolution:

  1. Check each worker's capacity usage on the Workers page
  2. Wait for in-progress tests to complete, or
  3. Reduce the number of endpoints in your test, or
  4. Deploy additional workers

SIP Registration Failures

Cause: The worker cannot register with the SIP server.

Diagnostic steps:

# Check if SIP server is reachable from the worker.
# UDP is connectionless, so a "succeeded" result only means no ICMP rejection came back.
docker exec callmeter-worker nc -zuv <sip-server-ip> 5060

Common causes:

  • SIP server not reachable from worker's network
  • Incorrect SIP credentials in registrar configuration
  • SIP server rejecting registrations (check SIP response codes in test results)
  • Firewall blocking SIP traffic between worker and SIP server
  • Wrong transport protocol (e.g., test configured for TCP but server only accepts UDP)

RTP Media Failures

Cause: SIP calls establish but media streams fail.

Common causes:

  • Firewall blocking UDP in the RTP port range between worker and SIP/media server
  • NAT misconfiguration (wrong SDP_IP or EXTERNAL_IP)
  • RTP port range too narrow for the number of concurrent calls
  • SIP server and worker on different networks without proper routing

Resolution:

  1. Check firewall rules for bidirectional UDP in the RTP port range
  2. If behind NAT, configure SDP_IP and EXTERNAL_IP (see Configuration)
  3. Ensure the RTP port range is large enough (default 10000-65535 is sufficient for any deployment)
  4. Try switching to Docker host networking mode (network_mode: host)
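To judge whether a custom RTP port range is wide enough, a rough estimate of the ports a test needs can help. The sketch below assumes two UDP ports per media stream (one for RTP, one for RTCP), which is a common convention but not a documented property of the worker. Both helper names are hypothetical.

```shell
# Hypothetical helper: estimate how many UDP ports a test needs.
# ASSUMPTION: each media stream uses two ports (RTP plus RTCP); the
# worker's real allocation strategy may differ.
rtp_ports_needed() {
  calls=$1
  streams_per_call=${2:-1}   # 1 for audio only, 2 for audio plus video
  echo $((calls * streams_per_call * 2))
}

# Hypothetical helper: succeed if the range START-END holds NEEDED ports.
rtp_range_fits() {
  start=$1
  end=$2
  needed=$3
  [ $((end - start + 1)) -ge "$needed" ]
}

# Usage: does a narrowed range of 10000-10999 cover 500 audio calls?
# rtp_range_fits 10000 10999 "$(rtp_ports_needed 500)" && echo "range is wide enough"
```

Under this assumption, 500 audio calls need about 1,000 ports, far below the 55,536 ports in the default range, so sizing only matters when the range has been narrowed for firewall reasons.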

Endpoints Timing Out

Cause: Endpoints start but never transition to active call state.

Common causes:

  • SIP server slow to respond (overloaded or misconfigured)
  • Buildup period too short (all endpoints registering simultaneously overwhelms the SIP server)
  • Callee group not available or misconfigured
  • DNS resolution delays for SIP server hostname

High Resource Usage

The worker consumes more CPU or memory than expected.

Symptoms

  • Container CPU at or near limit
  • Container memory growing continuously
  • Docker reports OOMKilled events
  • Test results show degraded metrics (high jitter, packet loss)

Diagnostic Steps

# Real-time resource monitoring
docker stats callmeter-worker

# Check for OOM events
docker inspect callmeter-worker --format='{{.State.OOMKilled}}'

# Check container resource limits
docker inspect callmeter-worker --format='{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}'
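The limits command prints raw numbers: memory in bytes and CPU in billionths of a core, with 0 meaning the limit is not set. A small sketch for converting them into readable units follows; `format_limits` is a hypothetical helper.

```shell
# Hypothetical helper: convert the raw values printed by the inspect
# command (memory in bytes, CPU in billionths of a core) into readable
# units. Docker reports 0 when the corresponding limit is not set.
format_limits() {
  awk -v mem="$1" -v cpu="$2" 'BEGIN {
    m = (mem == 0) ? "unlimited" : sprintf("%.1f GiB", mem / 1073741824)
    c = (cpu == 0) ? "unlimited" : sprintf("%.1f cores", cpu / 1000000000)
    print "memory: " m ", cpu: " c
  }'
}

# Usage (requires Docker; the unquoted substitution splits the two values):
# format_limits $(docker inspect callmeter-worker \
#   --format='{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}')
```

A CPU value of 0 here only means no limit was set through NanoCpus; a limit configured through CPU quota and period settings would not show up in this field.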

Resource Usage Guidelines

| Endpoints (audio) | Expected CPU | Expected Memory |
| --- | --- | --- |
| 50 | 1-2 cores | 500 MB - 1 GB |
| 100 | 2-3 cores | 1 - 2 GB |
| 200 | 3-5 cores | 2 - 4 GB |
| 500 | 6-10 cores | 4 - 8 GB |

Video calls multiply CPU usage by 5-10x and memory by 3-4x.
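For scripted capacity checks, the audio guidelines can be expressed as a lookup. This is a sketch only: `expected_resources` is a hypothetical helper, and rounding an endpoint count up to the next listed row is an assumption, since the guidelines give only four data points.

```shell
# Hypothetical helper: look up the guideline row that covers a given
# number of audio endpoints. ASSUMPTION: counts between two rows are
# rounded up to the next row; the guidelines list only four points.
expected_resources() {
  endpoints=$1
  if   [ "$endpoints" -le 50 ];  then echo "1-2 cores, 500 MB - 1 GB"
  elif [ "$endpoints" -le 100 ]; then echo "2-3 cores, 1 - 2 GB"
  elif [ "$endpoints" -le 200 ]; then echo "3-5 cores, 2 - 4 GB"
  elif [ "$endpoints" -le 500 ]; then echo "6-10 cores, 4 - 8 GB"
  else echo "beyond the published guidelines"
  fi
}

# Usage: expected_resources 150
```

Compare the result with the limits reported by docker inspect and the live usage from docker stats before raising configured capacity.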

Resolutions

| Issue | Resolution |
| --- | --- |
| CPU at limit | Reduce configured capacity or increase CPU limit |
| Memory growing without bound | Possible memory leak; restart container, check for latest image |
| OOMKilled | Increase memory limit (memory: 8G) |
| Degraded metrics at high load | Capacity exceeds hardware capability; reduce endpoints or upgrade hardware |
| Video tests exhausting CPU | Use lower video resolution/FPS, or upgrade to more cores |

Degraded Results at High Load

If a worker is resource-constrained during a test, the metrics it reports will be inaccurate. High CPU usage causes scheduling delays that show up as artificial jitter and packet loss. Always ensure your worker has headroom before trusting quality metrics.

Token Issues

Problems related to the worker authentication token.

Lost Token

Symptom: You need to deploy or redeploy a worker but do not have the token.

Resolution: Tokens are shown once at creation time. If lost, you must regenerate:

  1. Open the worker in CallMeter
  2. Click Regenerate Token
  3. Copy the new token immediately
  4. Update the WORKER_TOKEN environment variable
  5. Restart the container

Regeneration Is Immediate

Regenerating a token immediately invalidates the old one. If the worker is currently connected with the old token, it will be disconnected and cannot reconnect until updated.

Invalid Token

Symptom: Logs show authentication failed: invalid token. Worker stays OFFLINE.

Possible causes:

  • Token was copied incorrectly (missing characters, extra whitespace)
  • Token was regenerated but the container was not updated
  • Token belongs to a different worker
  • Token includes surrounding quotes that are being passed as part of the value

Resolution:

  1. Verify the token is exactly 68 characters: echo -n "$WORKER_TOKEN" | wc -c
  2. Verify it starts with cmw_
  3. Check for surrounding quotes or whitespace
  4. If in doubt, regenerate and copy carefully
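The format checks above can be combined into one function that reports the first problem it finds. This is a minimal sketch; `check_worker_token` is a hypothetical helper and only validates the format, not whether the platform accepts the token.

```shell
# Hypothetical helper: check a token against the format rules above.
# Prints the first problem found and returns non-zero, or prints "ok".
check_worker_token() {
  token=$1
  case "$token" in
    \"*|*\"|\'*|*\') echo "token has surrounding quotes"; return 1 ;;
  esac
  # Remove all whitespace and compare; any difference means stray spaces,
  # tabs, or newlines were copied along with the token.
  stripped=$(printf '%s' "$token" | tr -d '[:space:]')
  if [ "$stripped" != "$token" ]; then
    echo "token contains whitespace"
    return 1
  fi
  case "$token" in
    cmw_*) ;;
    *) echo "token does not start with cmw_"; return 1 ;;
  esac
  if [ "${#token}" -ne 68 ]; then
    echo "token is ${#token} characters, expected 68"
    return 1
  fi
  echo "ok"
}

# Usage: check_worker_token "$WORKER_TOKEN"
```

Running the check against the value seen inside the container (rather than the one in your shell) catches quoting mistakes introduced by an env file or compose file.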

Duplicate Connection

Symptom: Logs show authentication failed: duplicate connection. Second worker cannot connect.

Cause: Each token can only be used by one worker simultaneously. If two containers try to connect with the same token, the second is rejected.

Resolution:

  • Ensure only one container is running with each token
  • If a previous container crashed and the connection is stale, wait up to 90 seconds for the platform to release the lock
  • Each worker in a multi-worker deployment must have its own unique token
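On a single host, containers that share a token can be found by listing each container with its WORKER_TOKEN and grouping by token. The sketch below is hypothetical (`find_shared_tokens` is not a CallMeter tool), assumes `printenv` exists in the image, and cannot see containers running on other hosts.

```shell
# Hypothetical helper: read "container-name token" pairs on stdin and
# print the names of containers that share a token with another one.
# Lines without a token (containers that are not workers) are skipped.
find_shared_tokens() {
  awk 'NF >= 2 {
    count[$2]++
    names[$2] = (names[$2] == "" ? $1 : names[$2] " " $1)
  }
  END {
    for (tok in count) if (count[tok] > 1) print names[tok]
  }'
}

# Usage (requires Docker; ASSUMPTION: printenv exists in the image):
# for name in $(docker ps --format '{{.Names}}'); do
#   printf '%s %s\n' "$name" "$(docker exec "$name" printenv WORKER_TOKEN 2>/dev/null)"
# done | find_shared_tokens
```

Only container names are printed, so the output is safe to paste into a ticket without exposing the token itself.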

Health Check Failures

The health endpoint reports problems or is unreachable.

Health Endpoint Unreachable

Symptom: curl http://localhost:3030/health returns "Connection refused."

Possible causes:

  • Container is not running
  • Health port is mapped differently (check HEALTH_PORT and Docker port mapping)
  • Container networking issue

Resolution:

  1. Verify the container is running: docker ps | grep callmeter
  2. Check the port mapping: docker port callmeter-worker
  3. Try the container's internal address: docker exec callmeter-worker curl http://localhost:3030/health

Status: unhealthy

Symptom: Health endpoint returns "status": "unhealthy".

Possible causes:

  • Worker process encountered an internal error
  • Resource exhaustion (CPU, memory)
  • Connection to gateway lost

Resolution:

  1. Check container logs for error messages
  2. Check resource usage with docker stats
  3. If no obvious cause, restart the container

Connected: false

Symptom: Health endpoint returns "connected": false but "status": "healthy".

Cause: The worker process is running but has no active connection to the gateway. It may be in the process of reconnecting.

Resolution:

  1. Wait 30-60 seconds; the worker reconnects automatically with exponential backoff
  2. If it does not reconnect, check network connectivity to wg.callmeter.io:443
  3. Check logs for authentication or connection errors
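The health states described in this section can be told apart with a small filter over the endpoint's JSON. This is a sketch under assumptions: `health_state` is a hypothetical helper, it reads only the "status" and "connected" fields named above, and it uses grep rather than a JSON parser so that no extra tools are needed.

```shell
# Hypothetical helper: classify the JSON returned by the health endpoint.
# Only the "status" and "connected" fields named in this guide are read.
health_state() {
  body=$(cat)
  if [ -z "$body" ]; then
    echo "unreachable"
  elif ! printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'; then
    echo "unhealthy"
  elif printf '%s' "$body" | grep -Eq '"connected"[[:space:]]*:[[:space:]]*true'; then
    echo "connected"
  else
    echo "disconnected"
  fi
}

# Usage: curl -s --max-time 5 http://localhost:3030/health | health_state
```

An empty response is reported as "unreachable" because curl prints nothing when the connection is refused.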

Log Analysis Guide

Worker logs contain structured information for diagnosing issues. Here is how to read them.

Log Levels in Order of Severity

| Level | Meaning | Action |
| --- | --- | --- |
| TRACE | Packet-level detail | Informational only; not present unless LOG_LEVEL=trace |
| DEBUG | Detailed operational events | Useful for diagnosing specific issues |
| INFO | Normal operation | Expected during healthy operation |
| WARN | Potential issue, non-fatal | Investigate if recurring |
| ERROR | Operation failed | Requires attention |
| FATAL | Worker cannot continue | Worker will exit; check cause immediately |

Common Log Messages and Meanings

| Log Message | Level | Meaning | Action |
| --- | --- | --- | --- |
| Connecting to Worker Gateway | INFO | Initial connection attempt | Normal startup |
| Authentication successful | INFO | Token validated, worker registered | Normal |
| Status: ONLINE | INFO | Worker ready for test assignments | Normal |
| Heartbeat sent | DEBUG | Regular keepalive | Normal |
| Connection lost, reconnecting | WARN | Gateway connection dropped | Check network; worker will retry |
| Reconnecting in Xs | WARN | Backoff before retry | Normal reconnection behavior |
| Authentication failed: invalid token | ERROR | Token rejected | Verify or regenerate token |
| Authentication failed: duplicate connection | ERROR | Token already in use | Stop other instance using this token |
| Authentication failed: worker disabled | ERROR | Worker administratively disabled | Re-enable in UI |
| No capacity available | WARN | Cannot accept more endpoints | Reduce test load or add workers |
| Endpoint registration failed | WARN | SIP REGISTER rejected | Check SIP credentials and server reachability |
| Out of memory | FATAL | Process killed by OS | Increase container memory limit |

Extracting Relevant Logs

# Errors only
docker logs callmeter-worker 2>&1 | grep -i "error\|fatal"

# Connection events
docker logs callmeter-worker 2>&1 | grep -i "connect\|auth\|handshake"

# SIP events
docker logs callmeter-worker 2>&1 | grep -i "register\|invite\|sip"

# Specific time range (last 5 minutes)
docker logs --since 5m callmeter-worker

# Follow logs in real time (filtered)
docker logs -f callmeter-worker 2>&1 | grep -i "error\|warn"
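A per-level count gives a quick overview before reading individual lines. The sketch below assumes the level name appears as an upper-case word near the start of each line, optionally wrapped in punctuation such as brackets; the exact log layout is not specified in this guide, and `count_log_levels` is a hypothetical helper.

```shell
# Hypothetical helper: count log lines per level. ASSUMPTION: the level
# name appears as an upper-case word in each line (for example INFO or
# [INFO]); the first such word on a line is taken as its level.
count_log_levels() {
  awk '{
    for (i = 1; i <= NF; i++) {
      w = $i
      gsub(/[^A-Z]/, "", w)      # drop brackets, colons, and other punctuation
      if (w ~ /^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)$/) { n[w]++; break }
    }
  }
  END {
    split("TRACE DEBUG INFO WARN ERROR FATAL", order, " ")
    for (i = 1; i <= 6; i++) printf "%s=%d\n", order[i], n[order[i]]
  }'
}

# Usage (requires Docker):
# docker logs --since 30m callmeter-worker 2>&1 | count_log_levels
```

A rising WARN or ERROR count over successive runs is a cue to apply the targeted filters above.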

When to Contact Support

Contact support if:

  • The worker repeatedly enters ERROR status with no clear cause in logs
  • Authentication succeeds but the worker never receives test assignments
  • Metrics are missing or clearly incorrect despite normal resource usage
  • You encounter a log message you cannot interpret
  • The container crashes repeatedly with FATAL errors
  • You need help with a complex network topology (multi-NAT, proxy chains, VPN tunnels)

When contacting support, include:

  1. Worker ID (visible on the Workers page in CallMeter)
  2. Container logs from the time of the issue: docker logs --since 30m callmeter-worker > worker-logs.txt 2>&1
  3. Health endpoint output: curl http://localhost:3030/health
  4. Docker version: docker --version
  5. Host OS: uname -a
  6. Network test results: nc -zv wg.callmeter.io 443 output
  7. Docker resource usage: docker stats --no-stream callmeter-worker
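The items above can be gathered in one pass. The following sketch writes each command's output to a file in a single directory; `collect_support_bundle` and the file names are hypothetical, and the worker ID still has to be copied from the Workers page by hand.

```shell
# Hypothetical helper: gather the items listed above into one directory.
# Each command is allowed to fail so that a partial bundle is still written.
collect_support_bundle() {
  dir=${1:-callmeter-support}
  mkdir -p "$dir" || return 1
  docker logs --since 30m callmeter-worker > "$dir/worker-logs.txt" 2>&1 || true
  curl -s --max-time 5 http://localhost:3030/health > "$dir/health.json" 2>&1 || true
  docker --version > "$dir/docker-version.txt" 2>&1 || true
  uname -a > "$dir/host-os.txt" 2>&1 || true
  nc -zv -w 5 wg.callmeter.io 443 > "$dir/network-test.txt" 2>&1 || true
  docker stats --no-stream callmeter-worker > "$dir/docker-stats.txt" 2>&1 || true
  echo "$dir"
}

# Usage: collect_support_bundle ./callmeter-support
```

Review the collected logs before sending them, in case they contain details you do not want to share.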

Email: support@callmeter.io

Quick Reference: Status-to-Action Table

| Status | First Check | Second Check | Escalation |
| --- | --- | --- | --- |
| OFFLINE | Container running? | Network to wg.callmeter.io:443? | Token valid? |
| ERROR | Container logs | License active? | Worker enabled? |
| DRAINING | Wait for active tests to finish | If stuck, check for hung endpoints | Restart container |
| ONLINE but tests fail | Capacity available? | SIP server reachable? | Check SIP credentials |
