Worker Statuses
Worker connection states, lifecycle transitions, heartbeat behavior, and troubleshooting guidance for cloud and user-owned workers in CallMeter.
Workers are the compute units that execute SIP endpoints for CallMeter tests and probes. Each worker maintains a persistent connection to the CallMeter platform and transitions through defined states based on its connection health and operational status. Understanding worker statuses is essential for ensuring test capacity and diagnosing execution failures.
Worker statuses are visible on the Workers page in your project dashboard, in the test run allocation view, and via the API.
Status Reference
ONLINE
The worker is connected to the CallMeter platform, passing health checks, and available to accept new endpoint assignments.
- Visual indicator: Green
- Accepts new work: Yes
- What it means: The worker has an active connection to the platform gateway, is sending heartbeats on schedule, and has reported healthy resource utilization. This is the only status in which a worker will be allocated endpoints for test runs.
- How it gets here:
- Worker starts, connects to the gateway, completes authentication handshake, and passes initial health check
- Worker recovers from ERROR state after reconnecting and passing health checks
- Worker completes draining (finishes assigned work) and remains connected (rare; workers typically go OFFLINE after draining)
OFFLINE
The worker is not connected to the CallMeter platform. It is either shut down, unreachable, or has not yet started.
- Visual indicator: Gray
- Accepts new work: No
- What it means: The platform has no active connection from this worker. The worker may be stopped, may have lost network connectivity, or may have crashed.
- How it gets here:
- Worker process is stopped or container is shut down
- Worker's heartbeat was missed for longer than the timeout period (60 seconds), and the platform marked it OFFLINE
- Worker completed draining and disconnected gracefully
- Worker was never started after registration
- Impact on running tests: If a worker goes OFFLINE while executing test endpoints, those endpoints are moved to CLOSED phase with an OSERROR outcome. The test run will report partial results for the endpoints that were on the disconnected worker.
DRAINING
The worker is finishing its currently assigned work but will not accept any new endpoint assignments. This is a transitional state used for graceful shutdown.
- Visual indicator: Yellow/Orange
- Accepts new work: No
- What it means: The worker has been instructed to stop accepting new work. It will continue executing all currently assigned endpoints until they complete their lifecycle. Once all endpoints reach CLOSED phase, the worker itself transitions to OFFLINE.
- How it gets here:
- An administrator initiates a drain operation from the Workers page or via the API
- The worker container receives a graceful shutdown signal (SIGTERM) and enters drain mode
- The platform initiates draining in preparation for a maintenance operation
- Behavior during draining:
- Currently assigned endpoints continue executing normally (registration, call setup, media exchange, teardown)
- No new test runs will allocate endpoints to this worker
- Metrics continue to be collected and reported
- Once all endpoints complete, the worker disconnects and goes OFFLINE
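The draining behavior above can be sketched as a small state machine. This is an illustrative model only; the status names come from this page, but the class and method names are hypothetical, not part of any CallMeter SDK.

```python
# Sketch of the DRAINING lifecycle: stop accepting work, finish
# assigned endpoints, then disconnect and go OFFLINE.

class Worker:
    def __init__(self):
        self.status = "ONLINE"
        self.active_endpoints = set()

    def assign(self, endpoint_id):
        # Only ONLINE workers accept new endpoint assignments.
        if self.status != "ONLINE":
            raise RuntimeError(f"worker is {self.status}, not accepting work")
        self.active_endpoints.add(endpoint_id)

    def drain(self):
        # Stop accepting new work; current endpoints keep executing.
        self.status = "DRAINING"
        self._maybe_finish()

    def endpoint_closed(self, endpoint_id):
        # Called when an assigned endpoint reaches the CLOSED phase.
        self.active_endpoints.discard(endpoint_id)
        self._maybe_finish()

    def _maybe_finish(self):
        # When draining and no endpoints remain, disconnect -> OFFLINE.
        if self.status == "DRAINING" and not self.active_endpoints:
            self.status = "OFFLINE"
```

Note that `drain()` does not interrupt in-progress endpoints; the transition to OFFLINE happens only once the last endpoint closes.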
Graceful Shutdown with DRAINING
When you need to take a worker offline for maintenance or updates, use the drain operation rather than stopping the container immediately. Draining ensures that all in-progress endpoints complete their call lifecycle and report metrics, rather than being abruptly terminated and counted as failures.
ERROR
The worker has a connection or health issue that prevents it from accepting work. The platform has detected a problem but the worker has not fully disconnected.
- Visual indicator: Red
- Accepts new work: No
- What it means: The worker's connection is in a degraded state. This may be due to missed heartbeats (but not yet enough to declare OFFLINE), failed health checks, resource exhaustion reports, or authentication issues.
- How it gets here:
- Worker missed one or more consecutive heartbeats but has not yet exceeded the OFFLINE timeout
- Worker reported unhealthy resource utilization (high CPU, memory pressure, disk full)
- Worker authentication token was revoked or expired while the connection was active
- Network instability causing intermittent connection issues
- What happens next: If the issue resolves (heartbeats resume, health checks pass), the worker returns to ONLINE. If the issue persists beyond the timeout, the worker transitions to OFFLINE.
- Impact on running tests: Endpoints currently assigned to a worker in ERROR state continue executing (the media path is independent of the control connection). However, metric reporting may be delayed or interrupted. If the worker transitions to OFFLINE, in-progress endpoints are moved to CLOSED phase with an OSERROR outcome.
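The four statuses and the transitions described above can be collapsed into a small table of allowed moves. This is a reference sketch assembled from this page; the platform's internal state model may carry more detail.

```python
# Allowed worker status transitions, as described in the sections above.

ALLOWED_TRANSITIONS = {
    "OFFLINE": {"ONLINE"},                      # connect + auth + health check
    "ONLINE": {"DRAINING", "ERROR", "OFFLINE"}, # drain, degrade, or disconnect
    "DRAINING": {"OFFLINE", "ONLINE"},          # finish work (ONLINE is rare)
    "ERROR": {"ONLINE", "OFFLINE"},             # recover, or timeout exceeded
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the platform permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```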
Worker Lifecycle
Initial Connection
When a worker starts, it follows this sequence:
- Connect: The worker opens a persistent connection to the CallMeter gateway on port 50052
- Authenticate: The worker sends a handshake message containing its authentication token
- Validate: The gateway validates the token against the worker's registered credentials
- Health check: The gateway requests initial resource utilization metrics from the worker
- ONLINE: If authentication and health checks pass, the worker is marked ONLINE and becomes available for endpoint allocation
The entire connection sequence typically completes in under 2 seconds on a healthy network.
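The startup sequence can be sketched as a single function over a gateway interface. The `connect`, `authenticate`, and `health_check` calls are hypothetical stand-ins for the real protocol, not actual CallMeter APIs; only the step order and resulting statuses come from this page.

```python
# Sketch of the worker startup sequence: connect, authenticate,
# health-check, then ONLINE. Gateway methods are hypothetical.

def start_worker(gateway, token: str) -> str:
    conn = gateway.connect(port=50052)         # 1. open persistent connection
    if not gateway.authenticate(conn, token):  # 2-3. handshake + validation
        return "OFFLINE"                       # rejected workers never go ONLINE
    if not gateway.health_check(conn):         # 4. initial resource metrics
        return "ERROR"
    return "ONLINE"                            # 5. available for allocation
```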
Heartbeat Mechanism
Once connected, the worker sends periodic heartbeat messages to the platform:
- Heartbeat interval: Every 30 seconds
- Heartbeat content: Timestamp, current resource utilization (CPU, memory), number of active endpoints
- Missed heartbeat detection: The platform expects a heartbeat within each 30-second window. If a heartbeat is not received, an internal counter increments.
- ERROR threshold: After at least one missed heartbeat (but before the 60-second OFFLINE timeout elapses), the worker may transition to ERROR
- OFFLINE threshold: After 60 seconds without a heartbeat (two full missed intervals), the worker is marked OFFLINE
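The threshold logic above can be expressed as a simple mapping from heartbeat silence to status. The 30-second interval and 60-second timeout come from this page; the function itself is an illustrative sketch of the platform-side evaluation, not its actual implementation.

```python
# Sketch of platform-side heartbeat evaluation.

HEARTBEAT_INTERVAL = 30  # seconds between expected heartbeats
OFFLINE_TIMEOUT = 60     # seconds of silence before OFFLINE

def status_for_silence(seconds_since_last_heartbeat: float) -> str:
    if seconds_since_last_heartbeat < HEARTBEAT_INTERVAL:
        return "ONLINE"    # heartbeat arrived within its window
    if seconds_since_last_heartbeat < OFFLINE_TIMEOUT:
        return "ERROR"     # at least one heartbeat missed
    return "OFFLINE"       # two full intervals missed
```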
Graceful Shutdown
The recommended shutdown sequence:
- Initiate drain: Send a drain command via the platform UI or API
- DRAINING: The worker stops accepting new work
- Wait for completion: All in-progress endpoints finish their lifecycle
- Disconnect: The worker closes its connection to the gateway
- OFFLINE: The platform marks the worker as OFFLINE
Ungraceful Shutdown
When a worker stops unexpectedly (crash, OOM kill, power failure, network partition):
- Heartbeats stop: The platform stops receiving heartbeats from the worker
- ERROR: After missed heartbeats, the worker may briefly enter ERROR state
- OFFLINE: After 60 seconds without any heartbeat, the worker is marked OFFLINE
- Endpoint failure: All endpoints that were assigned to the worker and had not yet completed are moved to CLOSED phase with an OSERROR outcome
- Test impact: If the worker was the only one assigned to a test run, the run transitions to FAILED
Cloud Workers vs User-Owned Workers
Both worker types use the same status lifecycle and heartbeat mechanism. The differences are in provisioning and authentication.
Cloud Workers
Cloud workers are managed by the CallMeter platform:
- Provisioning: Automatically deployed in CallMeter's regions (e.g., US East, EU West)
- Authentication: Internally managed by the platform
- Availability: Always available, up to the capacity allocated by your plan
- Maintenance: Handled by CallMeter. Workers are drained and updated during maintenance windows.
- Monitoring: Visible on the Workers page with region labels
User-Owned Workers
User-owned workers are Docker containers you deploy on your own infrastructure:
- Provisioning: You deploy the worker container on your own servers or cloud instances
- Authentication: Uses a cmw_ token (68 characters) generated in the CallMeter platform
- Availability: Depends on your own infrastructure. You are responsible for uptime.
- Maintenance: You manage updates by pulling new container images
- Monitoring: Visible on the Workers page alongside cloud workers, labeled as user-owned
- Use case: Testing internal SIP infrastructure that is not reachable from the public internet
Token Security for User-Owned Workers
The cmw_ token is a sensitive credential that grants the worker the ability to execute SIP endpoints on behalf of your organization. Store it securely. If a token is compromised, regenerate it immediately from the Workers page. The old token will stop working as soon as the new one is generated.
Worker Capacity and Allocation
Each worker has a maximum endpoint capacity determined by its available resources (CPU, memory). When a test run is queued, the platform allocates endpoints across available ONLINE workers:
- Region-based allocation: If the test group specifies a region, only ONLINE workers in that region are considered
- Worker-specific allocation: If the test group specifies a particular worker, only that worker is used
- Capacity check: The platform verifies that the total requested endpoints fit within the combined capacity of available workers
- Distribution: Endpoints are distributed across workers to balance load
- Insufficient capacity: If no combination of available workers can accommodate the requested endpoints, the test run transitions to CANNOT_RUN_FOR_NOW
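The allocation steps above can be sketched as a single function: filter ONLINE workers (optionally by region), check combined free capacity, then spread endpoints across workers. The data shapes and field names here are hypothetical; only the behavior (ONLINE-only filtering, the capacity check, and the CANNOT_RUN_FOR_NOW outcome) comes from this page.

```python
# Sketch of endpoint allocation across ONLINE workers.

def allocate(workers, requested, region=None):
    eligible = [w for w in workers
                if w["status"] == "ONLINE"
                and (region is None or w["region"] == region)]
    free = {w["id"]: w["capacity"] - w["active"] for w in eligible}
    # Capacity check: the request must fit in the combined free capacity.
    if sum(free.values()) < requested:
        return "CANNOT_RUN_FOR_NOW", {}
    # Round-robin distribution to balance load across workers.
    plan = {wid: 0 for wid in free}
    order = sorted(free, key=lambda wid: -free[wid])
    remaining = requested
    while remaining:
        for wid in order:
            if remaining == 0:
                break
            if plan[wid] < free[wid]:
                plan[wid] += 1
                remaining -= 1
    return "ALLOCATED", plan
```

A worker-specific allocation is the degenerate case: pass a `workers` list containing only the requested worker.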
Troubleshooting Worker Issues
Worker Stuck in OFFLINE
| Possible Cause | Diagnostic Step | Resolution |
|---|---|---|
| Worker container not running | Check container status on the host machine | Start the worker container |
| Network firewall blocking outbound | Check if port 50052 outbound is open | Open outbound access to the gateway address on port 50052 |
| Invalid authentication token | Check worker container logs for authentication errors | Regenerate the worker token in CallMeter and update the container configuration |
| DNS resolution failure | Check if the worker can resolve the gateway hostname | Verify DNS configuration on the worker host |
| Gateway unreachable | Check network path from worker to gateway | Verify the worker can reach the CallMeter gateway endpoint |
Worker Flapping Between ONLINE and ERROR
| Possible Cause | Diagnostic Step | Resolution |
|---|---|---|
| Network instability | Check for packet loss on the worker's network | Stabilize the network connection or move to a more reliable path |
| Worker under high CPU load | Check CPU utilization on the worker host | Reduce concurrent endpoint load or add more resources |
| Memory pressure | Check memory utilization | Increase container memory limits |
| Gateway congestion | Check if other workers are also flapping | Contact CallMeter support if the issue is widespread |
Worker Goes OFFLINE During Test
When a worker disconnects while a test is running:
- Endpoints on that worker are moved to CLOSED phase with an OSERROR outcome
- The test run continues with endpoints on other workers (if any)
- If the disconnected worker was the only one in the run, the run transitions to FAILED
- Check the worker's container logs for crash reasons (OOM, segfault, panic)
- Review the test's endpoint count relative to the worker's capacity
Related Pages
- Workers Overview -- Worker concepts and types
- Cloud Workers -- Platform-managed workers
- Deploying Your Own Workers -- User-owned worker setup
- Worker Configuration -- Configuration options
- Test Run Statuses -- How worker issues affect test runs
- Common Test Failures -- Diagnosing worker-related failures