Worker Statuses

Worker connection states, lifecycle transitions, heartbeat behavior, and troubleshooting guidance for cloud and user-owned workers in CallMeter.

Workers are the compute units that execute SIP endpoints for CallMeter tests and probes. Each worker maintains a persistent connection to the CallMeter platform and transitions through defined states based on its connection health and operational status. Understanding worker statuses is essential for ensuring test capacity and diagnosing execution failures.

Worker statuses are visible on the Workers page in your project dashboard, in the test run allocation view, and via the API.

Status Reference

ONLINE

The worker is connected to the CallMeter platform, passing health checks, and available to accept new endpoint assignments.

  • Visual indicator: Green
  • Accepts new work: Yes
  • What it means: The worker has an active connection to the platform gateway, is sending heartbeats on schedule, and has reported healthy resource utilization. This is the only status in which a worker will be allocated endpoints for test runs.
  • How it gets here:
    • Worker starts, connects to the gateway, completes authentication handshake, and passes initial health check
    • Worker recovers from ERROR state after reconnecting and passing health checks
    • Worker completes draining (finishes assigned work) and remains connected (rare; workers typically go OFFLINE after draining)

OFFLINE

The worker is not connected to the CallMeter platform. It is either shut down, unreachable, or has not yet started.

  • Visual indicator: Gray
  • Accepts new work: No
  • What it means: The platform has no active connection from this worker. The worker may be stopped, may have lost network connectivity, or may have crashed.
  • How it gets here:
    • Worker process is stopped or container is shut down
    • Worker's heartbeat was missed for longer than the timeout period (60 seconds), and the platform marked it OFFLINE
    • Worker completed draining and disconnected gracefully
    • Worker was never started after registration
  • Impact on running tests: If a worker goes OFFLINE while executing test endpoints, those endpoints are moved to CLOSED phase with an OSERROR outcome. The test run will report partial results for the endpoints that were on the disconnected worker.

DRAINING

The worker is finishing its currently assigned work but will not accept any new endpoint assignments. This is a transitional state used for graceful shutdown.

  • Visual indicator: Yellow/Orange
  • Accepts new work: No
  • What it means: The worker has been instructed to stop accepting new work. It will continue executing all currently assigned endpoints until they complete their lifecycle. Once all endpoints reach CLOSED phase, the worker itself transitions to OFFLINE.
  • How it gets here:
    • An administrator initiates a drain operation from the Workers page or via the API
    • The worker container receives a graceful shutdown signal (SIGTERM) and enters drain mode
    • The platform initiates draining in preparation for a maintenance operation
  • Behavior during draining:
    • Currently assigned endpoints continue executing normally (registration, call setup, media exchange, teardown)
    • No new test runs will allocate endpoints to this worker
    • Metrics continue to be collected and reported
    • Once all endpoints complete, the worker disconnects and goes OFFLINE

Graceful Shutdown with DRAINING

When you need to take a worker offline for maintenance or updates, use the drain operation rather than stopping the container immediately. Draining ensures that all in-progress endpoints complete their call lifecycle and report metrics, rather than being abruptly terminated and counted as failures.

ERROR

The worker has a connection or health issue that prevents it from accepting work. The platform has detected a problem but the worker has not fully disconnected.

  • Visual indicator: Red
  • Accepts new work: No
  • What it means: The worker's connection is in a degraded state. This may be due to missed heartbeats (but not yet enough to declare OFFLINE), failed health checks, resource exhaustion reports, or authentication issues.
  • How it gets here:
    • Worker missed one or more consecutive heartbeats but has not yet exceeded the OFFLINE timeout
    • Worker reported unhealthy resource utilization (high CPU, memory pressure, disk full)
    • Worker authentication token was revoked or expired while the connection was active
    • Network instability causing intermittent connection issues
  • What happens next: If the issue resolves (heartbeats resume, health checks pass), the worker returns to ONLINE. If the issue persists beyond the timeout, the worker transitions to OFFLINE.
  • Impact on running tests: Endpoints currently assigned to a worker in ERROR state continue executing (the media path is independent of the control connection). However, metric reporting may be delayed or interrupted. If the worker transitions to OFFLINE, in-progress endpoints are moved to CLOSED phase with an OSERROR outcome.
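The four statuses above can be summarized in a small lookup. This is an illustrative sketch, not part of the CallMeter API; the status names and the accepts-new-work rule come directly from the descriptions above.

```python
from enum import Enum

class WorkerStatus(Enum):
    ONLINE = "ONLINE"
    OFFLINE = "OFFLINE"
    DRAINING = "DRAINING"
    ERROR = "ERROR"

# ONLINE is the only status in which a worker is allocated new endpoints.
def accepts_new_work(status: WorkerStatus) -> bool:
    return status is WorkerStatus.ONLINE
```

For example, a DRAINING worker keeps executing its current endpoints but `accepts_new_work(WorkerStatus.DRAINING)` is `False`, so the scheduler skips it.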

Worker Lifecycle

Initial Connection

When a worker starts, it follows this sequence:

  1. Connect: The worker opens a persistent connection to the CallMeter gateway on port 50052
  2. Authenticate: The worker sends a handshake message containing its authentication token
  3. Validate: The gateway validates the token against the worker's registered credentials
  4. Health check: The gateway requests initial resource utilization metrics from the worker
  5. ONLINE: If authentication and health checks pass, the worker is marked ONLINE and becomes available for endpoint allocation

The entire connection sequence typically completes in under 2 seconds on a healthy network.
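The five steps above can be sketched as a small simulation. The function names, the in-memory credential store, and the health thresholds are hypothetical; only the order of checks and the resulting statuses come from the documented sequence.

```python
# Hypothetical simulation of the documented connection sequence.
# Real workers connect to the gateway on port 50052; here each step
# is a pure function so the flow is easy to follow.

REGISTERED_TOKENS = {"cmw_" + "a" * 64}  # hypothetical credential store

def authenticate(token: str) -> bool:
    # Steps 2-3: handshake and token validation against registered credentials.
    return token in REGISTERED_TOKENS

def health_check(cpu_pct: float, mem_pct: float) -> bool:
    # Step 4: initial resource utilization must be healthy (thresholds assumed).
    return cpu_pct < 90.0 and mem_pct < 90.0

def connect_worker(token: str, cpu_pct: float, mem_pct: float) -> str:
    if not authenticate(token):
        # Assumed: a failed handshake leaves no active connection, so OFFLINE.
        return "OFFLINE"
    if not health_check(cpu_pct, mem_pct):
        # An unhealthy resource report puts the connected worker in ERROR.
        return "ERROR"
    # Step 5: both checks passed, so the worker is marked ONLINE.
    return "ONLINE"
```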

Heartbeat Mechanism

Once connected, the worker sends periodic heartbeat messages to the platform:

  • Heartbeat interval: Every 30 seconds
  • Heartbeat content: Timestamp, current resource utilization (CPU, memory), number of active endpoints
  • Missed heartbeat detection: The platform expects a heartbeat within each 30-second window. If a heartbeat is not received, an internal counter increments.
  • ERROR threshold: After one or more consecutive missed heartbeats (more than 30 seconds of silence, but less than the 60-second OFFLINE timeout), the worker may transition to ERROR
  • OFFLINE threshold: After 60 seconds without a heartbeat (two full missed intervals), the worker is marked OFFLINE
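The timing rules above can be expressed as a small classifier. The 30-second interval and 60-second OFFLINE timeout come from the documented mechanism; the function name and the exact ERROR trigger (any silence past one interval) are illustrative.

```python
HEARTBEAT_INTERVAL_S = 30  # worker sends a heartbeat every 30 seconds
OFFLINE_TIMEOUT_S = 60     # two full missed intervals => OFFLINE

def status_for_silence(seconds_since_last_heartbeat: float) -> str:
    """Classify a worker by how long it has been silent.

    Thresholds come from the documented heartbeat mechanism; treating
    any silence beyond one interval as ERROR is an assumption.
    """
    if seconds_since_last_heartbeat >= OFFLINE_TIMEOUT_S:
        return "OFFLINE"
    if seconds_since_last_heartbeat > HEARTBEAT_INTERVAL_S:
        # At least one missed 30-second window, but not yet the 60s timeout.
        return "ERROR"
    return "ONLINE"
```

So a worker that has been silent for 45 seconds shows as ERROR, and it recovers to ONLINE as soon as a heartbeat arrives and resets the clock.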

Graceful Shutdown

The recommended shutdown sequence:

  1. Initiate drain: Send a drain command via the platform UI or API
  2. DRAINING: The worker stops accepting new work
  3. Wait for completion: All in-progress endpoints finish their lifecycle
  4. Disconnect: The worker closes its connection to the gateway
  5. OFFLINE: The platform marks the worker as OFFLINE
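The shutdown sequence above can be sketched as a minimal state machine. The class and method names are hypothetical; the transitions (DRAINING refuses new work, OFFLINE once the last endpoint closes) follow the documented steps.

```python
class Worker:
    """Minimal sketch of the documented drain sequence (illustrative only)."""

    def __init__(self, endpoints):
        self.status = "ONLINE"
        self.endpoints = list(endpoints)  # endpoints still in flight

    def drain(self):
        # Steps 1-2: stop accepting new work.
        self.status = "DRAINING"

    def assign(self, endpoint):
        # Only an ONLINE worker may receive new endpoint assignments.
        if self.status != "ONLINE":
            raise RuntimeError("worker is not accepting new work")
        self.endpoints.append(endpoint)

    def endpoint_closed(self, endpoint):
        # Steps 3-5: once every endpoint reaches CLOSED, the worker
        # disconnects and is marked OFFLINE.
        self.endpoints.remove(endpoint)
        if self.status == "DRAINING" and not self.endpoints:
            self.status = "OFFLINE"
```

The key design point is that draining is passive: the worker never aborts an endpoint, it simply stops taking new ones and lets the in-flight work run to completion.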

Ungraceful Shutdown

When a worker stops unexpectedly (crash, OOM kill, power failure, network partition):

  1. Heartbeats stop: The platform stops receiving heartbeats from the worker
  2. ERROR: After missed heartbeats, the worker may briefly enter ERROR state
  3. OFFLINE: After 60 seconds without any heartbeat, the worker is marked OFFLINE
  4. Endpoint failure: All endpoints that were assigned to the worker and had not yet completed are moved to CLOSED phase with an OSERROR outcome
  5. Test impact: If the worker was the only one assigned to a test run, the run transitions to FAILED
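The failure rules in steps 4-5 can be captured in one small function. The phase and outcome names (CLOSED, OSERROR, FAILED) are taken from the text above; the function itself and its signature are an illustrative sketch, not platform code.

```python
def handle_worker_loss(lost_endpoints, workers_in_run, run_status="RUNNING"):
    """Apply the documented rules when a worker goes OFFLINE mid-run.

    lost_endpoints: the disconnected worker's unfinished endpoints.
    workers_in_run: how many workers the test run was using in total.
    """
    # Every unfinished endpoint on the lost worker is moved to CLOSED
    # phase with an OSERROR outcome.
    results = {ep: ("CLOSED", "OSERROR") for ep in lost_endpoints}
    # If the disconnected worker was the only one in the run, the run fails;
    # otherwise the run continues with partial results.
    if workers_in_run == 1:
        run_status = "FAILED"
    return results, run_status
```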

Cloud Workers vs User-Owned Workers

Both worker types use the same status lifecycle and heartbeat mechanism. The differences are in provisioning and authentication.

Cloud Workers

Cloud workers are managed by the CallMeter platform:

  • Provisioning: Automatically deployed in CallMeter's regions (e.g., US East, EU West)
  • Authentication: Internally managed by the platform
  • Availability: Always available based on your plan's allocated capacity
  • Maintenance: Handled by CallMeter. Workers are drained and updated during maintenance windows.
  • Monitoring: Visible on the Workers page with region labels

User-Owned Workers

User-owned workers are Docker containers you deploy on your own infrastructure:

  • Provisioning: You deploy the worker container on your own servers or cloud instances
  • Authentication: Uses a cmw_ token (68 characters) generated in the CallMeter platform
  • Availability: Depends on your own infrastructure. You are responsible for uptime.
  • Maintenance: You manage updates by pulling new container images
  • Monitoring: Visible on the Workers page alongside cloud workers, labeled as user-owned
  • Use case: Testing internal SIP infrastructure that is not reachable from the public internet

Token Security for User-Owned Workers

The cmw_ token is a sensitive credential that grants the worker the ability to execute SIP endpoints on behalf of your organization. Store it securely. If a token is compromised, regenerate it immediately from the Workers page. The old token will stop working as soon as the new one is generated.
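One way to keep the token out of configuration files is to read it from the environment and sanity-check its shape at startup. The environment variable name CALLMETER_WORKER_TOKEN is a hypothetical convention; the cmw_ prefix and 68-character length come from the description above.

```python
import os

def load_worker_token() -> str:
    """Read the worker token from the environment rather than hardcoding it.

    CALLMETER_WORKER_TOKEN is an assumed variable name; the cmw_ prefix
    and 68-character total length are from the worker token docs.
    """
    token = os.environ.get("CALLMETER_WORKER_TOKEN", "")
    if not token.startswith("cmw_") or len(token) != 68:
        raise ValueError("worker token must start with 'cmw_' and be 68 characters")
    return token
```

Failing fast on a malformed token surfaces configuration mistakes in the container logs immediately, instead of as repeated authentication errors against the gateway.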

Worker Capacity and Allocation

Each worker has a maximum endpoint capacity determined by its available resources (CPU, memory). When a test run is queued, the platform allocates endpoints across available ONLINE workers:

  • Region-based allocation: If the test group specifies a region, only ONLINE workers in that region are considered
  • Worker-specific allocation: If the test group specifies a particular worker, only that worker is used
  • Capacity check: The platform verifies that the total requested endpoints fit within the combined capacity of available workers
  • Distribution: Endpoints are distributed across workers to balance load
  • Insufficient capacity: If no combination of available workers can accommodate the requested endpoints, the test run transitions to CANNOT_RUN_FOR_NOW
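The allocation rules above can be sketched as a simple filter-and-fill pass. This is illustrative only: the worker dict shape is assumed, and the real platform balances load across workers, whereas this sketch fills capacity greedily in order.

```python
def allocate(endpoint_count, workers, region=None, worker_id=None):
    """Sketch of the documented allocation rules (not the real scheduler).

    Each worker is assumed to be a dict like
    {"id": ..., "status": ..., "region": ..., "capacity": ...}.
    Returns a {worker_id: endpoint_count} plan, or "CANNOT_RUN_FOR_NOW".
    """
    # Only ONLINE workers are considered for allocation.
    candidates = [w for w in workers if w["status"] == "ONLINE"]
    # Region-based and worker-specific allocation narrow the candidate set.
    if region is not None:
        candidates = [w for w in candidates if w["region"] == region]
    if worker_id is not None:
        candidates = [w for w in candidates if w["id"] == worker_id]
    # Capacity check: the request must fit within the combined capacity.
    if sum(w["capacity"] for w in candidates) < endpoint_count:
        return "CANNOT_RUN_FOR_NOW"
    # Distribution: greedy fill shown here; the real platform balances load.
    plan, remaining = {}, endpoint_count
    for w in candidates:
        take = min(w["capacity"], remaining)
        if take:
            plan[w["id"]] = take
            remaining -= take
    return plan
```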

Troubleshooting Worker Issues

Worker Stuck in OFFLINE

| Possible Cause | Diagnostic Step | Resolution |
| --- | --- | --- |
| Worker container not running | Check container status on the host machine | Start the worker container |
| Network firewall blocking outbound | Check if port 50052 outbound is open | Open outbound access to the gateway address on port 50052 |
| Invalid authentication token | Check worker container logs for authentication errors | Regenerate the worker token in CallMeter and update the container configuration |
| DNS resolution failure | Check if the worker can resolve the gateway hostname | Verify DNS configuration on the worker host |
| Gateway unreachable | Check network path from worker to gateway | Verify the worker can reach the CallMeter gateway endpoint |

Worker Flapping Between ONLINE and ERROR

| Possible Cause | Diagnostic Step | Resolution |
| --- | --- | --- |
| Network instability | Check for packet loss on the worker's network | Stabilize the network connection or move to a more reliable path |
| Worker under high CPU load | Check CPU utilization on the worker host | Reduce concurrent endpoint load or add more resources |
| Memory pressure | Check memory utilization | Increase container memory limits |
| Gateway congestion | Check if other workers are also flapping | Contact CallMeter support if the issue is widespread |

Worker Goes OFFLINE During Test

When a worker disconnects while a test is running:

  1. Endpoints on that worker are moved to CLOSED phase with an OSERROR outcome
  2. The test run continues with endpoints on other workers (if any)
  3. If the disconnected worker was the only one in the run, the run transitions to FAILED
  4. Check the worker's container logs for crash reasons (OOM, segfault, panic)
  5. Review the test's endpoint count relative to the worker's capacity