Worker Statuses
Worker connection states, lifecycle transitions, heartbeat behavior, and troubleshooting guidance for cloud and user-owned workers in CallMeter.
Workers are the compute units that execute SIP endpoints for CallMeter tests and probes. Each worker maintains a persistent connection to the CallMeter platform and transitions through defined states based on its connection health and operational status. Understanding worker statuses is essential for ensuring test capacity and diagnosing execution failures.
Worker statuses are visible on the Workers page in your project dashboard, in the test run allocation view, and via the API.
Status Reference
ONLINE
The worker is connected to the CallMeter platform, passing health checks, and available to accept new endpoint assignments.
- Visual indicator: Green
- Accepts new work: Yes
- What it means: The worker has an active connection to the platform gateway, is sending heartbeats on schedule, and has reported healthy resource utilization. This is the only status in which a worker will be allocated endpoints for test runs.
- How it gets here:
- Worker starts, connects to the gateway, completes authentication handshake, and passes initial health check
- Worker recovers from ERROR state after reconnecting and passing health checks
- Worker completes draining (finishes assigned work) and remains connected (rare; workers typically go OFFLINE after draining)
OFFLINE
The worker is not connected to the CallMeter platform. It is either shut down, unreachable, or has not yet started.
- Visual indicator: Gray
- Accepts new work: No
- What it means: The platform has no active connection from this worker. The worker may be stopped, may have lost network connectivity, or may have crashed.
- How it gets here:
- Worker process is stopped or container is shut down
- Worker's heartbeat was missed for longer than the timeout period (60 seconds), and the platform marked it OFFLINE
- Worker completed draining and disconnected gracefully
- Worker was never started after registration
- Impact on running tests: If a worker goes OFFLINE while executing test endpoints, those endpoints are moved to CLOSED phase with an OSERROR outcome. The test run will report partial results for the endpoints that were on the disconnected worker.
DRAINING
The worker is finishing its currently assigned work but will not accept any new endpoint assignments. This is a transitional state used for graceful shutdown.
- Visual indicator: Yellow/Orange
- Accepts new work: No
- What it means: The worker has been instructed to stop accepting new work. It will continue executing all currently assigned endpoints until they complete their lifecycle. Once all endpoints reach CLOSED phase, the worker itself transitions to OFFLINE.
- How it gets here:
- An administrator initiates a drain operation from the Workers page or via the API
- The worker container receives a graceful shutdown signal (SIGTERM) and enters drain mode
- The platform initiates draining in preparation for a maintenance operation
- Behavior during draining:
- Currently assigned endpoints continue executing normally (registration, call setup, media exchange, teardown)
- No new test runs will allocate endpoints to this worker
- Metrics continue to be collected and reported
- Once all endpoints complete, the worker disconnects and goes OFFLINE
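The draining behavior above can be sketched as a small state machine. This is an illustrative model only; the status names come from this page, but the class and method names are hypothetical, not part of any CallMeter SDK.

```python
# Sketch of the DRAINING lifecycle: stop accepting work, finish
# assigned endpoints, then disconnect and go OFFLINE.

class Worker:
    def __init__(self):
        self.status = "ONLINE"
        self.active_endpoints = set()

    def assign(self, endpoint_id):
        # Only ONLINE workers accept new endpoint assignments.
        if self.status != "ONLINE":
            raise RuntimeError(f"worker is {self.status}, not accepting work")
        self.active_endpoints.add(endpoint_id)

    def drain(self):
        # Stop accepting new work; current endpoints keep executing.
        self.status = "DRAINING"
        self._maybe_finish()

    def endpoint_closed(self, endpoint_id):
        # Called when an assigned endpoint reaches the CLOSED phase.
        self.active_endpoints.discard(endpoint_id)
        self._maybe_finish()

    def _maybe_finish(self):
        # When draining and no endpoints remain, disconnect -> OFFLINE.
        if self.status == "DRAINING" and not self.active_endpoints:
            self.status = "OFFLINE"
```

Note that `drain()` does not interrupt in-progress endpoints; the transition to OFFLINE happens only once the last endpoint closes.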
Graceful Shutdown with DRAINING
When you need to take a worker offline for maintenance or updates, use the drain operation rather than stopping the container immediately. Draining ensures that all in-progress endpoints complete their call lifecycle and report metrics, rather than being abruptly terminated and counted as failures.
ERROR
The worker has a connection or health issue that prevents it from accepting work. The platform has detected a problem but the worker has not fully disconnected.
- Visual indicator: Red
- Accepts new work: No
- What it means: The worker's connection is in a degraded state. This may be due to missed heartbeats (but not yet enough to declare OFFLINE), failed health checks, resource exhaustion reports, or authentication issues.
- How it gets here:
- Worker missed one or more consecutive heartbeats but has not yet exceeded the OFFLINE timeout
- Worker reported unhealthy resource utilization (high CPU, memory pressure, disk full)
- Worker authentication token was revoked or expired while the connection was active
- Network instability causing intermittent connection issues
- What happens next: If the issue resolves (heartbeats resume, health checks pass), the worker returns to ONLINE. If the issue persists beyond the timeout, the worker transitions to OFFLINE.
- Impact on running tests: Endpoints currently assigned to a worker in ERROR state continue executing (the media path is independent of the control connection). However, metric reporting may be delayed or interrupted. If the worker transitions to OFFLINE, in-progress endpoints are moved to CLOSED phase with an OSERROR outcome.
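The four statuses and the transitions described above can be collapsed into a small table of allowed moves. This is a reference sketch assembled from this page; the platform's internal state model may carry more detail.

```python
# Allowed worker status transitions, as described in the sections above.

ALLOWED_TRANSITIONS = {
    "OFFLINE": {"ONLINE"},                      # connect + auth + health check
    "ONLINE": {"DRAINING", "ERROR", "OFFLINE"}, # drain, degrade, or disconnect
    "DRAINING": {"OFFLINE", "ONLINE"},          # finish work (ONLINE is rare)
    "ERROR": {"ONLINE", "OFFLINE"},             # recover, or timeout exceeded
}

def can_transition(current: str, target: str) -> bool:
    """Return True if the platform permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```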
Worker Lifecycle
Initial Connection
When a worker starts, it follows this sequence:
- Connect: The worker opens a persistent connection to the CallMeter gateway on port 50052
- Authenticate: The worker sends a handshake message containing its authentication token
- Validate: The gateway validates the token against the worker's registered credentials
- Health check: The gateway requests initial resource utilization metrics from the worker
- ONLINE: If authentication and health checks pass, the worker is marked ONLINE and becomes available for endpoint allocation
The entire connection sequence typically completes in under 2 seconds on a healthy network.
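The startup sequence can be sketched as a single function over a gateway interface. The `connect`, `authenticate`, and `health_check` calls are hypothetical stand-ins for the real protocol, not actual CallMeter APIs; only the step order and resulting statuses come from this page.

```python
# Sketch of the worker startup sequence: connect, authenticate,
# health-check, then ONLINE. Gateway methods are hypothetical.

def start_worker(gateway, token: str) -> str:
    conn = gateway.connect(port=50052)         # 1. open persistent connection
    if not gateway.authenticate(conn, token):  # 2-3. handshake + validation
        return "OFFLINE"                       # rejected workers never go ONLINE
    if not gateway.health_check(conn):         # 4. initial resource metrics
        return "ERROR"
    return "ONLINE"                            # 5. available for allocation
```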
Heartbeat Mechanism
Once connected, the worker sends periodic heartbeat messages to the platform:
- Heartbeat interval: Every 30 seconds
- Heartbeat content: Timestamp, current resource utilization (CPU, memory), number of active endpoints
- Missed heartbeat detection: The platform expects a heartbeat within each 30-second window. If a heartbeat is not received, an internal counter increments.
- ERROR threshold: After at least one missed heartbeat (but before the 60-second OFFLINE timeout elapses), the worker may transition to ERROR
- OFFLINE threshold: After 60 seconds without a heartbeat (two full missed intervals), the worker is marked OFFLINE
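The threshold logic above can be expressed as a simple mapping from heartbeat silence to status. The 30-second interval and 60-second timeout come from this page; the function itself is an illustrative sketch of the platform-side evaluation, not its actual implementation.

```python
# Sketch of platform-side heartbeat evaluation.

HEARTBEAT_INTERVAL = 30  # seconds between expected heartbeats
OFFLINE_TIMEOUT = 60     # seconds of silence before OFFLINE

def status_for_silence(seconds_since_last_heartbeat: float) -> str:
    if seconds_since_last_heartbeat < HEARTBEAT_INTERVAL:
        return "ONLINE"    # heartbeat arrived within its window
    if seconds_since_last_heartbeat < OFFLINE_TIMEOUT:
        return "ERROR"     # at least one heartbeat missed
    return "OFFLINE"       # two full intervals missed
```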
Graceful Shutdown
The recommended shutdown sequence:
- Initiate drain: Send a drain command via the platform UI or API
- DRAINING: The worker stops accepting new work
- Wait for completion: All in-progress endpoints finish their lifecycle
- Disconnect: The worker closes its connection to the gateway
- OFFLINE: The platform marks the worker as OFFLINE
Ungraceful Shutdown
When a worker stops unexpectedly (crash, OOM kill, power failure, network partition):
- Heartbeats stop: The platform stops receiving heartbeats from the worker
- ERROR: After missed heartbeats, the worker may briefly enter ERROR state
- OFFLINE: After 60 seconds without any heartbeat, the worker is marked OFFLINE
- Endpoint failure: All endpoints that were assigned to the worker and had not yet completed are moved to CLOSED phase with an OSERROR outcome
- Test impact: If the worker was the only one assigned to a test run, the run transitions to FAILED
Cloud Workers vs User-Owned Workers
Both worker types use the same status lifecycle and heartbeat mechanism. The differences are in provisioning and authentication.
Cloud Workers
Cloud workers are managed by the CallMeter platform:
- Provisioning: Automatically deployed in CallMeter's regions (e.g., US East, EU West)
- Authentication: Internally managed by the platform
- Availability: Always available, up to the capacity allocated by your plan
- Maintenance: Handled by CallMeter. Workers are drained and updated during maintenance windows.
- Monitoring: Visible on the Workers page with region labels
User-Owned Workers
User-owned workers are Docker containers you deploy on your own infrastructure:
- Provisioning: You deploy the worker container on your own servers or cloud instances
- Authentication: Uses a cmw_ token (68 characters) generated in the CallMeter platform
- Availability: Depends on your own infrastructure. You are responsible for uptime.
- Maintenance: You manage updates by pulling new container images
- Monitoring: Visible on the Workers page alongside cloud workers, labeled as user-owned
- Use case: Testing internal SIP infrastructure that is not reachable from the public internet
Token Security for User-Owned Workers
The cmw_ token is a sensitive credential that grants the worker the ability to execute SIP endpoints on behalf of your organization. Store it securely. If a token is compromised, regenerate it immediately from the Workers page. The old token will stop working as soon as the new one is generated.
Worker Capacity and Allocation
Each worker has a maximum endpoint capacity determined by its available resources (CPU, memory). When a test run is queued, the platform allocates endpoints across available ONLINE workers:
- Region-based allocation: If the test group specifies a region, only ONLINE workers in that region are considered
- Worker-specific allocation: If the test group specifies a particular worker, only that worker is used
- Capacity check: The platform verifies that the total requested endpoints fit within the combined capacity of available workers
- Distribution: Endpoints are distributed across workers to balance load
- Insufficient capacity: If no combination of available workers can accommodate the requested endpoints, the test run transitions to CANNOT_RUN_FOR_NOW
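The allocation steps above can be sketched as a single function: filter ONLINE workers (optionally by region), check combined free capacity, then spread endpoints across workers. The data shapes and field names here are hypothetical; only the behavior (ONLINE-only filtering, the capacity check, and the CANNOT_RUN_FOR_NOW outcome) comes from this page.

```python
# Sketch of endpoint allocation across ONLINE workers.

def allocate(workers, requested, region=None):
    eligible = [w for w in workers
                if w["status"] == "ONLINE"
                and (region is None or w["region"] == region)]
    free = {w["id"]: w["capacity"] - w["active"] for w in eligible}
    # Capacity check: the request must fit in the combined free capacity.
    if sum(free.values()) < requested:
        return "CANNOT_RUN_FOR_NOW", {}
    # Round-robin distribution to balance load across workers.
    plan = {wid: 0 for wid in free}
    order = sorted(free, key=lambda wid: -free[wid])
    remaining = requested
    while remaining:
        for wid in order:
            if remaining == 0:
                break
            if plan[wid] < free[wid]:
                plan[wid] += 1
                remaining -= 1
    return "ALLOCATED", plan
```

A worker-specific allocation is the degenerate case: pass a `workers` list containing only the requested worker.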
Troubleshooting Worker Issues
Worker Stuck in OFFLINE
| Possible Cause | Diagnostic Step | Resolution |
|---|---|---|
| Worker container not running | Check container status on the host machine | Start the worker container |
| Network firewall blocking outbound | Check if port 50052 outbound is open | Open outbound access to the gateway address on port 50052 |
| Invalid authentication token | Check worker container logs for authentication errors | Regenerate the worker token in CallMeter and update the container configuration |
| DNS resolution failure | Check if the worker can resolve the gateway hostname | Verify DNS configuration on the worker host |
| Gateway unreachable | Check network path from worker to gateway | Verify the worker can reach the CallMeter gateway endpoint |
Worker Flapping Between ONLINE and ERROR
| Possible Cause | Diagnostic Step | Resolution |
|---|---|---|
| Network instability | Check for packet loss on the worker's network | Stabilize the network connection or move to a more reliable path |
| Worker under high CPU load | Check CPU utilization on the worker host | Reduce concurrent endpoint load or add more resources |
| Memory pressure | Check memory utilization | Increase container memory limits |
| Gateway congestion | Check if other workers are also flapping | Contact CallMeter support if the issue is widespread |
Worker Goes OFFLINE During Test
When a worker disconnects while a test is running:
- Endpoints on that worker are moved to CLOSED phase with an OSERROR outcome
- The test run continues with endpoints on other workers (if any)
- If the disconnected worker was the only one in the run, the run transitions to FAILED
- Check the worker's container logs for crash reasons (OOM, segfault, panic)
- Review the test's endpoint count relative to the worker's capacity
Related Pages
- Workers Overview -- Worker concepts and types
- Cloud Workers -- Platform-managed workers
- Deploying Your Own Workers -- User-owned worker setup
- Worker Configuration -- Configuration options
- Test Run Statuses -- How worker issues affect test runs
- Common Test Failures -- Diagnosing worker-related failures