Probe Health States
How CallMeter evaluates probe health, the four health states, threshold evaluation logic, transition rules, webhook triggers, and best practices for threshold configuration.
Probes are scheduled monitoring tests that run continuously to assess the health of your SIP infrastructure. After each probe execution, CallMeter evaluates the collected metrics against your configured thresholds and assigns a health state. This page documents the four health states, the evaluation logic, and best practices for threshold configuration.
The Four Health States
HEALTHY
All metrics from the most recent probe execution fall within acceptable ranges. The SIP infrastructure under test is performing as expected.
- Visual indicator: Green
- What it means: Every configured threshold metric is within its "good" range. MOS is above the warning threshold. Jitter, RTT, and packet loss are below their warning thresholds.
- Action required: None. The probe will continue executing at its configured interval.
- Webhook behavior: If the probe was previously DEGRADED or UNHEALTHY, a recovery webhook is sent on the transition to HEALTHY.
DEGRADED
One or more metrics from the most recent probe execution have breached the warning threshold but remain within the critical threshold. Quality is below optimal but has not reached an emergency level.
- Visual indicator: Yellow/Orange
- What it means: The SIP infrastructure is showing signs of stress. For example, jitter may have risen above 30ms (warning) but remains below 80ms (critical), or MOS may have dropped below 3.8 (warning) but remains above 3.0 (critical).
- Action required: Investigate the contributing metrics. DEGRADED is an early warning that may precede a full outage.
- Webhook behavior: A webhook is sent when the probe transitions from HEALTHY to DEGRADED or from UNHEALTHY to DEGRADED.
- Common causes:
  - Network congestion causing moderate jitter or packet loss increases
  - SIP server load approaching capacity
  - DNS resolution delays increasing RTT
  - Background maintenance on network equipment
UNHEALTHY
One or more metrics from the most recent probe execution have breached the critical threshold. The SIP infrastructure has a significant quality or availability problem.
- Visual indicator: Red
- What it means: At least one metric is in the critical range. For example, packet loss may exceed 5%, MOS may be below 3.0, or the call may have failed entirely.
- Action required: Immediate investigation. The SIP infrastructure is experiencing a service-impacting issue.
- Webhook behavior: A webhook is sent when the probe transitions to UNHEALTHY from any other state.
- Common causes:
  - Network outage or severe congestion
  - SIP server down or unreachable
  - Authentication failure (credentials changed, account locked)
  - Firewall rule change blocking SIP or RTP traffic
  - Total call failure (SIP registration or call setup failed)
UNKNOWN
The probe has insufficient data to determine health, or an evaluation error occurred. This is the initial state for new probes and the fallback state when something prevents health evaluation.
- Visual indicator: Gray
- What it means: One of several situations:
  - The probe has never executed (newly created)
  - The most recent execution failed before metrics could be collected
  - The probe configuration is incomplete (no thresholds configured)
  - An internal evaluation error prevented health assessment
- Action required: Check the probe configuration and ensure it has executed at least once. If the probe has executed, check the run detail for errors.
- Webhook behavior: A webhook is sent when a previously known state transitions to UNKNOWN.
Threshold Evaluation Logic
CallMeter evaluates probe health immediately after each probe execution completes. The evaluation follows a strict process.
Step 1: Collect Final Metrics
After the probe's call completes, the platform retrieves the aggregate metric values from the execution. These are the average values over the call duration for time-series metrics (jitter, RTT, packet loss, MOS) and the final values for cumulative metrics.
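The aggregation described above can be sketched as follows. This is an illustrative sketch only, not CallMeter's implementation: the function name `aggregate_metrics`, the metric names, and the idea that samples arrive as plain lists are all assumptions made for the example.

```python
from statistics import fmean


def aggregate_metrics(
    time_series: dict[str, list[float]],
    cumulative: dict[str, list[float]],
) -> dict[str, float]:
    """Reduce per-call samples to one value per metric (hypothetical helper)."""
    # Time-series metrics (jitter, RTT, packet loss, MOS): average over the call.
    result = {name: fmean(samples) for name, samples in time_series.items() if samples}
    # Cumulative metrics: keep only the final observed value.
    result.update({name: samples[-1] for name, samples in cumulative.items() if samples})
    return result


# Hypothetical sample data for one probe call.
final = aggregate_metrics(
    {"jitter_ms": [10.0, 20.0, 30.0]},
    {"packets_sent": [100.0, 200.0, 300.0]},
)
```

Metrics with no samples are simply left out of the result in this sketch, which leaves it to the caller to decide how to treat missing data.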
Step 2: Evaluate Each Threshold
Each metric with a configured threshold is compared against its warning and critical values. CallMeter handles two types of metrics:
Lower-is-better metrics (jitter, RTT, packet loss):
- Value below the warning threshold: this metric is Healthy
- Value at or above the warning threshold but below the critical threshold: this metric is Degraded
- Value at or above the critical threshold: this metric is Unhealthy
Higher-is-better metrics (MOS):
- Value above the warning threshold: this metric is Healthy
- Value at or below the warning threshold but above the critical threshold: this metric is Degraded
- Value at or below the critical threshold: this metric is Unhealthy
Inverted Threshold Logic
For metrics like MOS, where higher is better, the warning threshold is a higher number than the critical threshold. For example, a MOS warning of 3.8 and critical of 3.0 means: above 3.8 is healthy, above 3.0 and up to 3.8 is degraded, and 3.0 or below is unhealthy. This is the opposite direction from jitter or loss thresholds.
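The comparison rules for both metric directions can be expressed in a few lines. The sketch below follows the boundary rules listed in Step 2; the names `Health` and `evaluate_metric` and the `higher_is_better` flag are hypothetical and chosen only for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def evaluate_metric(
    value: float, warning: float, critical: float, higher_is_better: bool = False
) -> Health:
    """Classify one metric value against its warning and critical thresholds."""
    if higher_is_better:
        # MOS-style metrics: a value at or below a threshold breaches it.
        if value <= critical:
            return Health.UNHEALTHY
        if value <= warning:
            return Health.DEGRADED
        return Health.HEALTHY
    # Jitter, RTT, packet loss: a value at or above a threshold breaches it.
    if value >= critical:
        return Health.UNHEALTHY
    if value >= warning:
        return Health.DEGRADED
    return Health.HEALTHY


mos_state = evaluate_metric(3.5, warning=3.8, critical=3.0, higher_is_better=True)
jitter_state = evaluate_metric(25.0, warning=30.0, critical=80.0)
```

With these inputs, the MOS value of 3.5 falls in the degraded band and the jitter value of 25 ms is healthy.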
Step 3: Determine Overall Health
The probe's overall health state is determined by the worst individual metric evaluation. If any single metric is UNHEALTHY, the probe is UNHEALTHY. If no metric is UNHEALTHY but any metric is DEGRADED, the probe is DEGRADED. Only if all evaluated metrics are within healthy ranges does the probe report HEALTHY.
This "worst metric wins" approach ensures that a single degraded dimension is never hidden by otherwise healthy metrics.
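A minimal sketch of the "worst metric wins" rule, assuming each metric has already been classified individually. The `SEVERITY` ranking and the `overall_health` function are hypothetical names; returning UNKNOWN for an empty input mirrors the "no thresholds configured" case described earlier on this page.

```python
# Higher number means worse state (assumed ordering for this sketch).
SEVERITY = {"HEALTHY": 0, "DEGRADED": 1, "UNHEALTHY": 2}


def overall_health(metric_states: dict[str, str]) -> str:
    """Return the worst per-metric state, or UNKNOWN if nothing was evaluated."""
    if not metric_states:
        return "UNKNOWN"
    return max(metric_states.values(), key=SEVERITY.__getitem__)


state = overall_health(
    {"mos": "HEALTHY", "jitter": "DEGRADED", "packet_loss": "HEALTHY"}
)
```

Here a single degraded metric (jitter) is enough to make the overall state DEGRADED, even though the other two metrics are healthy.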
Step 4: Check for State Transition
The newly determined health state is compared against the probe's previous state. If the state has changed, the transition is recorded, the probe's current status is updated, and any configured webhooks are triggered.
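The transition check can be pictured as a compare-and-update step. The sketch below is an assumption about the shape of that logic, not CallMeter's code: the `Probe` class, its fields, and the `notify` callback standing in for webhook delivery are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Probe:
    name: str
    status: str = "UNKNOWN"  # new probes start in UNKNOWN
    transitions: list[tuple[str, str]] = field(default_factory=list)


def apply_evaluation(
    probe: Probe, new_status: str, notify: Callable[[str, str, str], None]
) -> bool:
    """Record a transition and notify only when the state actually changed."""
    previous = probe.status
    if new_status == previous:
        return False  # no transition, so no webhook
    probe.transitions.append((previous, new_status))
    probe.status = new_status
    notify(probe.name, previous, new_status)  # stand-in for webhook delivery
    return True


sent: list[tuple[str, str, str]] = []
probe = Probe(name="example-probe")
apply_evaluation(probe, "HEALTHY", lambda *args: sent.append(args))
apply_evaluation(probe, "HEALTHY", lambda *args: sent.append(args))  # unchanged
```

Only the first call produces a notification, because the second evaluation yields the same state as before.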
Status Transitions and Webhooks
Any transition between health states triggers configured webhooks. The possible transitions are:
| From | To | Severity | Meaning |
|---|---|---|---|
| HEALTHY | DEGRADED | Warning | Quality declining, early warning |
| HEALTHY | UNHEALTHY | Critical | Sudden quality failure |
| DEGRADED | UNHEALTHY | Critical | Quality continuing to decline |
| DEGRADED | HEALTHY | Recovery | Quality restored from warning state |
| UNHEALTHY | DEGRADED | Partial recovery | Critical issue partially resolved |
| UNHEALTHY | HEALTHY | Full recovery | Critical issue fully resolved |
| Any | UNKNOWN | Data issue | Cannot evaluate health |
| UNKNOWN | Any | Resolution | Health evaluation restored |
HEALTHY to UNHEALTHY Is Possible
A probe can jump directly from HEALTHY to UNHEALTHY in a single execution. This happens when a sudden, severe issue occurs (e.g., the SIP server goes down or a firewall blocks all traffic). There is no requirement to pass through DEGRADED first.
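For webhook consumers that want to route alerts by severity, the transition table above can be encoded as a lookup. This is a sketch on the receiving side; the function name `transition_severity` is hypothetical, and the severity labels are copied from the table.

```python
from typing import Optional

# Transitions between the three known states, as listed in the table.
KNOWN_TRANSITIONS = {
    ("HEALTHY", "DEGRADED"): "Warning",
    ("HEALTHY", "UNHEALTHY"): "Critical",
    ("DEGRADED", "UNHEALTHY"): "Critical",
    ("DEGRADED", "HEALTHY"): "Recovery",
    ("UNHEALTHY", "DEGRADED"): "Partial recovery",
    ("UNHEALTHY", "HEALTHY"): "Full recovery",
}


def transition_severity(previous: str, new: str) -> Optional[str]:
    """Map a state change to its severity label; None means no transition."""
    if previous == new:
        return None
    if new == "UNKNOWN":
        return "Data issue"  # Any -> UNKNOWN
    if previous == "UNKNOWN":
        return "Resolution"  # UNKNOWN -> Any
    return KNOWN_TRANSITIONS[(previous, new)]


severity = transition_severity("HEALTHY", "UNHEALTHY")
```

The direct HEALTHY to UNHEALTHY jump is an ordinary entry in the table and maps to "Critical".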
Webhook Payload
When a status transition occurs, CallMeter sends an HTTP POST to the configured webhook URL with a JSON payload containing the probe ID, probe name, previous status, new status, timestamp, and the metric values that triggered the transition. See Webhooks for the full payload format and security details.
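To give a rough idea of what such a payload might look like, the sketch below assembles the listed pieces of information into JSON. The field names (`probe_id`, `previous_status`, and so on), the timestamp format, and the example values are assumptions made for illustration; the Webhooks page remains the authoritative source for the actual format.

```python
import json
from datetime import datetime, timezone


def build_transition_payload(
    probe_id: str,
    probe_name: str,
    previous_status: str,
    new_status: str,
    metrics: dict[str, float],
) -> dict:
    """Assemble an illustrative payload; all key names are hypothetical."""
    return {
        "probe_id": probe_id,
        "probe_name": probe_name,
        "previous_status": previous_status,
        "new_status": new_status,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,  # the values that triggered the transition
    }


payload = build_transition_payload(
    "probe-123", "Example trunk", "HEALTHY", "DEGRADED", {"jitter_ms": 42.0}
)
body = json.dumps(payload)  # what an HTTP POST body could contain
```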
Consecutive Failure Behavior
Each probe execution independently evaluates health. If a probe alternates between HEALTHY and DEGRADED across successive executions, each transition triggers a webhook. This can produce noise for borderline thresholds.
To reduce noise from flapping thresholds:
- Increase threshold margins: Set the warning threshold farther from your normal operating values, and keep the warning and critical thresholds well apart, so that minor fluctuations do not cause transitions
- Adjust probe interval: A longer interval (30 or 60 minutes) reduces the frequency of evaluations and therefore the frequency of potential transitions
- Use appropriate metric windows: The evaluation uses average metric values over the call duration, which naturally smooths short-term spikes
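The effect of a borderline threshold can be checked against historical results before changing a probe. The sketch below counts HEALTHY/DEGRADED transitions for a series of per-execution jitter averages; the function name `count_transitions` and the sample values are hypothetical.

```python
def count_transitions(values: list[float], warning: float) -> int:
    """Count state changes for a lower-is-better metric across executions."""
    states = ["DEGRADED" if v >= warning else "HEALTHY" for v in values]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)


# Hypothetical average jitter (ms) from six successive probe executions.
jitter_ms = [28.0, 31.0, 29.0, 32.0, 27.0, 30.0]

flapping = count_transitions(jitter_ms, warning=30.0)  # threshold sits on the baseline
stable = count_transitions(jitter_ms, warning=40.0)    # threshold has headroom
```

With the warning threshold at 30 ms, every execution in this sample flips the state (5 transitions, so 5 webhooks); at 40 ms the same data produces none.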
Status Pages
Probe health states power CallMeter's public status pages. When you enable a status page for a probe, the current health state is displayed publicly:
| Health State | Status Page Display |
|---|---|
| HEALTHY | Operational (green) |
| DEGRADED | Degraded Performance (yellow) |
| UNHEALTHY | Major Outage (red) |
| UNKNOWN | Under Maintenance (gray) |
Status pages update automatically on each health state transition. See Status Pages for configuration details.
Threshold Configuration Best Practices
Start with Industry Baselines
If you are unsure what thresholds to configure, start with industry-standard baselines and adjust based on your environment:
| Metric | Warning Threshold | Critical Threshold | Notes |
|---|---|---|---|
| MOS | 3.8 | 3.0 | MOS below 3.0 indicates poor quality |
| Jitter | 30 ms | 80 ms | Above 80ms significantly impacts audio |
| Packet Loss | 1% | 5% | Above 5% makes conversation difficult |
| RTT | 150 ms | 300 ms | Above 300ms causes noticeable delay |
Calibrate to Your Baseline
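Run several probe executions without thresholds to establish your environment's normal metric ranges. For lower-is-better metrics (jitter, RTT, packet loss), set warning thresholds at 1.5 to 2 times your normal values and critical thresholds at 3 to 4 times your normal values. For MOS, where higher is better, these multipliers do not apply; set the warning and critical thresholds below your normal value instead. This approach catches genuine degradation without alerting on normal variation.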
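As a worked example of the multiplier approach, the sketch below derives candidate thresholds from a handful of baseline measurements. It applies only to lower-is-better metrics such as jitter, RTT, and packet loss. The function name `suggest_thresholds`, the default factors (1.5 and 3.0, the low end of the suggested ranges), and the sample values are assumptions for illustration.

```python
from statistics import fmean


def suggest_thresholds(
    baseline_samples: list[float],
    warning_factor: float = 1.5,
    critical_factor: float = 3.0,
) -> dict[str, float]:
    """Scale the observed baseline into warning and critical thresholds.

    Intended for lower-is-better metrics only (not MOS).
    """
    baseline = fmean(baseline_samples)
    return {
        "baseline": baseline,
        "warning": baseline * warning_factor,
        "critical": baseline * critical_factor,
    }


# Hypothetical jitter averages (ms) from four threshold-free executions.
suggestion = suggest_thresholds([10.0, 12.0, 8.0, 10.0])
```

For a baseline of 10 ms, this yields a warning threshold of 15 ms and a critical threshold of 30 ms, which can then be rounded or adjusted by hand.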
Separate Thresholds by Route
Different SIP routes have different baseline quality characteristics. A probe monitoring a local data center path will have lower normal jitter than a probe monitoring an international route. Configure thresholds per probe based on the expected quality of each monitored path.
Avoid Over-Monitoring
Configuring thresholds on too many metrics can cause false alerts because the "worst metric wins" evaluation becomes more sensitive with more metrics. Focus thresholds on the 3 to 5 metrics most relevant to your use case:
- Voice quality monitoring: MOS, jitter, packet loss
- Network path monitoring: RTT, packet loss, jitter
- Capacity monitoring: Registration success, call setup time
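A back-of-the-envelope calculation shows why more thresholds mean more false alerts. The sketch assumes, purely for illustration, that each metric independently has the same small chance of crossing its threshold by normal variation in a given execution; real metrics are often correlated, so treat the numbers as a rough intuition rather than a prediction.

```python
def false_alert_probability(per_metric_rate: float, metric_count: int) -> float:
    """Chance that at least one of N independent metrics breaches by chance."""
    return 1.0 - (1.0 - per_metric_rate) ** metric_count


# Hypothetical 2% per-execution chance of a spurious breach on each metric.
three_metrics = false_alert_probability(0.02, 3)
ten_metrics = false_alert_probability(0.02, 10)
```

Under these assumptions, three metrics give roughly a 6% chance of a spurious non-healthy result per execution, while ten metrics push it to about 18%.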
Review and Adjust
Revisit threshold configuration monthly. As your SIP infrastructure evolves (new routes, capacity changes, codec updates), your baseline quality will shift. Adjust thresholds to match the new reality.
Related Pages
- Creating a Probe -- How to set up a probe
- Threshold Configuration -- Detailed threshold setup
- Webhooks -- Webhook configuration and payload format
- Status Pages -- Public health dashboards
- Test Run Statuses -- Run lifecycle for probe executions