product: maestro audience: test-developer, operator, ai-assistant authority: normative
MCP Server — Statistics Tools
The statistics tools let an AI assistant (or a human operator via the AI chat interface) answer yield, quality, and traceability questions about test data without writing SQL or opening a dashboard.
Architecture overview
AI assistant
│ get_yield / get_measurement_statistics / …
▼
WorkflowEngine.McpServer ──HTTP──▶ WorkflowEngine.Api
│ GET /api/statistics/*
▼
stat_cache table (PostgreSQL)
▲
StatisticsBackgroundService
(WorkflowEngine.Orchestra, 60 s poll)
│ reads
TestExecutions / StepExecutions /
MeasurementValues tables
Hot path: The API controller checks the stat_cache table first (valid for 60 minutes).
If the cache is warm it returns in milliseconds. If it is stale or absent it falls back to
a live query over the execution tables.
Cache writer: StatisticsBackgroundService runs in Orchestra, polls every 60 seconds,
and writes pre-computed yield and Cp/Cpk rows for every test definition whose executions
changed since the last run. It only processes executions that completed more than 5 minutes
ago to avoid racing with in-flight writes.
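The cache-first read path described above can be sketched as follows. This is a minimal illustration in Python, not the controller's actual code: `read_statistics`, the `computed_at` and `payload` keys, and the `live_query` callable are hypothetical names, and only the 60-minute validity window comes from this document.

```python
from datetime import datetime, timedelta, timezone

CACHE_VALIDITY = timedelta(minutes=60)  # validity window stated in this document


def read_statistics(cache_row, now, live_query):
    """Return (source, payload), preferring a fresh cache row.

    cache_row: None, or a dict with 'computed_at' (aware datetime) and 'payload'.
    live_query: zero-argument callable standing in for the live fallback query.
    Both shapes are hypothetical and chosen only for this sketch.
    """
    if cache_row is not None and now - cache_row["computed_at"] < CACHE_VALIDITY:
        return "cache", cache_row["payload"]
    return "live", live_query()


now = datetime(2025, 6, 4, 12, 0, tzinfo=timezone.utc)
fresh = {"computed_at": now - timedelta(minutes=10), "payload": {"yieldPercent": 97.5}}
stale = {"computed_at": now - timedelta(minutes=61), "payload": {"yieldPercent": 90.0}}


def live():
    # Stand-in for a live query over the execution tables.
    return {"yieldPercent": 96.0}
```

A fresh row is served from the cache, while a stale or missing row triggers the fallback callable.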
Tools reference
See tools-reference.md for the full parameter table. Quick summary:
| Tool | Question it answers |
|---|---|
| list_measurement_names | "What measurement points are available on this station?" |
| get_capability_summary | "Which measurements have poor Cpk across the whole programme?" |
| get_yield | "What is the pass rate for FinalTest this week?" |
| get_measurement_statistics | "Is the OutputVoltage measurement process capable (Cp/Cpk ≥ 1.33)?" |
| get_yield_trend | "How has yield trended over the last 3 months?" |
| get_worst_steps | "Which steps fail most often?" |
| get_unit_history | "What tests has serial number SN-1042 been through?" |
Yield (get_yield)
Returns pass rate for a test or a single step within a test.
get_yield(testName="FinalTest")
get_yield(testName="FinalTest", stepName="RF Calibration", from="2025-01-01T00:00:00Z")
get_yield(testName="FinalTest", stationId="station-A")
Response fields:
| Field | Type | Description |
|---|---|---|
| testName | string | Echoed from request |
| stepName | string? | Null for test-level yield |
| stationId | string? | Null for fleet-wide aggregate |
| totalRuns | int | Total executions in the date range |
| passCount | int | Executions with verdict PASS |
| failCount | int | totalRuns - passCount |
| yieldPercent | double | passCount / totalRuns × 100, rounded to 2 d.p. |
| from / to | DateTimeOffset? | Echoed from request |
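The arithmetic behind the count and percentage fields can be illustrated with a short Python sketch. The function name `yield_summary` and the plain list of verdict strings are hypothetical, and returning 0.0 for an empty range is an assumption: this document does not say what the API returns for yieldPercent when totalRuns is 0.

```python
def yield_summary(verdicts):
    """Compute yield fields from a list of verdict strings such as "PASS"."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v == "PASS")
    # Assumption: report 0.0 for an empty range instead of dividing by zero.
    percent = round(passed / total * 100, 2) if total else 0.0
    return {
        "totalRuns": total,
        "passCount": passed,
        "failCount": total - passed,
        "yieldPercent": percent,
    }


summary = yield_summary(["PASS", "PASS", "FAIL"])
```

Two passes out of three runs round to a yieldPercent of 66.67.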
Measurement name discovery (list_measurement_names)
Before querying statistics you need the exact measurementName as it is registered in
the test definition. Use list_measurement_names to discover what names exist.
Results are deduplicated by name — each distinct name appears once.
list_measurement_names() # all names (up to 50)
list_measurement_names(search="VOUT") # all names containing "VOUT"
list_measurement_names(search="resistance", limit=100)
Response fields per entry:
| Field | Description |
|---|---|
| name | Exact name to pass to get_measurement_statistics |
| lowerLimit | Lower specification limit (indicative: taken from one of potentially several versions) |
| upperLimit | Upper specification limit (indicative) |
| units | Engineering unit, e.g. mV, Ω, °C |
| definitionVersions | Number of distinct specification versions for this name |
| sampleCount | Pre-computed sample count from stat_cache; null if no executions yet |
| cp | Pre-computed Cp from stat_cache; null if no limits, σ = 0, or no samples |
| cpk | Pre-computed Cpk from stat_cache; null if no limits, σ = 0, or no samples |
The sampleCount, cp, and cpk fields come from the stat_cache and allow callers to
identify poor-capability measurements without making N follow-up calls. They are null
for measurements with no executions yet. For measurements with definitionVersions > 1
these values reflect the most-recently-computed cache entry rather than pooled statistics
— always follow up with get_measurement_statistics before acting on Cp/Cpk in that case.
To find all measurements below a Cpk threshold in one call, use
get_capability_summary instead.
What definitionVersions tells you:
Every time the low_limit, high_limit, or unit fields are changed in the test YAML
and the package is redeployed, a new definition row is inserted with the same name but a
different content hash. definitionVersions counts those rows.
- definitionVersions = 1: the specification has never changed. Statistics are straightforward: all samples were recorded against the same limits, and Cp/Cpk is directly comparable across the entire date range.
- definitionVersions > 1: the specification was changed at least once. This has consequences for how statistics should be interpreted. Read the "Caveat: specification limit changes split the population" section before proceeding.
Capability summary (get_capability_summary)
Returns all measurement points whose Cpk is below a given threshold, sorted worst-first.
Served entirely from the pre-computed stat_cache — one call replaces the N+1 pattern
of calling list_measurement_names followed by get_measurement_statistics for every
entry when scanning a large population for quality problems.
get_capability_summary() # all with Cpk < 1.33
get_capability_summary(maxCpk=1.00) # only actively defective
get_capability_summary(maxCpk=1.33, testName="FinalTest") # scoped to one test
get_capability_summary(maxCpk=1.67, stationId="station-A") # scoped to one station
When to use vs get_measurement_statistics:
| Use case | Tool |
|---|---|
| "Which measurements are poorly capable?" (fleet scan) | get_capability_summary |
| "Is VOUT_5V capable? Give me full detail." | get_measurement_statistics |
| "What measurements exist?" | list_measurement_names |
Response fields per item:
| Field | Description |
|---|---|
| measurementName | Measurement point name |
| testName | Test definition this entry came from |
| stationId | Station; null for fleet-wide cache entries |
| sampleCount | Number of numeric samples in the cache row |
| mean / stdDev / min / max | Descriptive statistics |
| lowerLimit / upperLimit / units | Specification from the cache entry |
| cp / cpk | Capability indices (always non-null in this response; entries with null Cpk are excluded) |
| definitionVersions | > 1 signals that limits changed; follow up with get_measurement_statistics for pooled analysis |
| computedAt | UTC timestamp when this cache row was last written |
Important: When a measurement name appears in multiple tests, only the row with the
lowest Cpk (worst case) is returned. The testName field identifies which test produced it.
definitionVersions caveat: The stat_cache stores per-definition statistics rather
than pooled. For definitionVersions > 1, the Cp/Cpk here is for one definition version
only. Use get_measurement_statistics (which pools all versions) for the authoritative
figure before concluding the process is or is not capable.
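The worst-case selection described above (one row per measurement name, lowest Cpk wins, sorted worst-first) can be sketched like this. It is an illustration under assumptions, not the server's query: the function name `worst_case_summary` and the reduced row shape are hypothetical.

```python
def worst_case_summary(rows, max_cpk=1.33):
    """Keep the lowest-Cpk row per measurement name, sorted worst-first."""
    worst = {}
    for row in rows:
        cpk = row["cpk"]
        if cpk is None or cpk >= max_cpk:
            continue  # rows with null Cpk or at/above the threshold are excluded
        name = row["measurementName"]
        if name not in worst or cpk < worst[name]["cpk"]:
            worst[name] = row
    return sorted(worst.values(), key=lambda r: r["cpk"])


rows = [
    {"measurementName": "VOUT_5V", "testName": "FinalTest", "cpk": 1.10},
    {"measurementName": "VOUT_5V", "testName": "PowerSupplyTest", "cpk": 0.96},
    {"measurementName": "ContactResistance_Ohm", "testName": "FinalTest", "cpk": 1.50},
    {"measurementName": "ContactResistance_Ohm", "testName": "PowerSupplyTest", "cpk": None},
    {"measurementName": "OutputVoltage_mV", "testName": "PowerSupplyTest", "cpk": 1.20},
]
summary = worst_case_summary(rows)
```

In this sample, VOUT_5V is reported once, attributed to the test where its Cpk is lowest.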
Measurement statistics (get_measurement_statistics)
Returns descriptive statistics and process capability indices for a named numeric measurement point. Non-numeric measurement points (boolean, string) are ignored.
Samples are always pooled across all specification versions of the name. This means
sampleCount always reflects the complete historical population — there is no silent
truncation when limits changed.
Typical workflow:
- Call list_measurement_names(search="...") to find the exact name and check definitionVersions.
- If definitionVersions > 1, read the caveat below before interpreting results.
- Call get_measurement_statistics(measurementName="...") to get statistics.
- If limitsNote is non-null, show it to the user verbatim before presenting Cp/Cpk.
- Optionally narrow the sample set with testName, stationId, from, to.
get_measurement_statistics(measurementName="OutputVoltage_mV")
get_measurement_statistics(measurementName="OutputVoltage_mV", testName="PowerSupplyTest", from="2025-06-01T00:00:00Z")
get_measurement_statistics(measurementName="VOUT_5V", stationId="station-A")
Cp and Cpk explained
Process capability indices answer the question: "Given how much this measurement naturally varies, how likely is it to stay within specification?"
Cp measures how wide the specification window is relative to the natural process spread:
Cp = (USL − LSL) / (6σ)
A Cp of 1.0 means the specification window is exactly 6σ wide: in a perfectly centred process, 99.73 % of parts would pass. A Cp of 1.33 means the window is 8σ wide, i.e. ±4σ around a centred mean, which is the standard production target.
Cpk additionally penalises a process that is off-centre — it tells you the actual margin to the nearest limit:
Cpk = min( (USL − μ) / (3σ), (μ − LSL) / (3σ) )
If Cp > Cpk the process is off-centre. A process can have a high Cp (wide spec) but a
low Cpk (mean drifted close to one limit), which will produce failures even though the
tolerance is generous.
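The two formulas can be checked with a small Python sketch. It uses the population standard deviation, matching the stdDev description in the response fields, and returns None for both indices when a limit is missing or σ = 0. The function name `capability` is hypothetical and this is not the server's implementation.

```python
import statistics


def capability(samples, lsl, usl):
    """Return (cp, cpk), or (None, None) when the indices are undefined."""
    if lsl is None or usl is None or not samples:
        return None, None
    sigma = statistics.pstdev(samples)  # population standard deviation
    if sigma == 0:
        return None, None  # all samples identical
    mu = statistics.fmean(samples)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    return cp, cpk


# Mean 10, sigma 1, limits 4..13: the window is wide but the mean sits near USL.
cp, cpk = capability([9.0, 11.0], lsl=4.0, usl=13.0)
```

Here Cp is 1.5 while Cpk is only 1.0, which is the off-centre case the paragraph above describes.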
| Value | Interpretation |
|---|---|
| < 1.00 | Process is not capable — defects expected even with a centred process |
| 1.00 – 1.33 | Marginally capable — monitor closely, little room for drift |
| ≥ 1.33 | Capable — standard production target |
| ≥ 1.67 | Highly capable (about 5σ between the mean and the nearest limit) |
When Cp/Cpk are applicable:
- Both lowerLimit and upperLimit must be defined in the test YAML (measurement: block).
- At least a few dozen samples are needed for meaningful indices (σ becomes unstable below ~30 samples).
- σ must be > 0 (all-identical samples indicate a stuck sensor, not a real process).
Both values are null when limits are missing or σ = 0. The sample-count guideline does not null the values; it only indicates how far the indices can be trusted.
How to use the results:
| Observation | Suggested action |
|---|---|
| Cp ≥ 1.33, Cpk ≥ 1.33 | Process is well-controlled — no action needed |
| Cp ≥ 1.33, Cpk < 1.00 | Spec is wide enough but the mean has drifted — re-centre the process |
| Cp < 1.00 | Process spread is too wide — tighten the process, not the spec |
| cp / cpk are null | Add low_limit / high_limit to the measurement YAML, or investigate σ = 0 |
Cp and Cpk are null when:
- The measurement definition has no lower or upper limit, or
- σ = 0 (all samples are identical — likely a fixture or software issue)
Response fields:
| Field | Type | Description |
|---|---|---|
| measurementName | string | Measurement definition name |
| sampleCount | int | Total numeric samples across all definition versions |
| mean | double | Arithmetic mean of all pooled samples |
| stdDev | double | Population standard deviation of all pooled samples |
| min / max | double | Observed extremes across all pooled samples |
| lowerLimit / upperLimit | double? | Specification limits from the most recently active definition |
| units | string? | Engineering unit from the most recently active definition |
| cp | double? | Process capability computed against current limits (null if limits unavailable or σ = 0) |
| cpk | double? | Process capability index computed against current limits |
| passCount / failCount | int | Verdicts on individual measurement values across all versions |
| definitionVersions | int | Number of distinct specification versions for this name |
| limitsNote | string? | Non-null when definitionVersions > 1. Explains the pooling and which spec was used. Always show to the user. |
Yield trend (get_yield_trend)
Returns yield bucketed by day, week, or month.
get_yield_trend(testName="FinalTest", bucket="week", from="2025-01-01T00:00:00Z")
Each point in trend.points[]:
| Field | Description |
|---|---|
| bucketStart | UTC start of the bucket |
| bucketLabel | Human-readable label, e.g. "W23 2025", "Jun 2025", "2025-06-04" |
| totalRuns | Executions in this bucket |
| passCount | Passing executions |
| yieldPercent | Pass rate for this bucket |
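As an illustration of weekly bucketing, the following Python sketch groups executions by the Monday that starts their week and emits one point per bucket. It assumes ISO week numbering and UTC bucket boundaries, which this document does not confirm. The function name `weekly_yield` and the (start time, verdict) input shape are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta, timezone


def weekly_yield(executions):
    """executions: iterable of (start_time as aware datetime, verdict string)."""
    buckets = defaultdict(lambda: [0, 0])  # Monday date -> [total, passed]
    for start, verdict in executions:
        day = start.astimezone(timezone.utc).date()
        monday = day - timedelta(days=day.weekday())
        buckets[monday][0] += 1
        buckets[monday][1] += int(verdict == "PASS")
    points = []
    for monday in sorted(buckets):
        total, passed = buckets[monday]
        iso_year, iso_week, _ = monday.isocalendar()
        points.append({
            "bucketStart": datetime.combine(monday, time.min, tzinfo=timezone.utc).isoformat(),
            "bucketLabel": f"W{iso_week} {iso_year}",
            "totalRuns": total,
            "passCount": passed,
            "yieldPercent": round(passed / total * 100, 2),
        })
    return points


utc = timezone.utc
points = weekly_yield([
    (datetime(2025, 6, 2, 8, 0, tzinfo=utc), "PASS"),
    (datetime(2025, 6, 4, 9, 30, tzinfo=utc), "FAIL"),
    (datetime(2025, 6, 9, 7, 15, tzinfo=utc), "PASS"),
])
```

The first two executions fall in the week starting Monday 2 June 2025 and share one bucket; the third opens the next one.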
Worst steps (get_worst_steps)
Returns the steps with the lowest pass rate, ranked ascending.
get_worst_steps(testName="FinalTest", top=5)
get_worst_steps(testName="FinalTest", stationId="station-A", from="2025-06-01T00:00:00Z")
Use this tool to focus engineering attention: fixing a step with 60 % yield is a stronger cost-reduction candidate than improving a step that already yields 95 %.
Unit history (get_unit_history)
Returns all test runs for a serial number, newest first.
get_unit_history(serialNumber="SN-1042")
get_unit_history(serialNumber="SN-1042", page=2, pageSize=10)
Use this tool to answer "why did this unit fail?" or to reconstruct the full test history for a warranty return.
Caveat: specification limit changes split the population
This section explains what happens when lowerLimit or upperLimit are changed in the
test YAML and the package is redeployed.
What happens at the database level
Maestro uses content-hash identity for measurement definitions. Every time a test
package is loaded, the YAML content of each measurement block is hashed. If the hash
already exists in MeasurementDefinitions, the existing row is reused. If the hash is
new (because limits changed), a new row is inserted with the same name but a
different Id and different lowerLimit/upperLimit.
Values recorded before the change point at the old definition Id. Values recorded after
point at the new definition Id. Both rows have the same name, but querying by Id
returns non-overlapping populations.
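The content-hash identity scheme can be illustrated with a small in-memory sketch. The hash algorithm (SHA-256), the JSON canonicalisation, and the `DefinitionStore` class are assumptions made for this example; this document does not specify how Maestro computes or stores the hash.

```python
import hashlib
import json


def definition_hash(block):
    """Hash a measurement block in a key-order-independent way (assumed scheme)."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DefinitionStore:
    """Hypothetical stand-in for the MeasurementDefinitions table."""

    def __init__(self):
        self._rows = {}  # content hash -> row
        self._next_id = 1

    def load(self, block):
        key = definition_hash(block)
        if key not in self._rows:  # new content: insert a new row
            self._rows[key] = {"id": self._next_id, **block}
            self._next_id += 1
        return self._rows[key]["id"]  # known content: reuse the existing row

    def definition_versions(self, name):
        return sum(1 for row in self._rows.values() if row["name"] == name)


store = DefinitionStore()
v1 = {"name": "VOUT_5V", "low_limit": 4.75, "high_limit": 5.25, "unit": "V"}
v2 = {**v1, "low_limit": 4.80}
first_id = store.load(v1)
same_id = store.load(dict(v1))  # identical content, row reused
new_id = store.load(v2)         # limits changed, new row with the same name
```

Reloading unchanged content keeps the same Id, while a changed limit produces a second row under the same name, so definitionVersions becomes 2.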
What this means for statistics
Before pooling across definition versions was introduced, get_measurement_statistics used FirstOrDefault when
resolving the definition — it returned a database-order-dependent row, silently querying
only the samples attached to that one definition. A programme with limits tightened at
month 3 of 6 could return half the true sample population with no warning.
The endpoint now:
- Loads all definition rows for the name.
- Queries values from all of them (the full population).
- Identifies the most recently active definition (the one whose values carry the latest timestamp) and uses its limits for lowerLimit, upperLimit, Cp, and Cpk.
- Returns definitionVersions (the count) and limitsNote (a plain-English explanation when definitionVersions > 1).
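The pooling steps listed above can be sketched in Python. The function `pool_definitions`, the dictionary shapes, and the shortened note text are hypothetical; the real limitsNote wording is produced by the server.

```python
from datetime import datetime, timezone


def pool_definitions(definitions, values):
    """Pool values across all definition rows and pick the current limits."""
    ids = {d["id"] for d in definitions}
    pooled = [v for v in values if v["definitionId"] in ids]
    if not pooled:
        return None
    # Most recently active definition: the one owning the latest value.
    latest = max(pooled, key=lambda v: v["timestamp"])
    current = next(d for d in definitions if d["id"] == latest["definitionId"])
    versions = len(definitions)
    note = None
    if versions > 1:
        # Placeholder text for the sketch, not the server's actual note.
        note = f"{len(pooled)} samples pooled across {versions} definition versions."
    return {
        "sampleCount": len(pooled),
        "lowerLimit": current["lowerLimit"],
        "upperLimit": current["upperLimit"],
        "definitionVersions": versions,
        "limitsNote": note,
    }


utc = timezone.utc
definitions = [
    {"id": 1, "lowerLimit": 4.70, "upperLimit": 5.30},
    {"id": 2, "lowerLimit": 4.75, "upperLimit": 5.25},
]
values = [
    {"definitionId": 1, "value": 4.95, "timestamp": datetime(2025, 1, 10, tzinfo=utc)},
    {"definitionId": 2, "value": 5.01, "timestamp": datetime(2025, 4, 2, tzinfo=utc)},
    {"definitionId": 2, "value": 4.98, "timestamp": datetime(2025, 4, 3, tzinfo=utc)},
]
result = pool_definitions(definitions, values)
```

All three samples are counted, and the limits come from definition 2 because it owns the latest value.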
How to use the response correctly
Step 1 — check definitionVersions
{ "definitionVersions": 2, "sampleCount": 1924, ... }
If definitionVersions is 1, nothing special. If it is 2 or more, proceed to step 2.
Step 2 — read and show limitsNote verbatim
{
"limitsNote": "Specification limits for 'VOUT_5V' changed 1 time(s) during the programme,
creating 2 definition versions. All 1924 samples are pooled across all versions.
The lowerLimit, upperLimit, and Cp/Cpk values reflect the most recently active
specification only. Samples recorded under older specifications may have been
accepted or rejected against different limits."
}
Always present this text to the user before showing Cp/Cpk. Do not paraphrase or omit it. The user needs to know that some historical samples were gated against different limits before deciding whether the pooled Cp/Cpk is meaningful for their purpose.
Step 3 — decide whether the pooled statistics answer the question
Ask: "Was the limit change intentional process knowledge (tighter spec, same process) or a correction of an error (wrong limits deployed for a period)?"
| Scenario | Appropriate action |
|---|---|
| Limits were tightened as the process matured | Pooled statistics are conservative — Cp/Cpk may be lower than the process actually achieves under the current spec. Consider adding a from date filter to isolate the current-spec period. |
| An incorrect limit was deployed by mistake | The pooled sampleCount is inflated. Use from/to to filter to the correct-spec period only. |
| Limits were loosened (relaxed) | Some historical pass/fail verdicts no longer reflect the current gate. The passCount/failCount in the response reflects the verdict at the time of measurement, not re-evaluated against current limits. |
Step 4 — narrow the sample set if needed
If the user needs statistics for the current specification only, filter by date:
get_measurement_statistics(
measurementName="VOUT_5V",
from="<date the new spec was deployed>T00:00:00Z"
)
This restricts TestExecution.StartTime to after the change, effectively isolating the
current-spec population.
How to present data from a multi-version measurement to users
When reporting on a measurement where definitionVersions > 1, a well-formed response
looks like this:
VOUT_5V — 1924 samples pooled across 2 specification versions.
Mean: 4.972 V | StdDev: 0.031 V | Min: 4.891 V | Max: 5.048 V
Current specification: 4.75–5.25 V → Cp: 1.07, Cpk: 0.96 (marginally not capable)
⚠️ Note: Specification limits changed once during the programme. These statistics cover the full population (both before and after the change). Cp/Cpk reflect the current specification only. If you need statistics for the current spec period alone, filter using the from parameter with the date the new spec was deployed.
If the user asks a direct question like "Is VOUT_5V capable?", the correct answer is:
"Based on the pooled population of 1924 samples, Cpk is 0.96 against the current spec — marginally not capable. However, the limits were changed once during the programme, so some of these samples were gated against a different specification. To get a precise answer for the current spec, provide the date from which the current limits have been in use."
Never answer "Yes, it is capable" or "No, it is not capable" without disclosing
definitionVersions > 1 when it applies.
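A client that formats responses could enforce the disclosure rule mechanically. The sketch below is one hypothetical way to do that in Python: `present_capability` is an invented helper, and the example limitsNote is a shortened placeholder, not the server's full text.

```python
def present_capability(stats):
    """Build a user-facing summary from a get_measurement_statistics-style dict."""
    lines = []
    note = stats.get("limitsNote")
    if note:
        lines.append(note)  # verbatim, and before any Cp/Cpk figure
    name = stats["measurementName"]
    if stats.get("cp") is None or stats.get("cpk") is None:
        lines.append(f"{name}: Cp/Cpk unavailable (no limits or sigma = 0).")
    else:
        lines.append(
            f"{name}: Cp {stats['cp']:.2f}, Cpk {stats['cpk']:.2f} "
            f"over {stats['sampleCount']} pooled samples."
        )
    return "\n".join(lines)


example = {
    "measurementName": "VOUT_5V",
    "sampleCount": 1924,
    "cp": 1.07,
    "cpk": 0.96,
    "definitionVersions": 2,
    # Shortened placeholder for the sketch.
    "limitsNote": "Specification limits for 'VOUT_5V' changed 1 time(s) during the programme.",
}
text = present_capability(example)
```

Because the note is emitted first, a capability figure can never reach the user without the accompanying disclosure.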
Example AI prompts
The following prompts work well with Claude, GitHub Copilot, or any MCP-compatible client connected to a running Maestro MCP server.
Yield analysis
What was the yield for FinalTest last month?
Which step in FinalTest failed most often in the last 30 days?
Show me the yield trend for FinalTest week by week since January.
Measurement name discovery
What measurement points are available on this station?
List all voltage-related measurements I can analyse.
Measurement quality
Is OutputVoltage_mV in control for the PowerSupplyTest? Give me Cp and Cpk.
What were the min, max, and mean for ContactResistance_Ohm last week?
Which measurements have a Cpk below 1.33? Start by listing all measurement names.
Show me all measurements that are not process capable (use get_capability_summary).
Which measurements are actively producing defects (Cpk < 1.00)?
Unit traceability
What tests has serial number SN-1042 completed? Did it ever fail?
Background service
StatisticsBackgroundService (in WorkflowEngine.Orchestra) continuously pre-computes
statistics so that API responses are fast even for large datasets.
| Behaviour | Detail |
|---|---|
| Poll interval | 60 seconds |
| Processing guard | Only processes executions completed > 5 minutes ago |
| Watermark | Uses the latest ComputedAt timestamp in stat_cache as a cursor to avoid reprocessing old data |
| Cache validity | API controller serves cached data if less than 60 minutes old |
| Upsert strategy | Uses EF FirstOrDefault + update; unique index on (TestName, StepName, MeasurementName, StationId, BucketType, BucketDate) prevents duplicates |
The service writes the following row types per test definition:
| Row type | StepName | MeasurementName | Description |
|---|---|---|---|
| Test yield — fleet-wide | null | null | Overall pass rate, all stations |
| Test yield — per station | null | null | Pass rate per StationId |
| Step yield — fleet-wide | set | null | Pass rate per step |
| Measurement statistics | null | set | Mean, σ, Cp, Cpk per measurement |
Troubleshooting
get_yield returns totalRuns: 0
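- Check that testName matches the name field in the test YAML exactly (case-sensitive).
- The stat_cache may not yet have data: the background service only processes executions completed more than 5 minutes ago and polls every 60 seconds, so allow roughly 6 minutes after the first execution completes.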
sampleCount is lower than expected
- Check definitionVersions. If it is > 1, the pooling is working and the count is correct; all versions are included. Re-read limitsNote and consider whether a from date filter is appropriate.
- If definitionVersions is 1 and the count still seems low, check the from/to date filter and verify the testName/stationId filters are not over-restricting.
cp and cpk are null
- The current measurement definition has no lowerLimit/upperLimit in the YAML.
- Add low_limit and high_limit to the measurement definition and re-run a test.
- If definitionVersions > 1: prior versions may have had limits but the current one does not. Check the YAML and confirm limits are present in the deployed version.
Statistics are stale
- Check that WorkflowEngine.Orchestra is running and healthy.
- Inspect Orchestra logs for StatisticsBackgroundService error messages.
- Measurements from executions completed less than 5 minutes ago will not appear yet.
passCount seems wrong when limits changed
- passCount and failCount reflect the verdict recorded at the time of each measurement, i.e., evaluated against the limits that were active when the test ran.
- They are not re-evaluated when limits change. This is by design: the verdict is a historical fact, not a projection.
- If you need to re-evaluate historical samples against the current spec, that requires a custom query against the raw MeasurementValues table.
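Once the raw values have been exported, re-evaluating them against the current limits is a simple count. The Python sketch below assumes inclusive limits (a value equal to a limit passes), which this document does not state. The function name `reevaluate` is hypothetical.

```python
def reevaluate(values, lower, upper):
    """Count how many raw values would pass against the given limits."""
    # Assumption: limits are inclusive on both sides.
    passed = sum(1 for v in values if lower <= v <= upper)
    return {"passCount": passed, "failCount": len(values) - passed}


counts = reevaluate([4.70, 4.80, 5.00, 5.30], lower=4.75, upper=5.25)
```

Two of the four sample values fall inside 4.75 to 5.25 and would pass under those limits.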
Log search (search_execution_logs)
Finds log text from past executions using PostgreSQL full-text search. For full details see live-events.md.
search_execution_logs(query="timeout error")
search_execution_logs(query="calibration failure", testExecutionId="<uuid>")
search_execution_logs(query="drift", stepName="RF Calibration")
Indexing latency: Logs are indexed by EmbeddingBackgroundService in Orchestra
5–65 seconds after execution completion. If totalHits is 0 for a very recent execution,
wait and retry.