product: maestro
audience: test-developer, operator, ai-assistant
authority: normative

MCP Server — Statistics Tools

The statistics tools let an AI assistant (or a human operator via the AI chat interface) answer yield, quality, and traceability questions about test data without writing SQL or opening a dashboard.

Architecture overview

AI assistant
    │  get_yield / get_measurement_statistics / …
    ▼
WorkflowEngine.McpServer  ──HTTP──▶  WorkflowEngine.Api
                                         │  GET /api/statistics/*
                                         ▼
                                    stat_cache table  (PostgreSQL)
                                         ▲
                               StatisticsBackgroundService
                               (WorkflowEngine.Orchestra, 60 s poll)
                                         │  reads
                                    TestExecutions / StepExecutions /
                                    MeasurementValues tables

Hot path: The API controller checks the stat_cache table first (valid for 60 minutes). If the cache is warm it returns in milliseconds. If it is stale or absent it falls back to a live query over the execution tables.
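
The cache-first read described above can be sketched in a few lines of Python. This is an illustration only, not the controller's actual code: `read_statistic`, the `cache` dictionary, and the `live_query` callable are hypothetical stand-ins for the stat_cache table and the live query over the execution tables.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(minutes=60)  # cache validity stated above


def read_statistic(key, cache, live_query, now=None):
    """Return (value, source) using a cache-first lookup.

    `cache` maps key -> (value, computed_at); `live_query` is a callable.
    Both are hypothetical stand-ins, not Maestro APIs.
    """
    now = now or datetime.now(timezone.utc)
    entry = cache.get(key)
    if entry is not None:
        value, computed_at = entry
        if now - computed_at < CACHE_TTL:
            return value, "cache"  # warm cache: no query over execution tables
    # Stale or absent entry: fall back to the live query.
    return live_query(key), "live"
```

A stale entry and a missing entry take the same fallback path, which matches the behaviour described for the hot path.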

Cache writer: StatisticsBackgroundService runs in Orchestra, polls every 60 seconds, and writes pre-computed yield and Cp/Cpk rows for every test definition whose executions changed since the last run. It only processes executions that completed more than 5 minutes ago to avoid racing with in-flight writes.

Tools reference

See tools-reference.md for the full parameter table. Quick summary:

Tool	Question it answers
`list_measurement_names`	"What measurement points are available on this station?"
`get_capability_summary`	"Which measurements have poor Cpk across the whole programme?"
`get_yield`	"What is the pass rate for FinalTest this week?"
`get_measurement_statistics`	"Is the OutputVoltage measurement process capable (Cp/Cpk ≥ 1.33)?"
`get_yield_trend`	"How has yield trended over the last 3 months?"
`get_worst_steps`	"Which steps fail most often?"
`get_unit_history`	"What tests has serial number SN-1042 been through?"

Yield (`get_yield`)

Returns pass rate for a test or a single step within a test.

get_yield(testName="FinalTest")
get_yield(testName="FinalTest", stepName="RF Calibration", from="2025-01-01T00:00:00Z")
get_yield(testName="FinalTest", stationId="station-A")

Response fields:

Field	Type	Description
`testName`	string	Echoed from request
`stepName`	string?	Null for test-level yield
`stationId`	string?	Null for fleet-wide aggregate
`totalRuns`	int	Total executions in the date range
`passCount`	int	Executions with verdict PASS
`failCount`	int	`totalRuns - passCount`
`yieldPercent`	double	`passCount / totalRuns × 100`, rounded to 2 d.p.
`from` / `to`	DateTimeOffset?	Echoed from request
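
The arithmetic behind these fields is simple enough to check by hand. The following Python sketch reproduces it from a list of verdict strings. Returning `None` for `yieldPercent` when there are no runs is an assumption of this example; the source does not state what the API returns in that case.

```python
def yield_summary(verdicts):
    """Compute the yield fields from a list of verdict strings."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v == "PASS")
    # Assumption: no runs means no meaningful percentage.
    percent = round(passed / total * 100, 2) if total else None
    return {
        "totalRuns": total,
        "passCount": passed,
        "failCount": total - passed,
        "yieldPercent": percent,
    }
```

For example, two passes out of three runs give a `yieldPercent` of 66.67.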

Measurement name discovery (`list_measurement_names`)

Before querying statistics you need the exact measurementName as it is registered in the test definition. Use list_measurement_names to discover what names exist.

Results are deduplicated by name — each distinct name appears once.

list_measurement_names()                           # all names (up to 50)
list_measurement_names(search="VOUT")              # all names containing "VOUT"
list_measurement_names(search="resistance", limit=100)

Response fields per entry:

Field	Description
`name`	Exact name to pass to `get_measurement_statistics`
`lowerLimit`	Lower specification limit (indicative — from one of potentially several versions)
`upperLimit`	Upper specification limit (indicative)
`units`	Engineering unit, e.g. `mV`, `Ω`, `°C`
`definitionVersions`	Number of distinct specification versions for this name
`sampleCount`	Pre-computed sample count from stat_cache; `null` if no executions yet
`cp`	Pre-computed Cp from stat_cache; `null` if no limits, σ = 0, or no samples
`cpk`	Pre-computed Cpk from stat_cache; `null` if no limits, σ = 0, or no samples

The sampleCount, cp, and cpk fields come from the stat_cache and allow callers to identify poor-capability measurements without making N follow-up calls. They are null for measurements with no executions yet. For measurements with definitionVersions > 1 these values reflect the most-recently-computed cache entry rather than pooled statistics — always follow up with get_measurement_statistics before acting on Cp/Cpk in that case.

To find all measurements below a Cpk threshold in one call, use get_capability_summary instead.

What definitionVersions tells you:

Every time the low_limit, high_limit, or unit fields are changed in the test YAML and the package is redeployed, a new definition row is inserted with the same name but a different content hash. definitionVersions counts those rows.

definitionVersions = 1 — the specification has never changed. Statistics are straightforward: all samples were recorded against the same limits, and Cp/Cpk is directly comparable across the entire date range.
definitionVersions > 1 — the specification was changed at least once. This has consequences for how statistics should be interpreted. Read the Caveat: limit changes section before proceeding.

Capability summary (`get_capability_summary`)

Returns all measurement points whose Cpk is below a given threshold, sorted worst-first. Served entirely from the pre-computed stat_cache — one call replaces the N+1 pattern of calling list_measurement_names followed by get_measurement_statistics for every entry when scanning a large population for quality problems.

get_capability_summary()                                       # all with Cpk < 1.33
get_capability_summary(maxCpk=1.00)                           # only actively defective
get_capability_summary(maxCpk=1.33, testName="FinalTest")     # scoped to one test
get_capability_summary(maxCpk=1.67, stationId="station-A")    # scoped to one station

When to use vs get_measurement_statistics:

Use case	Tool
"Which measurements are poorly capable?" (fleet scan)	`get_capability_summary`
"Is VOUT_5V capable? Give me full detail."	`get_measurement_statistics`
"What measurements exist?"	`list_measurement_names`

Response fields per item:

Field	Description
`measurementName`	Measurement point name
`testName`	Test definition this entry came from
`stationId`	Station — null for fleet-wide cache entries
`sampleCount`	Number of numeric samples in the cache row
`mean` / `stdDev` / `min` / `max`	Descriptive statistics
`lowerLimit` / `upperLimit` / `units`	Specification from the cache entry
`cp` / `cpk`	Capability indices (always non-null in this response — entries with null Cpk are excluded)
`definitionVersions`	`> 1` signals that limits changed; follow up with `get_measurement_statistics` for pooled analysis
`computedAt`	UTC timestamp when this cache row was last written

Important: When a measurement name appears in multiple tests, only the row with the lowest Cpk (worst case) is returned. The testName field identifies which test produced it.
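
The worst-case selection can be pictured as follows. This is a minimal Python sketch over in-memory rows, assuming each row is a dictionary with the `measurementName`, `testName`, and `cpk` fields from the table above; it is not the server's query.

```python
def capability_summary(rows, max_cpk=1.33):
    """Keep the lowest-Cpk row per measurement name, sorted worst-first."""
    worst = {}
    for row in rows:
        cpk = row.get("cpk")
        if cpk is None or cpk >= max_cpk:
            continue  # null Cpk entries and capable entries are excluded
        name = row["measurementName"]
        if name not in worst or cpk < worst[name]["cpk"]:
            worst[name] = row
    return sorted(worst.values(), key=lambda r: r["cpk"])
```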

definitionVersions caveat: The stat_cache stores per-definition statistics rather than pooled. For definitionVersions > 1, the Cp/Cpk here is for one definition version only. Use get_measurement_statistics (which pools all versions) for the authoritative figure before concluding the process is or is not capable.

Measurement statistics (`get_measurement_statistics`)

Returns descriptive statistics and process capability indices for a named numeric measurement point. Non-numeric measurement points (boolean, string) are ignored.

Samples are always pooled across all specification versions of the name. This means sampleCount always reflects the complete historical population — there is no silent truncation when limits changed.

Typical workflow:

1. Call list_measurement_names(search="...") to find the exact name and check definitionVersions.
2. If definitionVersions > 1, read the caveat below before interpreting results.
3. Call get_measurement_statistics(measurementName="...") to get statistics.
4. If limitsNote is non-null, show it to the user verbatim before presenting Cp/Cpk.
5. Optionally narrow the sample set with testName, stationId, from, to.

get_measurement_statistics(measurementName="OutputVoltage_mV")
get_measurement_statistics(measurementName="OutputVoltage_mV", testName="PowerSupplyTest", from="2025-06-01T00:00:00Z")
get_measurement_statistics(measurementName="VOUT_5V", stationId="station-A")

Cp and Cpk explained

Process capability indices answer the question: "Given how much this measurement naturally varies, how likely is it to stay within specification?"

Cp measures how wide the specification window is relative to the natural process spread:

Cp = (USL − LSL) / (6σ)

A Cp of 1.0 means the specification window is exactly 6σ wide: in a perfectly centred, normally distributed process, 99.73 % of parts would pass. A Cp of 1.33 means the window is 8σ wide, leaving ±4σ between a centred mean and the limits (the standard production target).

Cpk additionally penalises a process that is off-centre — it tells you the actual margin to the nearest limit:

Cpk = min( (USL − μ) / (3σ),  (μ − LSL) / (3σ) )

If Cp > Cpk the process is off-centre. A process can have a high Cp (wide spec) but a low Cpk (mean drifted close to one limit), which will produce failures even though the tolerance is generous.
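
The two formulas can be tried out with a short Python sketch. It uses the population standard deviation, in line with the `stdDev` response field, and returns `None` for both indices when a limit is missing, there are no samples, or σ = 0. The function name `capability` is illustrative and not part of the Maestro API.

```python
from statistics import fmean, pstdev


def capability(samples, lsl, usl):
    """Return (cp, cpk), or (None, None) when the indices are undefined."""
    if lsl is None or usl is None or not samples:
        return None, None
    mu = fmean(samples)
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return None, None
    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    return cp, cpk
```

For a centred sample set the two indices coincide. Shifting the mean towards one limit leaves Cp unchanged and lowers Cpk.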

Value	Interpretation
< 1.00	Process is not capable — defects expected even with a centred process
1.00 – 1.33	Marginally capable — monitor closely, little room for drift
≥ 1.33	Capable — standard production target
≥ 1.67	Highly capable (nearest limit at least 5σ from the mean)
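
A client that wants to label an index consistently can encode the bands above as a small helper. This Python sketch is one possible mapping; the function name and the returned strings are illustrative.

```python
def interpret(index):
    """Map a Cp or Cpk value to the interpretation bands in the table."""
    if index is None:
        return "not available"
    if index < 1.00:
        return "not capable"
    if index < 1.33:
        return "marginally capable"
    if index < 1.67:
        return "capable"
    return "highly capable"
```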

When Cp/Cpk are applicable:

Both lowerLimit and upperLimit must be defined in the test YAML (measurement: block).
At least a few dozen samples are needed for meaningful indices (σ becomes unstable below ~30 samples).
σ must be > 0 (all-identical samples indicate a stuck sensor, not a real process).

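Both values are null when limits are missing or σ = 0. The sample-size figure is guidance for interpreting the indices: the null conditions listed for the response fields do not include a minimum sample count.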

How to use the results:

Observation	Suggested action
Cp ≥ 1.33, Cpk ≥ 1.33	Process is well-controlled — no action needed
Cp ≥ 1.33, Cpk < 1.00	Spec is wide enough but the mean has drifted — re-centre the process
Cp < 1.00	Process spread is too wide — tighten the process, not the spec
cp / cpk are null	Add `low_limit` / `high_limit` to the measurement YAML, or investigate σ = 0

Cp and Cpk are null when:

The measurement definition has no lower or upper limit, or
σ = 0 (all samples are identical — likely a fixture or software issue)

Response fields:

Field	Type	Description
`measurementName`	string	Measurement definition name
`sampleCount`	int	Total numeric samples across all definition versions
`mean`	double	Arithmetic mean of all pooled samples
`stdDev`	double	Population standard deviation of all pooled samples
`min` / `max`	double	Observed extremes across all pooled samples
`lowerLimit` / `upperLimit`	double?	Specification limits from the most recently active definition
`units`	string?	Engineering unit from the most recently active definition
`cp`	double?	Process capability computed against current limits (null if limits unavailable or σ = 0)
`cpk`	double?	Process capability index computed against current limits
`passCount` / `failCount`	int	Verdicts on individual measurement values across all versions
`definitionVersions`	int	Number of distinct specification versions for this name
`limitsNote`	string?	Non-null when `definitionVersions > 1`. Explains the pooling and which spec was used. Always show to the user.

Yield trend (`get_yield_trend`)

Returns yield bucketed by day, week, or month.

get_yield_trend(testName="FinalTest", bucket="week", from="2025-01-01T00:00:00Z")

Each point in trend.points[]:

Field	Description
`bucketStart`	UTC start of the bucket
`bucketLabel`	Human-readable label, e.g. `"W23 2025"`, `"Jun 2025"`, `"2025-06-04"`
`totalRuns`	Executions in this bucket
`passCount`	Passing executions
`yieldPercent`	Pass rate for this bucket
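
To make the bucketing concrete, here is a minimal Python sketch that groups (timestamp, verdict) pairs into day, week, or month buckets. It assumes weeks start on Monday and are labelled with the ISO week number, which is consistent with the example labels above but is not confirmed by the source. The exact label padding is also an assumption.

```python
from datetime import datetime, timedelta, timezone


def bucket_start(ts, bucket):
    """Truncate a timezone-aware timestamp to the UTC start of its bucket."""
    day = ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if bucket == "day":
        return day
    if bucket == "week":
        return day - timedelta(days=day.weekday())  # Monday start (assumption)
    if bucket == "month":
        return day.replace(day=1)
    raise ValueError(f"unknown bucket: {bucket}")


def bucket_label(start, bucket):
    if bucket == "day":
        return start.strftime("%Y-%m-%d")
    if bucket == "week":
        iso_year, iso_week, _ = start.isocalendar()
        return f"W{iso_week} {iso_year}"
    return start.strftime("%b %Y")  # English names in the default C locale


def yield_trend(executions, bucket):
    """`executions` is an iterable of (timestamp, verdict) pairs."""
    counts = {}
    for ts, verdict in executions:
        start = bucket_start(ts, bucket)
        total, passed = counts.get(start, (0, 0))
        counts[start] = (total + 1, passed + (verdict == "PASS"))
    return [
        {
            "bucketStart": start,
            "bucketLabel": bucket_label(start, bucket),
            "totalRuns": total,
            "passCount": passed,
            "yieldPercent": round(passed / total * 100, 2),
        }
        for start, (total, passed) in sorted(counts.items())
    ]
```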

Worst steps (`get_worst_steps`)

Returns the steps with the lowest pass rate, ranked ascending.

get_worst_steps(testName="FinalTest", top=5)
get_worst_steps(testName="FinalTest", stationId="station-A", from="2025-06-01T00:00:00Z")

Use this tool to focus engineering attention: fixing a step with 60 % yield is usually a stronger cost-reduction candidate than improving a step that already yields 95 %.
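
The ranking itself amounts to computing a pass rate per step and sorting ascending. The Python sketch below shows the idea on (stepName, verdict) pairs. The output keys are illustrative, since the response schema for this tool is not listed here.

```python
def worst_steps(step_results, top=5):
    """Rank steps by pass rate, lowest first, and keep the first `top`."""
    totals = {}
    for step, verdict in step_results:
        runs, passed = totals.get(step, (0, 0))
        totals[step] = (runs + 1, passed + (verdict == "PASS"))
    ranked = [
        {
            "stepName": step,
            "totalRuns": runs,
            "yieldPercent": round(passed / runs * 100, 2),
        }
        for step, (runs, passed) in totals.items()
    ]
    ranked.sort(key=lambda item: item["yieldPercent"])
    return ranked[:top]
```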

Unit history (`get_unit_history`)

Returns all test runs for a serial number, newest first.

get_unit_history(serialNumber="SN-1042")
get_unit_history(serialNumber="SN-1042", page=2, pageSize=10)

Use this tool to answer "why did this unit fail?" or to reconstruct the full test history for a warranty return.

Caveat: specification limit changes split the population

This section explains what happens when lowerLimit or upperLimit are changed in the test YAML and the package is redeployed.

What happens at the database level

Maestro uses content-hash identity for measurement definitions. Every time a test package is loaded, the YAML content of each measurement block is hashed. If the hash already exists in MeasurementDefinitions, the existing row is reused. If the hash is new (because limits changed), a new row is inserted with the same name but a different Id and different lowerLimit/upperLimit.

Values recorded before the change reference the old definition Id; values recorded after it reference the new definition Id. Both rows have the same name, but querying by Id returns non-overlapping populations.
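
Content-hash identity can be illustrated with a small Python sketch. The hashing scheme shown here (SHA-256 over canonical JSON) and the helper names are assumptions for the example only; the source does not say which hash or serialisation Maestro uses.

```python
import hashlib
import json


def definition_hash(block):
    """Hash a measurement block deterministically (assumed scheme)."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def resolve_definition(definitions, block):
    """Reuse the row for a known hash, otherwise insert a new row."""
    key = definition_hash(block)
    if key not in definitions:
        definitions[key] = {"id": len(definitions) + 1, **block}
    return definitions[key]["id"]
```

Loading the same block twice yields the same Id, while changing a limit yields a new Id under the same name. That split is what definitionVersions counts.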

What this means for statistics

Before pooling was introduced, get_measurement_statistics used FirstOrDefault when resolving the definition. It picked whichever row the database returned first and silently queried only the samples attached to that one definition. A programme whose limits were tightened at month 3 of 6 could return roughly half the true sample population with no warning.

The endpoint now:

1. Loads all definition rows for the name.
2. Queries values from all of them (the full population).
3. Identifies the most recently active definition (the one whose values carry the latest timestamp) and uses its limits for lowerLimit, upperLimit, Cp, and Cpk.
4. Returns definitionVersions (the count) and limitsNote (a plain-English explanation when definitionVersions > 1).
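
The four steps above can be sketched in Python over in-memory rows. The row shapes are hypothetical simplifications of the real tables, and the note text is a placeholder rather than the wording the endpoint produces.

```python
from statistics import fmean, pstdev


def pooled_statistics(name, definitions, values):
    """Pool samples across every definition row that shares `name`.

    `definitions`: dicts with id, name, lowerLimit, upperLimit.
    `values`: dicts with definitionId, value, timestamp.
    """
    defs = {d["id"]: d for d in definitions if d["name"] == name}  # step 1
    pooled = [v for v in values if v["definitionId"] in defs]      # step 2
    if not pooled:
        return None
    latest = max(pooled, key=lambda v: v["timestamp"])              # step 3
    current = defs[latest["definitionId"]]
    samples = [v["value"] for v in pooled]
    mu, sigma = fmean(samples), pstdev(samples)
    lsl, usl = current["lowerLimit"], current["upperLimit"]
    cp = cpk = None
    if lsl is not None and usl is not None and sigma > 0:
        cp = (usl - lsl) / (6 * sigma)
        cpk = min(usl - mu, mu - lsl) / (3 * sigma)
    versions = len(defs)                                           # step 4
    note = None
    if versions > 1:
        # Placeholder text, not the endpoint's actual wording.
        note = f"{len(samples)} samples pooled across {versions} versions."
    return {
        "sampleCount": len(samples), "mean": mu, "stdDev": sigma,
        "lowerLimit": lsl, "upperLimit": usl, "cp": cp, "cpk": cpk,
        "definitionVersions": versions, "limitsNote": note,
    }
```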

How to use the response correctly

Step 1 — check definitionVersions

{ "definitionVersions": 2, "sampleCount": 1924, ... }

If definitionVersions is 1, no special handling is needed. If it is 2 or more, proceed to step 2.

Step 2 — read and show limitsNote verbatim

{
  "limitsNote": "Specification limits for 'VOUT_5V' changed 1 time(s) during the programme, creating 2 definition versions. All 1924 samples are pooled across all versions. The lowerLimit, upperLimit, and Cp/Cpk values reflect the most recently active specification only. Samples recorded under older specifications may have been accepted or rejected against different limits."
}

Always present this text to the user before showing Cp/Cpk. Do not paraphrase or omit it. The user needs to know that some historical samples were gated against different limits before deciding whether the pooled Cp/Cpk is meaningful for their purpose.

Step 3 — decide whether the pooled statistics answer the question

Ask: "Did the limit change reflect new process knowledge (tighter spec, same process), or did it correct an error (wrong limits deployed for a period)?"

Scenario	Appropriate action
Limits were tightened as the process matured	Pooled statistics are conservative — Cp/Cpk may be lower than the process actually achieves under the current spec. Consider adding a `from` date filter to isolate the current-spec period.
An incorrect limit was deployed by mistake	The pooled sampleCount is inflated. Use `from`/`to` to filter to the correct-spec period only.
Limits were loosened (relaxed)	Some historical pass/fail verdicts no longer reflect the current gate. The `passCount`/`failCount` in the response reflects the verdict at the time of measurement, not re-evaluated against current limits.

Step 4 — narrow the sample set if needed

If the user needs statistics for the current specification only, filter by date:

get_measurement_statistics(
  measurementName="VOUT_5V",
  from="<date the new spec was deployed>T00:00:00Z"
)

This restricts the sample set to executions whose TestExecution.StartTime falls on or after that date, effectively isolating the current-spec population.

How to present data from a multi-version measurement to users

When reporting on a measurement where definitionVersions > 1, a well-formed response looks like this:

VOUT_5V — 1924 samples pooled across 2 specification versions.

Mean: 4.974 V | StdDev: 0.078 V | Min: 4.712 V | Max: 5.231 V

Current specification: 4.75–5.25 V → Cp: 1.07, Cpk: 0.96 (not capable)

⚠️ Note: Specification limits changed once during the programme. These statistics cover the full population (both before and after the change). Cp/Cpk reflect the current specification only. If you need statistics for the current spec period alone, filter using the from parameter with the date the new spec was deployed.

If the user asks a direct question like "Is VOUT_5V capable?", the correct answer is:

"Based on the pooled population of 1924 samples, Cpk is 0.96 against the current spec, which is just below the 1.00 threshold, so the process is not capable. However, the limits were changed once during the programme, so some of these samples were gated against a different specification. To get a precise answer for the current spec, provide the date from which the current limits have been in use."

Never answer "Yes, it is capable" or "No, it is not capable" without disclosing definitionVersions > 1 when it applies.
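
A client can enforce this ordering mechanically. The Python sketch below builds a plain-text answer that always places limitsNote, verbatim, ahead of the Cp/Cpk line. The function name and output layout are illustrative; only the field names come from the response schema.

```python
def present_capability(stats):
    """Format a capability answer with limitsNote shown before Cp/Cpk."""
    lines = [f"{stats['measurementName']}: {stats['sampleCount']} samples"]
    note = stats.get("limitsNote")
    if note:
        lines.append(note)  # verbatim, never paraphrased
    cp, cpk = stats.get("cp"), stats.get("cpk")
    if cp is None or cpk is None:
        lines.append("Cp/Cpk: not available")
    else:
        lines.append(f"Cp: {cp:.2f}, Cpk: {cpk:.2f}")
    return "\n".join(lines)
```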

Example AI prompts

The following prompts work well with Claude, GitHub Copilot, or any MCP-compatible client connected to a running Maestro MCP server.

Yield analysis

What was the yield for FinalTest last month?

Which step in FinalTest failed most often in the last 30 days?

Show me the yield trend for FinalTest week by week since January.

Measurement name discovery

What measurement points are available on this station?

List all voltage-related measurements I can analyse.

Measurement quality

Is OutputVoltage_mV in control for the PowerSupplyTest? Give me Cp and Cpk.

What were the min, max, and mean for ContactResistance_Ohm last week?

Which measurements have a Cpk below 1.33? Start by listing all measurement names.

Show me all measurements that are not process capable (use get_capability_summary).

Which measurements are actively producing defects (Cpk < 1.00)?

Unit traceability

What tests has serial number SN-1042 completed? Did it ever fail?

Background service

StatisticsBackgroundService (in WorkflowEngine.Orchestra) continuously pre-computes statistics so that API responses are fast even for large datasets.

Behaviour	Detail
Poll interval	60 seconds
Processing guard	Only processes executions completed > 5 minutes ago
Watermark	Uses the latest `ComputedAt` timestamp in `stat_cache` as a cursor to avoid reprocessing old data
Cache validity	API controller serves cached data if less than 60 minutes old
Upsert strategy	Uses EF `FirstOrDefault` + update; unique index on `(TestName, StepName, MeasurementName, StationId, BucketType, BucketDate)` prevents duplicates
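
The upsert strategy can be pictured with an in-memory stand-in for the table. In this Python sketch a dictionary keyed by the six unique-index columns plays the role of the unique index; the real service uses EF against PostgreSQL, so this shows the idea rather than the implementation.

```python
KEY_FIELDS = (
    "TestName", "StepName", "MeasurementName",
    "StationId", "BucketType", "BucketDate",
)


def upsert(cache, row):
    """Insert the row, or update the existing row with the same key."""
    key = tuple(row.get(field) for field in KEY_FIELDS)
    existing = cache.get(key)
    if existing is None:
        cache[key] = dict(row)
        return "inserted"
    existing.update(row)
    return "updated"
```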

The service writes the following row types per test definition:

Row type	`StepName`	`MeasurementName`	Description
Test yield — fleet-wide	null	null	Overall pass rate, all stations
Test yield — per station	null	null	Pass rate per `StationId`
Step yield — fleet-wide	set	null	Pass rate per step
Measurement statistics	null	set	Mean, σ, Cp, Cpk per measurement

Troubleshooting

get_yield returns totalRuns: 0

Check that testName matches the name field in the test YAML exactly (case-sensitive).
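The stat_cache may not yet have data. The background service only processes executions that completed more than 5 minutes ago and polls every 60 seconds, so allow roughly 6 minutes after the first execution completes.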

sampleCount is lower than expected

Check definitionVersions. If it is > 1, the pooling is working and the count is correct — all versions are included. Re-read limitsNote and consider whether a from date filter is appropriate.
If definitionVersions is 1 and the count still seems low, check the from/to date filter and verify the testName / stationId filters are not over-restricting.

cp and cpk are null

The current measurement definition has no lowerLimit / upperLimit in the YAML.
Add low_limit and high_limit to the measurement definition and re-run a test.
If definitionVersions > 1: prior versions may have had limits but the current one does not. Check the YAML and confirm limits are present in the deployed version.

Statistics are stale

Check that WorkflowEngine.Orchestra is running and healthy.
Inspect Orchestra logs for StatisticsBackgroundService error messages.
Measurements from executions completed less than 5 minutes ago will not appear yet.

passCount seems wrong when limits changed

passCount and failCount reflect the verdict recorded at the time of each measurement — i.e., evaluated against the limits that were active when the test ran.
They are not re-evaluated when limits change. This is by design: the verdict is a historical fact, not a projection.
If you need to re-evaluate historical samples against the current spec, that requires a custom query against the raw MeasurementValues table.

Log search (`search_execution_logs`)

Finds log text from past executions using PostgreSQL full-text search. For full details see live-events.md.

search_execution_logs(query="timeout error")
search_execution_logs(query="calibration failure", testExecutionId="<uuid>")
search_execution_logs(query="drift", stepName="RF Calibration")

Indexing latency: Logs are indexed by EmbeddingBackgroundService in Orchestra 5–65 seconds after execution completion. If totalHits is 0 for a very recent execution, wait and retry.
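
A caller can absorb the indexing latency with a bounded retry. The Python sketch below assumes a hypothetical `search` callable that wraps search_execution_logs and returns a dictionary containing totalHits. The attempt count and delay are illustrative values chosen to cover the 65-second upper bound.

```python
import time


def search_with_retry(search, query, attempts=4, delay_s=22.0, sleep=time.sleep):
    """Retry a log search while it returns no hits.

    `search` is a hypothetical callable wrapping search_execution_logs.
    Three waits of 22 s cover the stated 65 s indexing window.
    """
    result = search(query)
    for _ in range(attempts - 1):
        if result["totalHits"] > 0:
            break
        sleep(delay_s)
        result = search(query)
    return result
```

The `sleep` parameter exists so the helper can be exercised without real delays.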