
product: maestro
audience: test-developer, operator, ai-assistant
authority: normative

MCP Server — Statistics Tools

The statistics tools let an AI assistant (or a human operator via the AI chat interface) answer yield, quality, and traceability questions about test data without writing SQL or opening a dashboard.


Architecture overview

AI assistant
    │  get_yield / get_measurement_statistics / …
    ▼
WorkflowEngine.McpServer  ──HTTP──▶  WorkflowEngine.Api
                                         │  GET /api/statistics/*
                                         ▼
                                    stat_cache table  (PostgreSQL)
                                         ▲
                               StatisticsBackgroundService
                               (WorkflowEngine.Orchestra, 60 s poll)
                                         │  reads
                                    TestExecutions / StepExecutions /
                                    MeasurementValues tables

Hot path: The API controller checks the stat_cache table first and serves a cached row if it is less than 60 minutes old. A warm cache returns in milliseconds. If the row is stale or absent, the controller falls back to a live query over the execution tables.
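The cache-first lookup can be sketched as follows. This is a minimal illustration in Python, not the controller's actual code: the function name read_statistic, the in-memory dict standing in for stat_cache, and the live_query callback are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

CACHE_VALIDITY = timedelta(minutes=60)  # validity window described above


def read_statistic(key, cache, live_query, now=None):
    """Return (value, source). `cache` maps key -> (value, computed_at)."""
    now = now or datetime.now(timezone.utc)
    entry = cache.get(key)
    if entry is not None:
        value, computed_at = entry
        if now - computed_at < CACHE_VALIDITY:
            return value, "cache"  # warm cache: no query over execution tables
    # Stale or absent row: fall back to a live query.
    return live_query(key), "live"
```

Injecting `now` and `live_query` keeps the sketch testable without a database or a real clock.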

Cache writer: StatisticsBackgroundService runs in Orchestra, polls every 60 seconds, and writes pre-computed yield and Cp/Cpk rows for every test definition whose executions changed since the last run. It only processes executions that completed more than 5 minutes ago to avoid racing with in-flight writes.
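The 5-minute processing guard amounts to a simple cutoff filter. The sketch below assumes each execution is a dict with a hypothetical completedAt field; the real service reads the execution tables and its field names may differ.

```python
from datetime import datetime, timedelta, timezone

PROCESSING_GUARD = timedelta(minutes=5)  # guard interval described above


def settled_executions(executions, now):
    """Keep executions that completed more than PROCESSING_GUARD ago."""
    cutoff = now - PROCESSING_GUARD
    return [
        e for e in executions
        if e["completedAt"] is not None and e["completedAt"] < cutoff
    ]
```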


Tools reference

See tools-reference.md for the full parameter table. Quick summary:

Tool | Question it answers
list_measurement_names | "What measurement points are available on this station?"
get_capability_summary | "Which measurements have poor Cpk across the whole programme?"
get_yield | "What is the pass rate for FinalTest this week?"
get_measurement_statistics | "Is the OutputVoltage measurement process capable (Cp/Cpk ≥ 1.33)?"
get_yield_trend | "How has yield trended over the last 3 months?"
get_worst_steps | "Which steps fail most often?"
get_unit_history | "What tests has serial number SN-1042 been through?"

Yield (get_yield)

Returns pass rate for a test or a single step within a test.

get_yield(testName="FinalTest")
get_yield(testName="FinalTest", stepName="RF Calibration", from="2025-01-01T00:00:00Z")
get_yield(testName="FinalTest", stationId="station-A")

Response fields:

Field | Type | Description
testName | string | Echoed from request
stepName | string? | Null for test-level yield
stationId | string? | Null for fleet-wide aggregate
totalRuns | int | Total executions in the date range
passCount | int | Executions with verdict PASS
failCount | int | totalRuns - passCount
yieldPercent | double | passCount / totalRuns × 100, rounded to 2 d.p.
from / to | DateTimeOffset? | Echoed from request
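The count fields relate to each other as shown in this small sketch. The function name summarize_yield is hypothetical, and returning None for yieldPercent when there are no runs is an assumption; the source does not say what the API returns in that case.

```python
def summarize_yield(verdicts):
    """Derive the count and percentage fields from a list of verdict strings."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v == "PASS")
    return {
        "totalRuns": total,
        "passCount": passed,
        "failCount": total - passed,
        # Assumption: no runs means no meaningful percentage.
        "yieldPercent": round(passed / total * 100, 2) if total else None,
    }
```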

Measurement name discovery (list_measurement_names)

Before querying statistics you need the exact measurementName as it is registered in the test definition. Use list_measurement_names to discover what names exist.

Results are deduplicated by name — each distinct name appears once.

list_measurement_names()                           # all names (up to 50)
list_measurement_names(search="VOUT")              # all names containing "VOUT"
list_measurement_names(search="resistance", limit=100)

Response fields per entry:

Field | Description
name | Exact name to pass to get_measurement_statistics
lowerLimit | Lower specification limit (indicative; taken from one of potentially several versions)
upperLimit | Upper specification limit (indicative)
units | Engineering unit, e.g. mV, Ω, °C
definitionVersions | Number of distinct specification versions for this name
sampleCount | Pre-computed sample count from stat_cache; null if no executions yet
cp | Pre-computed Cp from stat_cache; null if no limits, σ = 0, or no samples
cpk | Pre-computed Cpk from stat_cache; null if no limits, σ = 0, or no samples

The sampleCount, cp, and cpk fields come from the stat_cache and allow callers to identify poor-capability measurements without making N follow-up calls. They are null for measurements with no executions yet. For measurements with definitionVersions > 1 these values reflect the most-recently-computed cache entry rather than pooled statistics — always follow up with get_measurement_statistics before acting on Cp/Cpk in that case.
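A client can apply the follow-up rule above mechanically. The sketch below is one possible triage over the entries returned by list_measurement_names; the function name triage and the default threshold of 1.33 are assumptions for illustration.

```python
def triage(entries, max_cpk=1.33):
    """Split entries into (poor, verify_first) lists of measurement names.

    poor:         single-version entries whose cached Cpk is below max_cpk
    verify_first: multi-version entries, whose cached Cpk covers one version
                  only and needs a get_measurement_statistics follow-up
    """
    poor, verify_first = [], []
    for entry in entries:
        cpk = entry.get("cpk")
        if cpk is None:
            continue  # no executions, no limits, or sigma = 0
        if entry.get("definitionVersions", 1) > 1:
            verify_first.append(entry["name"])
        elif cpk < max_cpk:
            poor.append(entry["name"])
    return poor, verify_first
```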

To find all measurements below a Cpk threshold in one call, use get_capability_summary instead.

What definitionVersions tells you:

Every time the low_limit, high_limit, or unit fields are changed in the test YAML and the package is redeployed, a new definition row is inserted with the same name but a different content hash. definitionVersions counts those rows.

  • definitionVersions = 1 — the specification has never changed. Statistics are straightforward: all samples were recorded against the same limits, and Cp/Cpk is directly comparable across the entire date range.

  • definitionVersions > 1 — the specification was changed at least once. This has consequences for how statistics should be interpreted. Read the Caveat: limit changes section before proceeding.


Capability summary (get_capability_summary)

Returns all measurement points whose Cpk is below a given threshold, sorted worst-first. Served entirely from the pre-computed stat_cache — one call replaces the N+1 pattern of calling list_measurement_names followed by get_measurement_statistics for every entry when scanning a large population for quality problems.

get_capability_summary()                                       # all with Cpk < 1.33
get_capability_summary(maxCpk=1.00)                           # only actively defective
get_capability_summary(maxCpk=1.33, testName="FinalTest")     # scoped to one test
get_capability_summary(maxCpk=1.67, stationId="station-A")    # scoped to one station

When to use vs get_measurement_statistics:

Use case | Tool
"Which measurements are poorly capable?" (fleet scan) | get_capability_summary
"Is VOUT_5V capable? Give me full detail." | get_measurement_statistics
"What measurements exist?" | list_measurement_names

Response fields per item:

Field | Description
measurementName | Measurement point name
testName | Test definition this entry came from
stationId | Station; null for fleet-wide cache entries
sampleCount | Number of numeric samples in the cache row
mean / stdDev / min / max | Descriptive statistics
lowerLimit / upperLimit / units | Specification from the cache entry
cp / cpk | Capability indices (always non-null in this response; entries with null Cpk are excluded)
definitionVersions | Number of specification versions; a value > 1 signals that limits changed, so follow up with get_measurement_statistics for pooled analysis
computedAt | UTC timestamp when this cache row was last written

Important: When a measurement name appears in multiple tests, only the row with the lowest Cpk (worst case) is returned. The testName field identifies which test produced it.

definitionVersions caveat: The stat_cache stores per-definition statistics rather than pooled. For definitionVersions > 1, the Cp/Cpk here is for one definition version only. Use get_measurement_statistics (which pools all versions) for the authoritative figure before concluding the process is or is not capable.
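The selection rules described above (threshold filter, worst case per name, worst-first ordering) can be sketched as follows. This is an illustration over plain dicts, not the endpoint's implementation; the function name is hypothetical.

```python
def capability_summary(rows, max_cpk=1.33):
    """Return one row per measurement name with Cpk below max_cpk, worst first."""
    worst = {}
    for row in rows:
        cpk = row["cpk"]
        if cpk is None or cpk >= max_cpk:
            continue  # null-Cpk entries and capable entries are excluded
        name = row["measurementName"]
        # Keep only the lowest-Cpk row when a name appears in several tests.
        if name not in worst or cpk < worst[name]["cpk"]:
            worst[name] = row
    return sorted(worst.values(), key=lambda r: r["cpk"])
```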


Measurement statistics (get_measurement_statistics)

Returns descriptive statistics and process capability indices for a named numeric measurement point. Non-numeric measurement points (boolean, string) are ignored.

Samples are always pooled across all specification versions of the name. This means sampleCount always reflects the complete historical population — there is no silent truncation when limits changed.

Typical workflow:

  1. Call list_measurement_names(search="...") to find the exact name and check definitionVersions.
  2. If definitionVersions > 1, read the caveat below before interpreting results.
  3. Call get_measurement_statistics(measurementName="...") to get statistics.
  4. If limitsNote is non-null, show it to the user verbatim before presenting Cp/Cpk.
  5. Optionally narrow the sample set with testName, stationId, from, to.

get_measurement_statistics(measurementName="OutputVoltage_mV")
get_measurement_statistics(measurementName="OutputVoltage_mV", testName="PowerSupplyTest", from="2025-06-01T00:00:00Z")
get_measurement_statistics(measurementName="VOUT_5V", stationId="station-A")

Cp and Cpk explained

Process capability indices answer the question: "Given how much this measurement naturally varies, how likely is it to stay within specification?"

Cp measures how wide the specification window is relative to the natural process spread:

Cp = (USL − LSL) / (6σ)

A Cp of 1.0 means the specification window is exactly 6σ wide: in a perfectly centred, normally distributed process, 99.73 % of parts would pass. A Cp of 1.33 means the window is 8σ wide, so each limit sits 4σ from the mean of a centred process (the standard production target).

Cpk additionally penalises a process that is off-centre — it tells you the actual margin to the nearest limit:

Cpk = min( (USL − μ) / (3σ),  (μ − LSL) / (3σ) )

If Cp > Cpk the process is off-centre. A process can have a high Cp (wide spec) but a low Cpk (mean drifted close to one limit), which will produce failures even though the tolerance is generous.
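The two formulas translate directly into code. The sketch below uses the population standard deviation and returns None for both indices when a limit is missing or σ = 0; the function name capability is an assumption, and this is not the server's code.

```python
from statistics import fmean, pstdev


def capability(samples, lsl, usl):
    """Return (cp, cpk), or (None, None) when the indices are undefined."""
    if lsl is None or usl is None or not samples:
        return None, None
    mu = fmean(samples)
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return None, None  # all samples identical
    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    return cp, cpk
```

For samples [1, 3] the mean is 2 and σ is 1, so limits of -1 and 8 give Cp = 9 / 6 = 1.5 but Cpk = min(6 / 3, 3 / 3) = 1.0: the window is generous, yet the mean sits closer to the lower limit.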

Value | Interpretation
< 1.00 | Process is not capable: defects expected even with a centred process
1.00 – 1.33 | Marginally capable: monitor closely, little room for drift
1.33 – 1.67 | Capable: standard production target
≥ 1.67 | Highly capable: at least 5σ of margin to the limits
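The bands in the table can be encoded as a small helper so that every answer uses the same labels. The function name interpret and the "not available" label for null indices are assumptions.

```python
def interpret(index):
    """Map a Cp or Cpk value to the interpretation bands in the table above."""
    if index is None:
        return "not available"
    if index < 1.00:
        return "not capable"
    if index < 1.33:
        return "marginally capable"
    if index < 1.67:
        return "capable"
    return "highly capable"
```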

When Cp/Cpk are applicable:

  • Both lowerLimit and upperLimit must be defined in the test YAML (measurement: block).
  • At least a few dozen samples are needed for meaningful indices (σ becomes unstable below ~30 samples).
  • σ must be > 0 (all-identical samples indicate a stuck sensor, not a real process).

Both values are null when a limit is missing or σ = 0. The sample-size figure is guidance for interpreting the indices, not a null condition.

How to use the results:

Observation | Suggested action
Cp ≥ 1.33, Cpk ≥ 1.33 | Process is well-controlled; no action needed
Cp ≥ 1.33, Cpk < 1.00 | Spec is wide enough but the mean has drifted; re-centre the process
Cp < 1.00 | Process spread is too wide; tighten the process, not the spec
cp / cpk are null | Add low_limit / high_limit to the measurement YAML, or investigate σ = 0

Cp and Cpk are null when:

  • The measurement definition has no lower or upper limit, or
  • σ = 0 (all samples are identical — likely a fixture or software issue)

Response fields:

Field | Type | Description
measurementName | string | Measurement definition name
sampleCount | int | Total numeric samples across all definition versions
mean | double | Arithmetic mean of all pooled samples
stdDev | double | Population standard deviation of all pooled samples
min / max | double | Observed extremes across all pooled samples
lowerLimit / upperLimit | double? | Specification limits from the most recently active definition
units | string? | Engineering unit from the most recently active definition
cp | double? | Process capability computed against current limits (null if limits unavailable or σ = 0)
cpk | double? | Process capability index computed against current limits
passCount / failCount | int | Verdicts on individual measurement values across all versions
definitionVersions | int | Number of distinct specification versions for this name
limitsNote | string? | Non-null when definitionVersions > 1. Explains the pooling and which spec was used. Always show to the user.

Yield trend (get_yield_trend)

Returns yield bucketed by day, week, or month.

get_yield_trend(testName="FinalTest", bucket="week", from="2025-01-01T00:00:00Z")

Each point in trend.points[]:

Field | Description
bucketStart | UTC start of the bucket
bucketLabel | Human-readable label, e.g. "W23 2025", "Jun 2025", "2025-06-04"
totalRuns | Executions in this bucket
passCount | Passing executions
yieldPercent | Pass rate for this bucket
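Bucketing can be illustrated with the sketch below. It assumes weeks follow ISO-8601 numbering and start on Monday, which is consistent with the "W23 2025" label for early June 2025 but is not confirmed by the source; the function names and the (timestamp, verdict) input shape are likewise assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def bucket_start(ts, bucket):
    """UTC start of the day, ISO week (Monday), or month containing `ts`."""
    day = ts.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0)
    if bucket == "day":
        return day
    if bucket == "week":
        return day - timedelta(days=day.weekday())
    if bucket == "month":
        return day.replace(day=1)
    raise ValueError(f"unknown bucket: {bucket}")


def bucket_label(start, bucket):
    if bucket == "day":
        return start.strftime("%Y-%m-%d")
    if bucket == "week":
        iso_year, iso_week, _ = start.isocalendar()
        return f"W{iso_week} {iso_year}"
    return start.strftime("%b %Y")


def yield_trend(executions, bucket):
    """executions: iterable of (timezone-aware timestamp, verdict) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bucket start -> [totalRuns, passCount]
    for ts, verdict in executions:
        slot = counts[bucket_start(ts, bucket)]
        slot[0] += 1
        slot[1] += verdict == "PASS"
    return [
        {"bucketStart": start, "bucketLabel": bucket_label(start, bucket),
         "totalRuns": total, "passCount": passed,
         "yieldPercent": round(passed / total * 100, 2)}
        for start, (total, passed) in sorted(counts.items())
    ]
```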

Worst steps (get_worst_steps)

Returns the steps with the lowest pass rate, ranked ascending.

get_worst_steps(testName="FinalTest", top=5)
get_worst_steps(testName="FinalTest", stationId="station-A", from="2025-06-01T00:00:00Z")

Use this tool to focus engineering attention: a step with 60 % yield is a stronger cost-reduction candidate than a step with 95 % yield.
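The ranking itself is a group-and-sort over step verdicts. The sketch below is illustrative only: the response field names and the default of top=5 are hypothetical, since the source does not list the response fields for this tool.

```python
from collections import defaultdict


def worst_steps(step_results, top=5):
    """step_results: iterable of (stepName, verdict) pairs."""
    counts = defaultdict(lambda: [0, 0])  # stepName -> [totalRuns, passCount]
    for step, verdict in step_results:
        counts[step][0] += 1
        counts[step][1] += verdict == "PASS"
    ranked = [
        {"stepName": step, "totalRuns": total,
         "yieldPercent": round(passed / total * 100, 2)}
        for step, (total, passed) in counts.items()
    ]
    ranked.sort(key=lambda r: r["yieldPercent"])  # ascending: worst first
    return ranked[:top]
```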


Unit history (get_unit_history)

Returns all test runs for a serial number, newest first.

get_unit_history(serialNumber="SN-1042")
get_unit_history(serialNumber="SN-1042", page=2, pageSize=10)

Use this tool to answer "why did this unit fail?" or to reconstruct the full test history for a warranty return.


Caveat: specification limit changes split the population

This section explains what happens when lowerLimit or upperLimit are changed in the test YAML and the package is redeployed.

What happens at the database level

Maestro uses content-hash identity for measurement definitions. Every time a test package is loaded, the YAML content of each measurement block is hashed. If the hash already exists in MeasurementDefinitions, the existing row is reused. If the hash is new (because limits changed), a new row is inserted with the same name but a different Id and different lowerLimit/upperLimit.

Values recorded before the change reference the old definition Id; values recorded after it reference the new one. Both rows share the same name, but querying by Id returns non-overlapping populations.
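Content-hash identity can be illustrated with a short sketch. The hash function (SHA-256 over canonical JSON), the dict standing in for the MeasurementDefinitions table, and the function names are all assumptions; the source only states that the YAML content of each measurement block is hashed.

```python
import hashlib
import json


def content_hash(block):
    """Hash a measurement block; key order must not affect the result."""
    canonical = json.dumps(block, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_definition(table, block):
    """Reuse the row for a known hash, otherwise insert a new row."""
    key = content_hash(block)
    if key not in table:
        table[key] = {"id": len(table) + 1, **block}
    return table[key]
```

Loading the same block twice returns the same row, while changing a limit produces a second row with the same name and a new id.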

What this means for statistics

Earlier versions of get_measurement_statistics used FirstOrDefault when resolving the definition, so they picked a database-order-dependent row and silently queried only the samples attached to that one definition. A programme with limits tightened at month 3 of 6 could return half the true sample population with no warning.

The endpoint now:

  1. Loads all definition rows for the name.
  2. Queries values from all of them (the full population).
  3. Identifies the most recently active definition (the one whose values carry the latest timestamp) and uses its limits for lowerLimit, upperLimit, Cp, and Cpk.
  4. Returns definitionVersions (the count) and limitsNote (a plain-English explanation when definitionVersions > 1).
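The four steps above can be sketched over in-memory rows. This is a simplified illustration, not the endpoint's code: the dict shapes, the function name, and the wording of the note are assumptions, and the real limitsNote text differs.

```python
from statistics import fmean, pstdev


def pooled_statistics(definitions, values):
    """definitions: dicts with id, lowerLimit, upperLimit.
    values: dicts with definitionId, value, timestamp."""
    ids = {d["id"] for d in definitions}               # 1. all definition rows
    pooled = [v for v in values if v["definitionId"] in ids]  # 2. full population
    if not pooled:
        return None
    latest = max(pooled, key=lambda v: v["timestamp"])  # 3. most recently active
    current = next(d for d in definitions if d["id"] == latest["definitionId"])
    samples = [v["value"] for v in pooled]
    mu, sigma = fmean(samples), pstdev(samples)
    lsl, usl = current["lowerLimit"], current["upperLimit"]
    cp = cpk = None
    if lsl is not None and usl is not None and sigma > 0:
        cp = (usl - lsl) / (6 * sigma)
        cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    versions = len(definitions)                         # 4. count and note
    note = None
    if versions > 1:
        # Placeholder wording for illustration only.
        note = f"{len(samples)} samples pooled across {versions} definition versions."
    return {"sampleCount": len(samples), "mean": mu, "stdDev": sigma,
            "lowerLimit": lsl, "upperLimit": usl, "cp": cp, "cpk": cpk,
            "definitionVersions": versions, "limitsNote": note}
```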

How to use the response correctly

Step 1 — check definitionVersions

{ "definitionVersions": 2, "sampleCount": 1924, ... }

If definitionVersions is 1, nothing special. If it is 2 or more, proceed to step 2.

Step 2 — read and show limitsNote verbatim

{
  "limitsNote": "Specification limits for 'VOUT_5V' changed 1 time(s) during the programme,
    creating 2 definition versions. All 1924 samples are pooled across all versions.
    The lowerLimit, upperLimit, and Cp/Cpk values reflect the most recently active
    specification only. Samples recorded under older specifications may have been
    accepted or rejected against different limits."
}

Always present this text to the user before showing Cp/Cpk. Do not paraphrase or omit it. The user needs to know that some historical samples were gated against different limits before deciding whether the pooled Cp/Cpk is meaningful for their purpose.

Step 3 — decide whether the pooled statistics answer the question

Ask: "Was the limit change intentional process knowledge (tighter spec, same process) or a correction of an error (wrong limits deployed for a period)?"

Scenario | Appropriate action
Limits were tightened as the process matured | Pooled statistics are conservative: Cp/Cpk may be lower than the process actually achieves under the current spec. Consider adding a from date filter to isolate the current-spec period.
An incorrect limit was deployed by mistake | The pooled sampleCount is inflated. Use from/to to filter to the correct-spec period only.
Limits were loosened (relaxed) | Some historical pass/fail verdicts no longer reflect the current gate. The passCount/failCount in the response reflects the verdict at the time of measurement, not re-evaluated against current limits.

Step 4 — narrow the sample set if needed

If the user needs statistics for the current specification only, filter by date:

get_measurement_statistics(
  measurementName="VOUT_5V",
  from="<date the new spec was deployed>T00:00:00Z"
)

This restricts TestExecution.StartTime to the period from that date onward, which isolates the current-spec population.

How to present data from a multi-version measurement to users

When reporting on a measurement where definitionVersions > 1, a well-formed response looks like this:

VOUT_5V — 1924 samples pooled across 2 specification versions.

Mean: 4.972 V | StdDev: 0.031 V | Min: 4.891 V | Max: 5.048 V

Current specification: 4.75–5.25 V → Cp: 1.07, Cpk: 0.96 (marginally not capable: Cpk is just below 1.00)

⚠️ Note: Specification limits changed once during the programme. These statistics cover the full population (both before and after the change). Cp/Cpk reflect the current specification only. If you need statistics for the current spec period alone, filter using the from parameter with the date the new spec was deployed.

If the user asks a direct question like "Is VOUT_5V capable?", the correct answer is:

"Based on the pooled population of 1924 samples, Cpk is 0.96 against the current spec — marginally not capable. However, the limits were changed once during the programme, so some of these samples were gated against a different specification. To get a precise answer for the current spec, provide the date from which the current limits have been in use."

Never answer "Yes, it is capable" or "No, it is not capable" without disclosing definitionVersions > 1 when it applies.
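A client can enforce the disclosure rule in code rather than relying on convention. The sketch below builds an answer from a get_measurement_statistics response and always places limitsNote, verbatim, ahead of any Cpk figure; the function name and the sentence wording are assumptions.

```python
def present_capability(stats):
    """Build a user-facing answer from a get_measurement_statistics response."""
    lines = []
    if stats.get("definitionVersions", 1) > 1 and stats.get("limitsNote"):
        lines.append(stats["limitsNote"])  # verbatim, before any Cp/Cpk figure
    name = stats["measurementName"]
    cpk = stats.get("cpk")
    if cpk is None:
        lines.append(f"{name}: Cp/Cpk are not available.")
    else:
        lines.append(
            f"{name}: Cpk {cpk:.2f} over {stats['sampleCount']} pooled samples.")
    return "\n".join(lines)
```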


Example AI prompts

The following prompts work well with Claude, GitHub Copilot, or any MCP-compatible client connected to a running Maestro MCP server.

Yield analysis

What was the yield for FinalTest last month?
Which step in FinalTest failed most often in the last 30 days?
Show me the yield trend for FinalTest week by week since January.

Measurement name discovery

What measurement points are available on this station?
List all voltage-related measurements I can analyse.

Measurement quality

Is OutputVoltage_mV in control for the PowerSupplyTest? Give me Cp and Cpk.
What were the min, max, and mean for ContactResistance_Ohm last week?
Which measurements have a Cpk below 1.33? Start by listing all measurement names.
Show me all measurements that are not process capable (use get_capability_summary).
Which measurements are actively producing defects (Cpk < 1.00)?

Unit traceability

What tests has serial number SN-1042 completed? Did it ever fail?

Background service

StatisticsBackgroundService (in WorkflowEngine.Orchestra) continuously pre-computes statistics so that API responses are fast even for large datasets.

Behaviour | Detail
Poll interval | 60 seconds
Processing guard | Only processes executions completed > 5 minutes ago
Watermark | Uses the latest ComputedAt timestamp in stat_cache as a cursor to avoid reprocessing old data
Cache validity | API controller serves cached data if less than 60 minutes old
Upsert strategy | Uses EF FirstOrDefault + update; unique index on (TestName, StepName, MeasurementName, StationId, BucketType, BucketDate) prevents duplicates

The service writes the following row types per test definition:

Row type | StepName | MeasurementName | Description
Test yield (fleet-wide) | null | null | Overall pass rate, all stations
Test yield (per station) | null | null | Pass rate per StationId
Step yield (fleet-wide) | set | null | Pass rate per step
Measurement statistics | null | set | Mean, σ, Cp, Cpk per measurement
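The find-then-update upsert keyed on the unique index columns can be sketched with a dict. This illustrates the idea only: the real service uses EF Core against PostgreSQL, and the dict-based cache and function name here are assumptions.

```python
# Columns of the unique index listed in the behaviour table.
KEY_FIELDS = ("TestName", "StepName", "MeasurementName",
              "StationId", "BucketType", "BucketDate")


def upsert(cache, row):
    """Update the row with the same key if present, otherwise insert it."""
    key = tuple(row.get(field) for field in KEY_FIELDS)
    existing = cache.get(key)
    if existing is None:
        cache[key] = dict(row)
    else:
        existing.update(row)
    return cache[key]
```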

Troubleshooting

get_yield returns totalRuns: 0

  • Check that testName matches the name field in the test YAML exactly (case-sensitive).
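  • The stat_cache may not yet have data. The background service only processes executions that completed more than 5 minutes ago and polls every 60 seconds, so allow roughly 6 minutes after the first execution completes.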

sampleCount is lower than expected

  • Check definitionVersions. If it is > 1, the pooling is working and the count is correct — all versions are included. Re-read limitsNote and consider whether a from date filter is appropriate.
  • If definitionVersions is 1 and the count still seems low, check the from/to date filter and verify the testName / stationId filters are not over-restricting.

cp and cpk are null

  • The current measurement definition has no lowerLimit / upperLimit in the YAML. Add low_limit and high_limit to the measurement definition and re-run a test.
  • σ = 0: all samples are identical, which usually points to a fixture or software issue rather than a real process.
  • If definitionVersions > 1: prior versions may have had limits but the current one does not. Check the YAML and confirm limits are present in the deployed version.

Statistics are stale

  • Check that WorkflowEngine.Orchestra is running and healthy.
  • Inspect Orchestra logs for StatisticsBackgroundService error messages.
  • Measurements from executions completed less than 5 minutes ago will not appear yet.

passCount seems wrong when limits changed

  • passCount and failCount reflect the verdict recorded at the time of each measurement — i.e., evaluated against the limits that were active when the test ran.
  • They are not re-evaluated when limits change. This is by design: the verdict is a historical fact, not a projection.
  • If you need to re-evaluate historical samples against the current spec, that requires a custom query against the raw MeasurementValues table.
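Once the raw values have been exported, re-evaluation is a simple comparison against the current limits. The sketch below works on a plain list of numbers and treats both limits as inclusive, which is an assumption; check how the test runner gates boundary values before relying on it.

```python
def reevaluate(values, lower_limit, upper_limit):
    """Count how many historical samples would pass against the given limits."""
    passed = sum(1 for v in values if lower_limit <= v <= upper_limit)
    return {"passCount": passed, "failCount": len(values) - passed}
```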

Log search (search_execution_logs)

Finds log text from past executions using PostgreSQL full-text search. For full details see live-events.md.

search_execution_logs(query="timeout error")
search_execution_logs(query="calibration failure", testExecutionId="<uuid>")
search_execution_logs(query="drift", stepName="RF Calibration")

Indexing latency: Logs are indexed by EmbeddingBackgroundService in Orchestra 5–65 seconds after execution completion. If totalHits is 0 for a very recent execution, wait and retry.
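The wait-and-retry advice can be wrapped in a small helper. The sketch below takes the search function as a parameter, so it makes no assumption about how the MCP client is invoked; the helper name, attempt count, and wait interval are illustrative choices.

```python
import time


def search_with_retry(search, query, attempts=3, wait_seconds=35.0,
                      sleep=time.sleep):
    """Call `search(query)` again while totalHits is 0, up to `attempts` calls."""
    result = search(query)
    for _ in range(attempts - 1):
        if result["totalHits"] > 0:
            break
        sleep(wait_seconds)  # give the indexer time to catch up
        result = search(query)
    return result
```

With the defaults, the helper waits up to 70 seconds in total, which covers the 5–65 second indexing window stated above.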
