product: maestro
audience: test-developer, operator, ai-assistant
authority: normative

Troubleshooting Maestro with MCP Diagnostic Tools

When a user reports unexpected Maestro behaviour — a runner that won't respond, a test that fails for no obvious reason, pip packages that won't install, or the UI showing a station as offline — always run the diagnostic tools below before guessing or asking the user to check logs manually.

These tools give you direct, machine-readable visibility into the running stack without requiring SSH access, Docker CLI access, or manual log inspection.

Diagnostic Decision Tree

Step 1 — Is the station reachable at all?

get_system_health

database.healthy: false → PostgreSQL is down or unreachable. The station cannot store results and tests will not start.
redis.healthy: false → Redis is down. SignalR and background jobs will fail.
Both healthy → infrastructure is fine; move to Step 2.
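The Step 1 branching can be sketched in Python as follows. This is a minimal illustration, not part of Maestro: `triage_health` is a hypothetical helper, and the nested dictionary shape is an assumption inferred from the field names `database.healthy` and `redis.healthy` above.

```python
def triage_health(health: dict) -> str:
    """Map a get_system_health result to the next action in Step 1.

    Assumed shape (not confirmed by the tool reference):
    {"database": {"healthy": bool}, "redis": {"healthy": bool}}
    A missing key is treated as unhealthy.
    """
    db_ok = health.get("database", {}).get("healthy", False)
    redis_ok = health.get("redis", {}).get("healthy", False)

    problems = []
    if not db_ok:
        problems.append("PostgreSQL is down or unreachable; tests will not start")
    if not redis_ok:
        problems.append("Redis is down; SignalR and background jobs will fail")

    if problems:
        return "; ".join(problems)
    return "infrastructure is fine; move to Step 2"
```

Treating a missing key as unhealthy is a deliberately conservative choice, so that an incomplete response is never read as a healthy station.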

Step 2 — What versions are running?

get_system_version

runners: [] (empty) → no runners have registered. This means the dotnet-runner and/or python-runner containers failed to start or cannot reach the API. Proceed immediately to Step 3.
Runner listed as healthy: false → runner is registered but not responding to health checks. Proceed to Step 3 for that runner.
Versions look unexpected → the stack may not have been updated. Run get_update_status first, then trigger_system_update if needed.
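One way to turn the runner checks above into a list of services to inspect in Step 3 is sketched below. The response shape is an assumption: the source only names a `runners` list and a `healthy` flag, so the `name` key and the helper `runners_needing_logs` are hypothetical.

```python
def runners_needing_logs(version_info: dict) -> list[str]:
    """Return the runner services whose logs should be read in Step 3.

    Assumed shape: {"runners": [{"name": str, "healthy": bool}, ...]}
    The "name" key is an illustrative assumption.
    """
    runners = version_info.get("runners", [])
    if not runners:
        # No runner has registered: inspect both runner containers.
        return ["dotnet-runner", "python-runner"]
    # Registered but failing health checks: inspect only those runners.
    return [r["name"] for r in runners if not r.get("healthy", False)]
```

An empty result means every registered runner is healthy, and the diagnosis can move on to the later steps.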

Step 3 — What did the container log before failing?

get_service_logs  service="python-runner"   tail=50
get_service_logs  service="dotnet-runner"   tail=50
get_service_logs  service="api"             tail=50
get_service_logs  service="orchestra"       tail=50

Read the log tail for each affected service. Common findings:

Log pattern	Likely cause
`pipPackages declared … AllowOnlinePip is false`	pip dependencies will not install; `AllowOnlinePip` station config key is missing or `false`
`pip install … error` / `No matching distribution`	transitive pip dependency failure; check for incompatible vendored wheels
`Address already in use`	port conflict; another process is using the runner port
`Connection refused` to postgres/redis	infrastructure not yet healthy when runner started
`OOM` / process killed	container ran out of memory
`ModuleNotFoundError`	Python package not installed; dependency installer failed silently

If get_service_logs returns unavailable: true, the /station-logs volume is not mounted. This is a deployment configuration issue — see the deployment guide for the required docker-compose volume entries.
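The pattern table above can be applied mechanically to a log tail. The sketch below assumes the log text is available as a plain string. The regular expressions are illustrative approximations of the table's patterns and may need tuning against real output. `likely_causes` is a hypothetical helper.

```python
import re

# Patterns and causes mirror the table above. The regexes are assumptions
# about how the "…" gaps in the table should be matched.
LOG_PATTERNS = [
    (re.compile(r"pipPackages declared.*AllowOnlinePip is false"),
     "AllowOnlinePip missing or false"),
    (re.compile(r"pip install.*error|No matching distribution"),
     "transitive pip dependency failure"),
    (re.compile(r"Address already in use"), "port conflict"),
    (re.compile(r"Connection refused"),
     "infrastructure not yet healthy when runner started"),
    (re.compile(r"\bOOM\b"), "container ran out of memory"),
    (re.compile(r"ModuleNotFoundError"), "Python package not installed"),
]

def likely_causes(log_text: str) -> list[str]:
    """Return likely causes in order of first appearance, without duplicates."""
    found = []
    for line in log_text.splitlines():
        for pattern, cause in LOG_PATTERNS:
            if pattern.search(line) and cause not in found:
                found.append(cause)
    return found
```

Reporting causes in order of first appearance keeps the earliest failure at the top, which is usually the one to investigate first.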

Step 4 — Did any container crash or restart?

get_system_events  minutes=60

Look for die, oom, or rapid start→die cycles on any container. A container that keeps restarting will appear in get_system_version as intermittently healthy/unhealthy.

Pair with get_service_logs to read what the process logged just before the crash.

If get_system_events returns unavailable: true, the Docker socket /var/run/docker.sock is not mounted in the API container. This is a deployment configuration issue.
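Spotting repeated die or oom events per container can be sketched as a simple count. The event shape used here (`container` and `action` keys) and the threshold of two are assumptions for illustration. The actual get_system_events output may differ.

```python
from collections import Counter

def crash_suspects(events: list[dict], min_deaths: int = 2) -> dict[str, int]:
    """Count die/oom events per container; keep those at or above min_deaths.

    Assumed event shape: {"container": "python-runner", "action": "die"}
    Both key names and the default threshold are illustrative assumptions.
    """
    deaths = Counter(
        e["container"] for e in events if e.get("action") in ("die", "oom")
    )
    return {name: count for name, count in deaths.items() if count >= min_deaths}
```

A container returned by this function is a good candidate for a follow-up get_service_logs call.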

Step 5 — Is a specific test execution failing?

get_execution_logs  executionId="<id>"

or, for structured output with step names, pair it with get_step_results:

get_execution_logs  executionId="<id>"
get_step_results    executionId="<id>"  page=1  pageSize=100

Look for the first FAIL step — that is where execution stopped or deviated.
Log lines beginning with pip dependencies … could not be installed → pip issue; cross-reference with get_service_logs service="python-runner".
Log lines containing TypeError, AttributeError, ImportError → Python code error in the test package itself.
Connection refused / timeout in step logs → hardware or instrument not reachable from the runner container.
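The checks above can be sketched as two small helpers: one that finds the first FAIL step and one that classifies a single log line. The step shape (`name` and `status` keys) is an assumption, and both function names are hypothetical. The classifier uses substring matching rather than a strict prefix because log lines may carry a timestamp or level prefix.

```python
from typing import Optional

def first_failed_step(steps: list[dict]) -> Optional[dict]:
    """Return the first step with status FAIL, or None if there is none.

    Assumes steps arrive in execution order and look like
    {"name": str, "status": "PASS" | "FAIL"} (illustrative shape).
    """
    return next((s for s in steps if s.get("status") == "FAIL"), None)

def classify_log_line(line: str) -> str:
    """Map one execution log line to the categories listed above."""
    if "pip dependencies" in line and "could not be installed" in line:
        return "pip issue"
    if any(err in line for err in ("TypeError", "AttributeError", "ImportError")):
        return "Python code error in the test package"
    if "Connection refused" in line or "timeout" in line.lower():
        return "hardware or instrument not reachable"
    return "unclassified"
```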

Use search_execution_logs for keyword search across many executions:

search_execution_logs  query="pip error"
search_execution_logs  query="ImportError"
search_execution_logs  query="timeout"

Step 6 — Is station configuration correct?

get_merged_config  stationId="<station-id>"

Verify that required config keys are present and have the right values. Common keys to check:

Key	Expected value	Effect if wrong
`AllowOnlinePip`	`true` (dev) / absent (prod)	pip falls back to PyPI only when `true`
`AccordionIpAddress`	valid IP of the Accordion hub	hardware steps time out
`COM_PORT`	correct serial port	serial instrument steps fail
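A partial check of the table above might look like the sketch below. It assumes the merged config is a flat dictionary with a boolean `AllowOnlinePip` and a string `AccordionIpAddress`. The real get_merged_config output may be structured differently, and `check_config` is a hypothetical helper. `COM_PORT` is not validated here because the correct port depends on the station.

```python
import ipaddress

def check_config(config: dict, production: bool = False) -> list[str]:
    """Return human-readable findings for the keys in the table above.

    Assumes a flat dict such as
    {"AllowOnlinePip": True, "AccordionIpAddress": "192.168.1.10"}.
    """
    findings = []

    allow = config.get("AllowOnlinePip")
    if production and allow is not None:
        findings.append("AllowOnlinePip should be absent in production")
    if not production and allow is not True:
        findings.append("AllowOnlinePip is not true; pip will not fall back to PyPI")

    ip = config.get("AccordionIpAddress")
    if ip is not None:
        try:
            ipaddress.ip_address(ip)  # raises ValueError for malformed input
        except ValueError:
            findings.append(f"AccordionIpAddress is not a valid IP: {ip!r}")

    return findings
```

This only confirms that the address is well formed. Whether it is the right address for the Accordion hub still has to be checked against the station's hardware.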

Symptom Quick Reference

User says…	Start with…
"runner is unavailable" / "no runners registered"	`get_system_version` → `get_service_logs` → `get_system_events`
"test fails immediately on step 1"	`get_execution_logs` → `get_service_logs python-runner`
"pip package won't install"	`get_service_logs python-runner` → `get_merged_config` (check `AllowOnlinePip`)
"station shows offline in dashboard"	`get_system_health` → `get_system_events`
"test was passing yesterday, failing today"	`get_system_events` (crash/restart?) → `get_system_version` (update?) → `get_execution_logs`
"UI is not updating / stuck"	`get_system_health` (Redis?) → `get_service_logs api`
"update seems stuck"	`get_update_status` → `get_system_events`
"wrong package version running"	`list_packages` → `get_system_version`

Filing a Bug Report

Once you have diagnosed the problem using the steps above, file it directly from the AI session — no separate issue tracker needed:

maestro_bug_report
  title       = "<concise one-line summary>"
  reportedBy  = "<your name or station ID>"
  description = "<paste the relevant tool output here>"
  severity    = "critical" | "high" | "medium" | "low"

Collect this context before calling maestro_bug_report so the description is complete:

get_station_info — station ID and versions
get_system_health — infrastructure state
get_system_events minutes=120 — recent container events
get_service_logs service="api" tail=100
get_service_logs service="python-runner" tail=100 (if Python-related)
get_service_logs service="dotnet-runner" tail=100 (if .NET-related)
get_execution_logs executionId="<failing run>" (if a specific test failed)
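Once the outputs listed above are collected, they need to be combined into a single description string. The sketch below is one way to do that. `build_description` is a hypothetical helper, and the section layout is an arbitrary choice rather than a format that maestro_bug_report requires.

```python
import json

def build_description(context: dict) -> str:
    """Join tool outputs into one description, one labelled section per tool.

    Keys are tool invocations (for example "get_system_health"); values are
    either raw text or JSON-serialisable results.
    """
    sections = []
    for tool, output in context.items():
        if isinstance(output, str):
            body = output
        else:
            body = json.dumps(output, indent=2, sort_keys=True)
        sections.append(f"== {tool} ==\n{body}")
    return "\n\n".join(sections)
```

Labelling each section with the tool that produced it lets whoever triages the report see at a glance which diagnostics were already run.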

See tools-feedback.md for full parameter reference and severity guidance.