Last reviewed: 2026-06-19
Language: English | Russian
Query Doctor is a local-first Big Data query diagnostic tool focused today on Apache Impala production triage. It helps operators rank suspicious Recent queries, collect bounded profile context, derive deterministic evidence, and generate validated reports without exposing raw SQL or raw profiles in trusted browser/report surfaces.
Core rule:
Python owns facts. LLM owns wording only.
Recent scan is the flagship workflow. Query ID diagnosis is secondary for one known Impala query, with a local Trino Beta lane for bounded retained-list Recent diagnosis and one explicit Query ID when the required coordinator contracts are configured. Query Optimizer is separate, read-only, and does not execute or echo submitted SQL.
python -m pip install query-doctor
query-doctor-self-test
query-doctor-analyze \
--profile-text ./exported-impala-profile.txt \
--out cases/cm-corpus
query-doctor-web --corpus-dir cases/cm-corpusRun query-doctor-self-test after installation to verify the installed console
scripts, synthetic demo generation, one-profile analysis, local web rendering,
Impala Web UI filename fallback, deterministic report generation, and corpus
smoke path. It uses synthetic local data only and does not contact Cloudera
Manager, impalad, Spark, Trino, Prometheus, Ollama, or external LLM services.
Package and release CI also run a README Quickstart smoke against a clean wheel
install: query-doctor-self-test, query-doctor-analyze --profile-text ./exported-impala-profile.txt --out cases/cm-corpus, and
query-doctor-web --corpus-dir cases/cm-corpus.
The profile-analysis path needs one exported Impala text profile and no
Cloudera Manager, Kerberos, local config, browser upload, Prometheus, or LLM. A
direct Impala Web UI download named profile_<query-id-high>_<query-id-low> can
be used as-is. The web UI opens staged cases from --corpus-dir automatically.
See Pick A First Path for the demo and Cloudera Manager
options.
Query Doctor is:
- a local-first Impala production triage workbench;
- a deterministic evidence extractor;
- a Recent-query ranking workflow for operators and administrators;
- a safe report generator using validated facts;
- a practical tool for deciding what to inspect, change, and verify next;
- a Big Data SQL/lakehouse diagnostics wedge whose production triage engine is Apache Impala today, with bounded raw-free future-engine preview seams.
Query Doctor is not:
- a generic AI chatbot over raw profiles;
- a replacement for the Impala Web UI;
- a tool that executes user SQL or optimizer draft SQL;
- a tool that sends raw SQL/profile data to remote services by default;
- a root-cause oracle;
- a live multi-engine query collector today.
- Turns one exported Apache Impala text profile into a local deterministic diagnosis without Cloudera Manager, Kerberos, metadata, Prometheus, browser upload, or an LLM provider.
- Scans completed Recent queries as the flagship production workflow, with Running queries and one explicit Known Query ID as focused secondary modes.
- Works with Cloudera Manager when available, or with bounded direct Impala daemon endpoints for non-Cloudera-Manager Impala clusters.
- Optionally adds bounded Prometheus runtime summaries for direct Impala
workflows and bounded read-only Impala metadata through
impala-shell. - Ranks suspicious cases and action candidates from deterministic analyzer facts, not LLM scoring.
- Presents Details as an analyst decision page: why the query matters, where to inspect, what to try, how to verify a comparable rerun, and what evidence is missing.
- Folds validated selected-case optimizer guidance into the same Recommended change area when available. Report generation remains an explicit selected-case action, and optimizer generation is offered only when deterministic rewrite support marks the case safe to attempt.
- Generates trusted reports only after deterministic normalization, sanitization, and validation.
- Provides a separate read-only Query Optimizer workflow for pasted SQL review, plus explicit selected-case optimizer actions for server-owned analyzed cases.
- Keeps raw SQL, raw profiles, raw metadata, local paths, secrets, subprocess output, model/runtime internals, and raw artifact filenames out of browser and trusted report surfaces.
| Surface | Current status |
|---|---|
| Query engine | Apache Impala is the production triage engine. |
| First-value intake | One local exported Impala text profile can be staged, redacted, analyzed, and opened from Known Query ID. |
| Recent scan | Cloudera Manager is the full Recent discovery/profile/metrics/events provider for Impala workflows. |
| Direct Impala | Bounded Recent scans, Running scans, and one Known Query ID through impalad daemon endpoints; no Cloudera Manager events and no SQL execution. |
| Runtime metrics | Optional bounded Prometheus summaries for configured direct Impala workflows; no arbitrary PromQL from users. |
| Metadata | Read-only allowlisted Impala metadata statements through impala-shell; no user SQL execution or unbounded metadata crawl. |
| Reports and optimizer | Python-owned facts and validation. Known Query ID prepares the deterministic Python report in its explicit submit job; LLM narratives remain explicit selected-case actions, and optimizer actions are shown only for cases with safe-to-attempt rewrite support. |
| Trino Beta | Local web beta can read one bounded retained pruned coordinator query list for Recent diagnosis, then bounded pruned coordinator QueryInfo payloads for selected rows or one explicit Query ID, and render deterministic compact diagnosis. No Running scans, query-history crawling, metadata collection, Details/trusted report output, optimizer behavior, generated Trino SQL, SQL execution, or production support. |
| Spark | Bounded compact support surfaces only. Spark is not production engine support, live Recent scans, Details/trusted report output, optimizer behavior, raw event-log handling, Spark job execution, or Query Doctor-generated SQL. |
Trino preview surfaces include offline or compact raw-free imports and checks:
sanitized evidence packages, bounded local compact imports, explicit
source-contract checks, and bounded pruned QueryInfo paths documented in the
engine docs. The only product-facing Trino surfaces are local web Trino Beta
retained-list Recent diagnosis and One Query ID diagnosis. Both require
trino_beta_enabled=true, trino_coordinator_url, and
trino_query_info_source_contract in local config; Recent also requires
trino_query_list_source_contract. Startup validation checks local source
contracts, safe coordinator URL shape, and optional auth reference
(trino_auth_header_file or local Kerberos/SPNEGO settings) before the lane is
marked configured. Configured Trino Beta sources are marked as
Trino Beta Recent + One Query ID or Trino Beta One Query ID in the source
selector. The Diagnose Engine control narrows the Source cluster selector to
Impala-capable sources or Trino Beta-ready sources before workflow selection,
and stale or forged Trino submits still fail closed before analysis or async
job creation.
Coordinator URL, auth header references, raw QueryInfo, raw SQL, and local paths
stay out of the browser. Broader Trino live collection remains unsupported.
Spark compact support surfaces are limited to bounded compact History Server
intake, compact evidence-package build/validation, and compact diagnosis; there
is no public Spark engine support.
Future Big Data SQL/lakehouse live collectors, broader providers, prepared event/log sources, and Cluster Doctor workflows remain roadmap seams, not current support. For the detailed Trino and Spark preview command catalog, use docs/engines/README.md and docs/engine-support-gap-matrix.md.
Direct Impala Recent and Running scans currently see only the query history
exposed by the configured coordinator daemon query-list endpoints. Upstream
Impala keeps the coordinator query log at --query_log_size=200 entries by
default, further bounded by --query_log_size_in_bytes. Operators who need
deeper direct history can increase those Impala daemon settings on each
coordinator, while watching coordinator Web UI memory and /queries response
latency. Future deeper-history options are deliberately separate sources:
operator-managed read-only profile-log directory ingestion, or bounded external
history sources such as Loki or OpenSearch. They require explicit source
contracts, allowlists, byte/window bounds, and raw-free browser/report output;
the current product does not read coordinator filesystems, pod filesystems, or
external log indexes for direct Recent scans.
Apache Impala also has upstream work around native AI query profile analysis. Query Doctor aligns with that direction by staying focused on local-first production triage across many queries, deterministic evidence, safe enrichment, and validated raw-free reports. See docs/upstream-impala-ai-analyzer.md.
Install the current public package from PyPI:
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install query-doctor
query-doctor-self-testquery-doctor-self-test is the installed-package confidence check. It uses
synthetic local data to exercise packaged console scripts, one-profile analysis,
Impala Web UI filename fallback, local web rendering, deterministic reports,
and corpus smoke without contacting Cloudera Manager, impalad, Spark, Trino,
Prometheus, Ollama, or external LLM services.
For local development from a checkout:
python3 -m venv .venv
. .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e ".[dev]"
pre-commit installLocal JSON configuration is documented in docs/configuration.md.
The preferred workstation path is ~/.qdcreds/query-doctor-config.json;
secrets stay in environment variables or local env files. Start from
query-doctor-config.minimal.example.json for a Cloudera Manager Impala
workflow, then use query-doctor-config.example.json only when you need the
advanced direct-Impala, Prometheus, metadata, or LLM routing fields.
Use the smallest path that matches the access you have.
| Door | Use when | Starts from |
|---|---|---|
| One exported profile | You can get one Impala Web UI text profile, but cannot grant live access yet. | query-doctor-analyze --profile-text or query-doctor-web with manual_profile_dir |
| Synthetic demo | You want a read-only local click-through with no real data. | query-doctor-web --public-demo |
| Minimal CM scan | You have read-only Cloudera Manager access for an Impala service. | query-doctor-web or query-doctor-batch-recent |
The lowest-setup path is one exported Apache Impala text profile to one local diagnosis. This does not contact Cloudera Manager or impalad, does not require Kerberos, metadata collection, Prometheus, or an LLM provider, and does not upload the raw profile through the browser.
query-doctor-analyze \
--profile-text ./exported-impala-profile.txt \
--out cases/cm-corpusThe command stages a collector-shaped local case under cases/cm-corpus,
redacts users, hosts, credentials, and common secret forms by default, writes
analysis_facts.md plus analysis.json, and prints the output case directory.
Use --redact-identifiers when the staged local artifacts may be shared. The
manual profile intake accepts exported text profiles only; JSON, Thrift, and
profile-v2 payloads remain outside this entry path. The CLI uses the Query ID
header from the exported profile, or the downloaded Impala Web UI filename when
it has the strict profile_<query-id-high>_<query-id-low> shape. If neither is
readable, add --query-id <query-id>; when multiple Query ID sources are
present, they must match before the local case is written.
To inspect staged cases in the local UI, start query-doctor-web --corpus-dir cases/cm-corpus from the same workspace. The Diagnose page opens an Exported
Profiles results table from complete manual-profile cases in that corpus
without requiring Cloudera Manager settings, credentials, or default local
config. You can still choose One Query ID and enter the Query ID from a staged
profile to reopen that exact case. LLM narrative and optimizer actions remain
explicit buttons.
You can also configure a local profile inbox for the web UI. Put the exported
text profile in manual_profile_dir using the Query ID slug as the file name
(for example, replace the Query ID separator with _ and save
<query-id-slug>.txt), start query-doctor-web, choose One Query ID, and
enter the original Query ID. The web path stages and analyzes the local file
through the same text-only, bounded, redacted analyzer path; it does not upload
the raw profile through the browser. If the file contains an embedded Query ID
for a different query, staging fails closed before replacing any existing case.
For a self-contained one-profile workspace, set both paths in an ignored local
config file and keep generated cases outside the source tree:
{
"manual_profile_dir": "/path/to/profile-inbox",
"corpus_dir": "/path/to/query-doctor-cases",
"no_llm": true
}Then start query-doctor-web --config ./query-doctor-one-profile.json.
Relative corpus_dir values in config resolve from the config file; the
--corpus-dir CLI flag resolves relative paths from the current directory.
When neither is set, the web UI stores generated Query ID cases under
./cases/cm-corpus from the directory where you started query-doctor-web.
Profile text does not include a Query ID: keep the original Impala Web UI download name when it has the strictprofile_<query-id-high>_<query-id-low>shape, or pass--query-id <query-id>. Query Doctor also accepts aQuery ID:header inside the text export. If multiple Query ID sources are present, they must match.Parsed operators: 0: the case is still staged and can open in the UI, but that text export did not include a parseableExecSummary/operator table. Use the preserved Impala text profile export when available; JSON, Thrift, and profile-v2 payloads are outside this manual profile path.query-doctor-web --corpus-dir cases/cm-corpusasks for Cloudera Manager settings: confirm thatquery-doctor-analyzewrote a complete case under the same corpus directory you pass to web, and run web from the same workspace or use an absolute--corpus-dir.
The synthetic demo is the fastest way to see the product. It is deterministic, local-only, and contains no real SQL, profiles, metadata, hostnames, users, or credentials.
query-doctor-web --public-demoThis one-command mode is documented in docs/demo-mode.md. It generates the synthetic demo pack in a dedicated temp directory, forces Python-only mode, ignores default local config, and blocks all POST actions.
If you need to inspect or reuse the generated pack manually, use the lower-level commands:
query-doctor-demo-preflight
DEMO_PACK="${TMPDIR:-/tmp}/query-doctor-demo-pack"
query-doctor-demo --out "$DEMO_PACK" --overwrite
QUERY_DOCTOR_ACTION_OUTCOMES_PATH="$DEMO_PACK/action_outcomes.jsonl" \
query-doctor-web --host 127.0.0.1 --port 8766 --batch-summary "$DEMO_PACK/batch_summary.json"Open the localhost URL printed by query-doctor-web. Start with
/?query_group=workloads#scan-context to show the compact Scan context
workload follow-up links and local synthetic action outcomes before opening
Workload Details.
When a Recent summary has repeated safe row-level workload fingerprints but no
materialized workload payload, the UI derives bounded repeated workload details
from the sanitized rows; those details have no baseline or regression claim
until local history evidence is available.
The local web UI starts with a bounded search form and renders synthetic Finished Queries results for review:
The synthetic demo pack contains eleven sanitized Impala cases covering workload follow-up, repeated patterns, trusted optimizer recommendations, stats maintenance, storage/HDFS follow-up, frequent-short workloads, mixed signals, unknown but useful limited evidence, and direct-Impala compatibility. It also includes two read-only raw-free Trino Beta demo cases rendered from static compact diagnosis facts, without contacting a Trino coordinator or enabling Details, reports, optimizer behavior, generated SQL, or SQL execution. See docs/demo-cases.md for the full scenario list and talk track.
Use this when you have read-only Cloudera Manager access for an Impala service.
Keep secrets in the shell environment or a local env file, not in JSON config.
Create ~/.qdcreds/cm-ro.env with CM_USERNAME plus CM_PASSWORD or
CM_TOKEN before sourcing it.
mkdir -p ~/.qdcreds
cp query-doctor-config.minimal.example.json ~/.qdcreds/query-doctor-config.json
# Edit ~/.qdcreds/query-doctor-config.json with CM URL, cluster, service, and CA bundle if needed.
set -a
source ~/.qdcreds/cm-ro.env
set +a
query-doctor-web \
--config ~/.qdcreds/query-doctor-config.json \
--host 127.0.0.1 \
--port 8765For a headless bounded Recent scan without automatic LLM reports:
query-doctor-batch-recent \
--config ~/.qdcreds/query-doctor-config.json \
--recent-window-minutes 60 \
--triage-profile-limit 10 \
--top-reports 0The minimal path uses Cloudera Manager for Impala Recent discovery and profile collection. Add metadata, CM time-series, direct Impala, Prometheus, or LLM settings only after this basic scan path works. See docs/configuration.md and docs/credentials.md.
query-doctor-self-test --help: local installed-package confidence check over synthetic data and core offline user paths.query-doctor-web --help: local browser UI for Recent scan, Running now, one Known Query ID, Details pages, explicit report actions, and explicit details-page optimizer actions.query-doctor-batch-recent --help: headless Recent scan workflow for bounded local collection and ranking.query-doctor-analyze --help: deterministic analyzer over collected local case files, or over one staged local exported Impala text profile.query-doctor-report --help: validated report generation from Python-owned facts.query-doctor-optimize-query --help: read-only pasted-SQL optimizer review.
Every packaged console script accepts --help. Root-level compatibility
launchers have been removed; use query-doctor-* commands or
python -m query_doctor.cli.<command_module> from an uninstalled checkout.
Query Doctor is supported as a single-user, local-first tool run by an operator
with their own local Cloudera Manager, Kerberos, Impala, Prometheus, and LLM
credentials. Use localhost or a tightly controlled local bind for the web UI.
Do not deploy ordinary local mode as a shared service without a separate design
for authentication, authorization, tenant/job isolation, audit logging,
TLS/reverse-proxy trust, and resource limits. Shared public demos should use
the read-only query-doctor-web --public-demo mode. Shared owner_raw source
access requires authenticated per-request viewer identity; today that means an
explicit viewer_identity_header supplied only by a trusted auth proxy or
ingress that strips inbound copies of the same header. The shared/non-local
deployment checklist is
docs/owner-raw-d3-deployment.md.
- Python/analyzer-owned facts are the only trusted diagnostic evidence.
- Raw LLM output is untrusted unless normalized, sanitized, and validated.
- Trusted browser/report surfaces must not expose raw SQL, raw profiles, raw metadata, local paths, secrets, subprocess output, model/runtime internals, or raw artifact filenames. The isolated owner-only selected-case source surface is the narrow raw-SQL browser exception.
- External collection must be explicit, bounded, read-only, redacted, and safe by default.
- Local config
privacy_modedefaults totrue; disabling it can relax local artifact identifier/host masking, but trusted browser/report surfaces still do not show raw SQL, profiles, or metadata. - Local config
no_llm=truekeeps report and optimizer actions on deterministic Python-owned output. - SQL browser exceptions are selected-case and owner-gated: Details can show a
validated optimizer SQL draft for an explicit safe-to-attempt optimizer action
when
source_visibility=owner_raw, and the isolated owner-only source view can show read-only original SQL for an authorized query owner. On localhost, raw viewer subjects come from local collectable owner users; on shared binds they must come from authenticated per-request viewer identity. The original source view can be disabled globally withowner_raw_source_enabled=falseor--disable-owner-raw-source, and each attempt writes a reason-coded raw-free server audit line. The defaultsafemode shows trusted recommendations/no-rewrite guidance instead. - Query Optimizer accepts only a single safe read-only statement and never executes pasted SQL.
See docs/safety-contract.md for the full trust and redaction contract, including the narrow owner-only source exception. For shared/non-local owner-raw deployment, use docs/owner-raw-d3-deployment.md. For a reviewer-oriented overview, see docs/security-model.md.
Start with docs/README.md. It separates current user docs, operations guides, architecture contracts, audit docs, and supporting references.
High-value next reads:
- docs/demo-mode.md: synthetic demo pack generation and README screenshot refresh path.
- docs/DEMO.md: localhost UI demo runbook and talk track.
- docs/local-smoke.md: local validation and smoke checks.
- docs/credentials.md: local credentials layout.
- docs/roadmap.md: implemented scope and planned seams.
- docs/security-model.md: public security, privacy, and demo-sharing overview.
- docs/query-optimizer-contract.md: optimizer trust boundary.
- docs/release-checklist.md: final tag, package-index, and visibility-change checklist.
The canonical documentation language is English. The main Russian companion README is README.ru.md; additional Russian companion pages live under docs/i18n/ru/ where useful.
For ordinary changes, run focused tests for the touched area and always run:
git diff --checkUse docs/agent-quickstart.md and docs/test-matrix.md to choose focused validation. Before release cleanup or public-sharing work, broaden to:
pre-commit run --all-files
scripts/local_gate.sh
query-doctor-demo-preflight --public-releaseStage only explicit files. Do not commit generated cases, reports, local configs, credentials, raw profiles, raw metadata, or temporary outputs.
This repository is public. Public source releases start at v0.4.2; v0.9.0
continues that public source release line. Older package-index releases remain
visible on
query-doctor on PyPI where needed for
installed-artifact history. The public license is Apache-2.0.
PyPI publishing uses GitHub OIDC Trusted Publishing. The repository-side
testpypi and pypi environments require maintainer approval and do not use
stored package-index API tokens.
Query Doctor is licensed under the Apache License, Version 2.0
(Apache-2.0). See LICENSE.
Apache, Apache Impala, and Impala are trademarks of The Apache Software Foundation. Query Doctor is an independent project and is not endorsed by The Apache Software Foundation or the Apache Impala project.

