feat(examples): add service-localization multi-domain agent demo #36

Open
qiansheng91 wants to merge 2 commits into main from codex/demo-service-localization

Summary

A new scenario-driven example pack — examples/service-localization/ — showing how an AI agent uses UModel to fetch telemetry and localize a bottleneck down a four-layer request stack (product API → service → datastore → infrastructure). Data retrieval is the hero.

It complements Incident Investigation: that demo is reactive root-cause analysis (symptom → cause via a runbook, walking horizontally across callers + business); this one is vertical localization — walk the request stack down, fetch the signal at each hop, attribute the latency to a layer.

Scenario

checkout-api breaches its 300ms latency SLO. Walking the critical path one hop at a time and fetching the saturation signal at each layer localizes the cause:

Checkout Flow (impacted) → checkout-api (degraded) → order-svc (degraded, CPU healthy)
  → orders-db (SATURATED — connection pool ~98%)  ← root
  → node-a (healthy)                               ← infra ruled out

Sibling services (catalog/search/payment/inventory) and all infra nodes are healthy, so the localization narrows cleanly to the datastore connection pool.
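The localization rule in this scenario can be sketched in a few lines: walk the critical path top-down and attribute the latency to the deepest layer whose saturation signal is over a threshold while everything below it is healthy. This is a minimal illustration, not code from the pack. The `Layer` type, the `localize` helper, the 0.90 threshold, and every signal value except the ~98% connection pool figure are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Layer:
    name: str
    signal: str        # saturation signal fetched at this hop
    saturation: float  # fraction of capacity in use, 0.0 to 1.0


# Critical path from the scenario, top to bottom.
# Only the orders-db value (~98%) comes from the PR; the rest are illustrative.
CRITICAL_PATH = [
    Layer("checkout-api", "worker_utilization", 0.55),   # hypothetical signal
    Layer("order-svc", "cpu_usage", 0.35),               # "CPU healthy"
    Layer("orders-db", "connection_pool_usage", 0.98),   # planted bottleneck
    Layer("node-a", "cpu_usage", 0.30),                  # infra ruled out
]


def localize(path: Sequence[Layer], threshold: float = 0.90) -> Optional[Layer]:
    """Return the saturated layer that has no saturated layer beneath it."""
    for i, layer in enumerate(path):
        if layer.saturation < threshold:
            continue  # the symptom may pass through here, but this is not the cause
        if all(lower.saturation < threshold for lower in path[i + 1:]):
            return layer
    return None


root = localize(CRITICAL_PATH)
print(root.name, root.signal)
```

Checking the layer below before returning mirrors the "rule out the layer below" step: a saturated datastore is only the root if the node hosting it is healthy.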

What's in the pack (4 domains)

  • 6 entity sets: product.api, product.journey, service.app, data.store, infra.node, infra.pod
  • 7 entity_set_links forming the vertical stack: depends_on / calls / reads_writes / runs_on / hosted_on / scheduled_on
  • 4 metric_sets — including data.store.metrics.connection_pool_usage (the localization signal) — 1 log_set, prometheus + elasticsearch storage, with 5 data_links and 5 storage_links
  • 23 entities / 29 relations encoding the planted bottleneck (md5-hex ids per CMS 2.0)
  • bilingual README with a 5-step data-retrieval walkthrough (find → discover datasets → step down the path → pull the signal → rule out the layer below)
  • MCP-driven test-integration.sh (16/16), using the correct query arg key and the safe PASS=$((PASS+1)) counter idiom (the bugs fixed in #32, "feat(examples): add telemetry layer to incident-investigation demo")

Registered in internal/sampledata sampleCatalog as service-localization (aliases bottleneck-localization, examples/service-localization) and linked from the root README (en/cn) + docs index, so it is discoverable from day one.

Test plan

  • make example-validate — all 30 new model YAMLs pass
  • make ci — green
  • examples/service-localization/test-integration.sh: 16/16
  • Manual against make quickstart QUICKSTART_SAMPLE=examples/service-localization:
    • .entity ... query='degraded' → checkout-api
    • getDirectRelations walks checkout-api → order-svc → orders-db → node-a
    • get_metrics('data','data.store.metrics','connection_pool_usage') → Prometheus plan with the orders-db id substituted
    • list_data_set surfaces the service metric + log sets
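For reviewers unfamiliar with plan-only output, the "Prometheus plan with the orders-db id substituted" check amounts to rendering a query template with the entity id filled in, without executing it. The sketch below shows that idea only. The template text, the `entity_id` label, the `render_plan` helper, and the way the id is derived (md5 of the bare name) are all assumptions for illustration; the real template lives in the pack's metric_set and the real id scheme is defined by CMS 2.0.

```python
import hashlib
from string import Template

# Hypothetical PromQL template; metric and label names are assumptions.
PLAN_TEMPLATE = Template('connection_pool_usage{entity_id="$entity_id"}')


def render_plan(template: Template, entity_id: str) -> str:
    """Fill the entity id into the query template without running the query."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a half-rendered plan cannot slip through.
    return template.substitute(entity_id=entity_id)


# Placeholder id: md5 hex of the entity name. The real input string
# for the hash is not specified here.
orders_db_id = hashlib.md5(b"orders-db").hexdigest()

plan = render_plan(PLAN_TEMPLATE, orders_db_id)
print(plan)
```

A test can then assert that the rendered plan contains the id and no leftover placeholder, which is the shape of check the manual probe performs by eye.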

Notes for reviewers

  • One-hop traversal. The walkthrough uses getDirectRelations per hop rather than a single deep getNeighborNodes. In the memory graphstore, getNeighborNodes depth > 1 does not expand transitively (depth 2 == depth 1; depth 3 returns empty), and one-hop stepping is also the more faithful model of how an agent localizes. Flagging in case the multi-hop behavior is worth a separate look.
  • Plan-only, matching the open-source plan-provider boundary — the bottleneck lives in entity status + the planted topology, so the path is fully reproducible offline.
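To make the one-hop note concrete, here is a small in-memory model of the scenario's path that contrasts repeated one-hop stepping with what a transitive depth-limited expansion would be expected to return. It is a reference sketch, not the memory graphstore's implementation. `direct_relations`, `neighbors_within`, and `walk_down` are hypothetical helpers, and only the three edges named in the walkthrough are modeled.

```python
from collections import deque

# Directed edges on the critical path: (source, relation, target).
EDGES = [
    ("checkout-api", "calls", "order-svc"),
    ("order-svc", "reads_writes", "orders-db"),
    ("orders-db", "hosted_on", "node-a"),
]


def direct_relations(node):
    """One hop: every (relation, target) leaving `node`."""
    return [(rel, dst) for src, rel, dst in EDGES if src == node]


def neighbors_within(node, depth):
    """Expected transitive expansion: nodes reachable in at most `depth` hops."""
    seen = {node}
    reached = []
    frontier = deque([(node, 0)])
    while frontier:
        current, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for _, dst in direct_relations(current):
            if dst not in seen:
                seen.add(dst)
                reached.append(dst)
                frontier.append((dst, dist + 1))
    return reached


def walk_down(start):
    """Step one hop at a time, as the walkthrough does, until a leaf is reached."""
    path = [start]
    current = start
    while True:
        hops = direct_relations(current)
        if not hops or hops[0][1] in path:  # leaf reached, or cycle guard
            return path
        current = hops[0][1]  # the modeled path has one outgoing edge per node
        path.append(current)


print(walk_down("checkout-api"))
print(neighbors_within("checkout-api", 2))
```

Under this reference model, depth 2 reaches one node more than depth 1 and depth 3 reaches node-a, which is the behavior the note reports as missing for `getNeighborNodes` in the memory graphstore.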

Follow-up

The analysis/localization skills — both the model-resident runbook_set and standalone SKILL.md files — land in a follow-up PR stacked on this data pack.

A new scenario-driven example pack showing how an AI agent uses UModel
to fetch telemetry and localize a bottleneck down a four-layer request
stack (product API -> service -> datastore -> infrastructure). It
complements incident-investigation: that demo is reactive root-cause
analysis; this one is vertical localization with data retrieval as the
hero.

Scenario: checkout-api breaches its latency SLO. Walking the critical
path one hop at a time (getDirectRelations) and fetching the saturation
signal at each layer localizes the cause to orders-db's connection pool
(~98%), while the hosting node is healthy and sibling services are fine.

Contents (4 domains, all model files pass make example-validate):
- 6 entity sets: product.api/journey, service.app, data.store,
  infra.node/pod
- 7 entity_set_links forming the vertical stack (calls / reads_writes /
  runs_on / hosted_on / scheduled_on / depends_on)
- 4 metric_sets (incl. data.store.metrics.connection_pool_usage, the
  localization signal) + 1 log_set + prometheus/elasticsearch storage,
  with 5 data_links and 5 storage_links
- 23 entities / 29 relations encoding the planted bottleneck (md5-hex
  ids per CMS 2.0)
- bilingual README with a 5-step data-retrieval walkthrough
- MCP-driven test-integration.sh (16/16), using the correct query arg
  key and the safe PASS counter idiom

Registered in internal/sampledata sampleCatalog as "service-localization"
(aliases: bottleneck-localization, examples/service-localization) and
linked from the root README (en/cn) and docs index so it is discoverable
from day one.

The runbook + standalone agent skills (model-resident and SKILL.md
forms) land in a follow-up PR on top of this data pack.

Verified: make example-validate, make ci, test-integration.sh (16/16),
and manual get_metrics/get_logs/.topo probes against
`make quickstart QUICKSTART_SAMPLE=examples/service-localization`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
qiansheng91 added the documentation and enhancement labels on Jun 11, 2026

Add two ways to exercise the localization flow as a whole, not just
piecemeal:

- examples/service-localization/demo.sh — a narrated, runnable replay of
  the agent's bottleneck-localization loop against a live server. Prints
  each SPL, the key result, and the reasoning per hop (symptom → entry →
  hop 1 calls → service CPU healthy → hop 2 reads_writes → datastore
  saturation PromQL → hop 3 hosted_on → node healthy → conclusion), and
  asserts the load-bearing facts (10 checks) so it doubles as a smoke
  gate. 10/10 against `make quickstart QUICKSTART_SAMPLE=examples/service-localization`.

- internal/bootstrap/localization_test.go (TestServiceLocalizationPath) —
  an in-process gate that imports the sample and walks the same path via
  Query.Execute, asserting the topology hops, the connection_pool_usage
  plan rendering with the orders-db id substituted, and dataset
  discovery. Runs under `make ci`, so the demo path can't silently rot
  if the sample data, links, or datasets drift.

Both READMEs point at demo.sh and note the CI coverage.

Verified: TestServiceLocalizationPath passes; make ci green; demo.sh 10/10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>