feat(examples): add service-localization multi-domain agent demo by qiansheng91 · Pull Request #36 · alibaba/UnifiedModel

qiansheng91 · 2026-06-11T06:10:23Z

Summary

A new scenario-driven example pack — examples/service-localization/ — showing how an AI agent uses UModel to fetch telemetry and localize a bottleneck down a four-layer request stack (product API → service → datastore → infrastructure). Data retrieval is the hero.

It complements Incident Investigation: that demo is reactive root-cause analysis (symptom → cause via a runbook, walking horizontally across callers + business); this one is vertical localization — walk the request stack down, fetch the signal at each hop, attribute the latency to a layer.

Scenario

checkout-api breaches its 300ms latency SLO. Walking the critical path one hop at a time and fetching the saturation signal at each layer localizes the cause:

Checkout Flow (impacted) → checkout-api (degraded) → order-svc (degraded, CPU healthy)
  → orders-db (SATURATED — connection pool ~98%)  ← root
  → node-a (healthy)                               ← infra ruled out

Sibling services (catalog/search/payment/inventory) and all infra nodes are healthy, so the localization narrows cleanly to the datastore connection pool.
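The hop-by-hop walk described above can be sketched as follows. This is a minimal illustration, not the pack's code: the `Entity` type, the `ENTITIES` and `NEXT_HOP` tables, and `localize` are hypothetical stand-ins, and only the entity names and statuses are taken from the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    layer: str
    status: str  # "healthy", "degraded", or "saturated"


# Hypothetical stand-in for the planted topology; names and statuses follow the scenario.
ENTITIES = {
    "checkout-api": Entity("checkout-api", "product API", "degraded"),
    "order-svc": Entity("order-svc", "service", "degraded"),
    "orders-db": Entity("orders-db", "datastore", "saturated"),
    "node-a": Entity("node-a", "infrastructure", "healthy"),
}

# One downstream hop per entity along the critical path (assumed shape).
NEXT_HOP = {
    "checkout-api": "order-svc",
    "order-svc": "orders-db",
    "orders-db": "node-a",
}


def localize(entry: str) -> Entity:
    """Walk down one hop at a time and return the deepest unhealthy entity."""
    culprit = ENTITIES[entry]
    current = entry
    while current in NEXT_HOP:
        below = ENTITIES[NEXT_HOP[current]]
        if below.status == "healthy":
            break  # the layer below is ruled out, so the culprit stays put
        culprit = below
        current = below.name
    return culprit


root = localize("checkout-api")
print(f"{root.name} ({root.layer}): {root.status}")
```

The walk stops at the first healthy layer, which is how the sketch mirrors "infra ruled out" in the scenario.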

What's in the pack (4 domains)

- 6 entity sets: product.api, product.journey, service.app, data.store, infra.node, infra.pod
- 7 entity_set_links forming the vertical stack: depends_on / calls / reads_writes / runs_on / hosted_on / scheduled_on
- 4 metric_sets (including data.store.metrics.connection_pool_usage, the localization signal), 1 log_set, and prometheus + elasticsearch storage, with 5 data_links and 5 storage_links
- 23 entities / 29 relations encoding the planted bottleneck (md5-hex ids per CMS 2.0)
- bilingual README with a 5-step data-retrieval walkthrough (find → discover datasets → step down the path → pull the signal → rule out the layer below)
- MCP-driven test-integration.sh (16/16), using the correct query arg key and the safe PASS=$((PASS+1)) counter idiom (the bugs fixed in feat(examples): add telemetry layer to incident-investigation demo #32)
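For readers unfamiliar with md5-hex ids, the sketch below shows what such an id looks like. The PR does not say which string is hashed, so the `domain/entity_set/name` key layout and the `entity_id` helper are assumptions made purely for illustration.

```python
import hashlib


def entity_id(domain: str, entity_set: str, name: str) -> str:
    """Return a 32-character lowercase hex digest for an entity."""
    key = f"{domain}/{entity_set}/{name}"  # hypothetical key layout, not from the PR
    return hashlib.md5(key.encode("utf-8")).hexdigest()


eid = entity_id("data", "data.store", "orders-db")
print(len(eid), eid)
```

The same input always produces the same id, which is what lets a sample pack hard-code relations between entities.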

Registered in internal/sampledata sampleCatalog as service-localization (aliases bottleneck-localization, examples/service-localization) and linked from the root README (en/cn) + docs index, so it is discoverable from day one.

Test plan

- make example-validate: all 30 new model YAMLs pass
- make ci: green
- examples/service-localization/test-integration.sh: 16/16
- Manual against make quickstart QUICKSTART_SAMPLE=examples/service-localization:
  - .entity ... query='degraded' → checkout-api
  - getDirectRelations walks checkout-api → order-svc → orders-db → node-a
  - get_metrics('data','data.store.metrics','connection_pool_usage') → Prometheus plan with the orders-db id substituted
  - list_data_set surfaces the service metric + log sets
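The get_metrics check returns a query plan rather than data, with the entity id filled into the query. A rough sketch of that idea follows. The PromQL metric name, the `entity_id` label, the plan's dictionary shape, and the placeholder id are all hypothetical and not taken from the pack.

```python
import hashlib
from string import Template

# Hypothetical PromQL template; the metric and label names are illustrative only.
PLAN_TEMPLATE = Template('db_connection_pool_usage_ratio{entity_id="$entity_id"}')


def render_plan(entity_id: str) -> dict:
    """Return a query plan (not a query result) for the given entity."""
    return {
        "storage": "prometheus",
        "query": PLAN_TEMPLATE.substitute(entity_id=entity_id),
    }


# Placeholder id for the sketch; the real pack defines its own md5-hex ids.
orders_db_id = hashlib.md5(b"orders-db").hexdigest()
plan = render_plan(orders_db_id)
print(plan["query"])
```

A test in this style only needs to assert that the rendered query contains the expected id, which works offline because nothing is executed against Prometheus.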

Notes for reviewers

- One-hop traversal. The walkthrough uses getDirectRelations per hop rather than a single deep getNeighborNodes. In the memory graphstore, getNeighborNodes with depth > 1 does not expand transitively (depth 2 == depth 1; depth 3 returns empty), and one-hop stepping is also the more faithful model of how an agent localizes. Flagging in case the multi-hop behavior is worth a separate look.
- Plan-only, matching the open-source plan-provider boundary: the bottleneck lives in entity status + the planted topology, so the path is fully reproducible offline.
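As a point of comparison for the multi-hop note, a transitive expansion can be composed from one-hop lookups with a breadth-first walk. The sketch below uses a hypothetical in-memory adjacency table, and `get_direct_relations` is a local stand-in, not the real UModel call.

```python
from collections import deque

# Hypothetical adjacency for the critical path in the scenario.
DIRECT = {
    "checkout-api": ["order-svc"],
    "order-svc": ["orders-db"],
    "orders-db": ["node-a"],
    "node-a": [],
}


def get_direct_relations(node: str) -> list[str]:
    """Stand-in for a one-hop lookup; not the real UModel API."""
    return DIRECT.get(node, [])


def neighbors_within(start: str, depth: int) -> dict[str, int]:
    """Breadth-first expansion built from one-hop lookups; maps node -> hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in get_direct_relations(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    del seen[start]
    return seen


print(neighbors_within("checkout-api", 3))
# → {'order-svc': 1, 'orders-db': 2, 'node-a': 3}
```

With this shape, depth 2 is a strict superset of depth 1, which is the behavior the note reports as missing in the memory graphstore.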

Follow-up

The analysis/localization skills — both the model-resident runbook_set and standalone SKILL.md files — land in a follow-up PR stacked on this data pack.

A new scenario-driven example pack showing how an AI agent uses UModel to fetch telemetry and localize a bottleneck down a four-layer request stack (product API -> service -> datastore -> infrastructure). It complements incident-investigation: that demo is reactive root-cause analysis; this one is vertical localization with data retrieval as the hero.

Scenario: checkout-api breaches its latency SLO. Walking the critical path one hop at a time (getDirectRelations) and fetching the saturation signal at each layer localizes the cause to orders-db's connection pool (~98%), while the hosting node is healthy and sibling services are fine.

Contents (4 domains, all model files pass make example-validate):
- 6 entity sets: product.api/journey, service.app, data.store, infra.node/pod
- 7 entity_set_links forming the vertical stack (calls / reads_writes / runs_on / hosted_on / scheduled_on / depends_on)
- 4 metric_sets (incl. data.store.metrics.connection_pool_usage, the localization signal) + 1 log_set + prometheus/elasticsearch storage, with 5 data_links and 5 storage_links
- 23 entities / 29 relations encoding the planted bottleneck (md5-hex ids per CMS 2.0)
- bilingual README with a 5-step data-retrieval walkthrough
- MCP-driven test-integration.sh (16/16), using the correct query arg key and the safe PASS counter idiom

Registered in internal/sampledata sampleCatalog as "service-localization" (aliases: bottleneck-localization, examples/service-localization) and linked from the root README (en/cn) and docs index so it is discoverable from day one.

The runbook + standalone agent skills (model-resident and SKILL.md forms) land in a follow-up PR on top of this data pack.

Verified: make example-validate, make ci, test-integration.sh (16/16), and manual get_metrics/get_logs/.topo probes against `make quickstart QUICKSTART_SAMPLE=examples/service-localization`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two ways to exercise the localization flow as a whole, not just piecemeal:
- examples/service-localization/demo.sh: a narrated, runnable replay of the agent's bottleneck-localization loop against a live server. Prints each SPL, the key result, and the reasoning per hop (symptom → entry → hop 1 calls → service CPU healthy → hop 2 reads_writes → datastore saturation PromQL → hop 3 hosted_on → node healthy → conclusion), and asserts the load-bearing facts (10 checks) so it doubles as a smoke gate. 10/10 against `make quickstart QUICKSTART_SAMPLE=examples/service-localization`.
- internal/bootstrap/localization_test.go (TestServiceLocalizationPath): an in-process gate that imports the sample and walks the same path via Query.Execute, asserting the topology hops, the connection_pool_usage plan rendering with the orders-db id substituted, and dataset discovery. Runs under `make ci`, so the demo path can't silently rot if the sample data, links, or datasets drift.

Both READMEs point at demo.sh and note the CI coverage.

Verified: TestServiceLocalizationPath passes; make ci green; demo.sh 10/10.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qiansheng91 added the documentation and enhancement labels on Jun 11, 2026

qiansheng91 wants to merge 2 commits into main from codex/demo-service-localization
