
Palo Alto Cortex XDR CCP 3.0.5 & Cortex XDR 3.0.2 - fix PaloAltoCortexXDR_Alerts_CL offset-pagination loop#14384

Open
v-krishnachi wants to merge 1 commit into Azure:master from v-krishnachi:fix/cortex-xdr-ccp-alerts-paging-icm-2605070040011703


v-krishnachi commented May 30, 2026


Solutions affected: Palo Alto Cortex XDR CCP (3.0.4 → 3.0.5) and the legacy Cortex XDR (3.0.1 → 3.0.2) — both ship the same CortexXdrAlerts RestApiPoller resource and therefore both carry the same defect.
ICM (internal): 2605070040011703
Type: Bug fix (request-body correction + sort-order correction)
Risk: Low — change is scoped to the CortexXdrAlerts RestApiPoller resource only; the other four streams (Agent / Endpoints / Incidents / Management) are byte-for-byte unchanged.


Symptom

A customer running the Palo Alto Cortex XDR CCP solution stopped ingesting any rows into PaloAltoCortexXDR_Alerts_CL, while the other four Cortex XDR streams (Agent / Endpoints / Incidents / Management) continued ingesting normally. Palo Alto's API-gateway telemetry showed the customer's API key sending ~120–200 requests/minute to POST /public_api/v1/alerts/get_alerts, sustained over many minutes, all with the same fixed creation_time start value (~2025-05-05). Palo Alto rate-limited the key with HTTP 429 and asked Microsoft to reduce the call volume.

Disabling and re-enabling the connector through the Sentinel portal twice did not help — the freshly recreated connector immediately re-entered the same loop against the still-throttled key.

Root cause

The CortexXdrAlerts rule in PollingConfig.json declares an offset-paging block:

"paging": {
  "offsetParaName":    "$.request_data.search_from",
  "pagingType":        "Offset",
  "offsetEndParaName": "$.request_data.search_to",
  "pageSize":          100
}

…but the request body it points at did not contain those two keys:

"queryParametersTemplate":
  "{ 'request_data': { 'filters': [
       { 'field': 'creation_time', 'operator': 'gte', 'value': {_QueryWindowStartTime} },
       { 'field': 'creation_time', 'operator': 'lte', 'value': {_QueryWindowEndTime}   }
     ],
     'sort': { 'field': 'creation_time', 'keyword': 'desc' }
  } }"
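One way to see the mismatch is to resolve the two paging JSONPaths against the rendered body. This is a minimal illustration, not CCP runtime code. It assumes the template becomes valid JSON once the placeholders are filled with Unix-millisecond integers and single quotes are swapped for double quotes, and it only understands simple `$.a.b` dotted paths. The helpers `render` and `has_path` are hypothetical.

```python
import json

# Pre-fix Alerts body, copied from the queryParametersTemplate above.
OLD_TEMPLATE = (
    "{ 'request_data': { 'filters': ["
    " { 'field': 'creation_time', 'operator': 'gte', 'value': {_QueryWindowStartTime} },"
    " { 'field': 'creation_time', 'operator': 'lte', 'value': {_QueryWindowEndTime} }"
    " ], 'sort': { 'field': 'creation_time', 'keyword': 'desc' } } }"
)

# Patched body: the two offset keys added, sort switched to ascending.
NEW_TEMPLATE = (OLD_TEMPLATE
                .replace("'sort'", "'search_from': 0, 'search_to': 100, 'sort'")
                .replace("'desc'", "'asc'"))


def render(template: str, start_ms: int, end_ms: int) -> dict:
    """Hypothetical renderer: fill the window placeholders, then parse as JSON."""
    body = (template
            .replace("{_QueryWindowStartTime}", str(start_ms))
            .replace("{_QueryWindowEndTime}", str(end_ms))
            .replace("'", '"'))
    return json.loads(body)


def has_path(doc: dict, path: str) -> bool:
    """Return True if a simple dotted JSONPath such as $.a.b resolves in doc."""
    node = doc
    for key in path.removeprefix("$.").split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


# Illustrative window: 2025-05-05 00:00 to 00:05 UTC, in Unix milliseconds.
for name, template in (("old", OLD_TEMPLATE), ("new", NEW_TEMPLATE)):
    body = render(template, 1_746_403_200_000, 1_746_403_500_000)
    for path in ("$.request_data.search_from", "$.request_data.search_to"):
        print(name, path, "resolves:", has_path(body, path))
```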

How CCP RestApiPoller offset paging actually works in the runtime:

  1. The runtime substitutes {_QueryWindowStartTime} / {_QueryWindowEndTime} from its internal cursor state, then issues page 1.
  2. After each response it inspects eventsJsonPaths ($.reply.alerts). If the page is "full" (count ≥ pageSize), it writes the next offsets to the JSONPaths configured in offsetParaName / offsetEndParaName and issues the next page.
  3. Paging stops only when a response returns fewer than pageSize rows under the configured JSONPath.
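The three steps above can be modelled as a small loop. This is a hypothetical simulation, not the CCP implementation: `fake_get_alerts` stands in for the XDR endpoint over a fixed list of alerts and applies PAN's documented defaults (0 and 100) when the offset keys are absent. The assumption that a JSONPath write to a missing node is silently skipped comes from this PR's root-cause analysis, not from published CCP documentation. `max_requests` is only a safety cap for the demo.

```python
PAGE_SIZE = 100


def fake_get_alerts(body: dict, alerts: list) -> list:
    """Stand-in for get_alerts: return alerts[search_from:search_to]."""
    rd = body["request_data"]
    start = rd.get("search_from", 0)       # PAN default when absent
    end = rd.get("search_to", PAGE_SIZE)   # PAN default when absent
    return alerts[start:end]


def poll_window(body: dict, alerts: list, max_requests: int = 50) -> int:
    """Return how many requests one poll cycle issues (capped at max_requests)."""
    rd = body["request_data"]
    offset = 0
    for request_count in range(1, max_requests + 1):
        page = fake_get_alerts(body, alerts)  # issue the current page
        if len(page) < PAGE_SIZE:             # step 3: a short page ends the cycle
            return request_count
        offset += PAGE_SIZE
        # Step 2: write the next offsets, but only into nodes that already exist.
        if "search_from" in rd:
            rd["search_from"] = offset
        if "search_to" in rd:
            rd["search_to"] = offset + PAGE_SIZE
    return max_requests


busy = list(range(250))  # 250 alerts in one window
print(poll_window({"request_data": {}}, busy))                                    # keys missing
print(poll_window({"request_data": {"search_from": 0, "search_to": 100}}, busy))  # keys seeded
```

With the keys missing, every request returns the same first 100 rows, so the cycle only stops at the demo cap (50); with the keys seeded, the same window takes three requests. A window holding fewer than 100 alerts exits after one request either way.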

Because $.request_data.search_from and $.request_data.search_to were not present in the body, the runtime's per-page writes were silent no-ops, and every page request went out identical to the previous one. Since the first page of a busy window was full, every identical follow-up request also came back full, so the runtime kept asking for more and a single 5-minute poll cycle entered an unbounded loop. sort.keyword = "desc" on creation_time compounded the problem: it makes XDR's offset pagination non-deterministic under load, because new alerts arriving mid-poll push existing alerts down by one and break offset semantics.
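The ordering argument can be illustrated with plain lists. The sketch assumes the server re-sorts the full result set on every request and slices it by offset, which is the usual model of offset pagination; it does not claim to reproduce XDR's exact behaviour. A page size of 3 keeps it readable, and one newer alert is inserted between page 1 and page 2.

```python
PAGE = 3  # tiny page size to keep the example readable


def fetch(rows: list, descending: bool, start: int) -> list:
    """Sort by creation_time, then return one page starting at offset `start`."""
    ordered = sorted(rows, reverse=descending)
    return ordered[start:start + PAGE]


def paginate_with_insert(descending: bool) -> list:
    """Page through the rows while one newer alert arrives after page 1."""
    rows = [10, 20, 30, 40, 50, 60]             # creation_time values already stored
    seen = fetch(rows, descending, 0)           # page 1
    rows.append(70)                             # a newer alert lands mid-poll
    seen += fetch(rows, descending, PAGE)       # page 2
    seen += fetch(rows, descending, 2 * PAGE)   # page 3 (short, so paging ends)
    return seen


print(paginate_with_insert(descending=True))
print(paginate_with_insert(descending=False))
```

With descending order the newer alert pushes every stored row down by one, so 40 is returned twice and 70 is never seen in this pass; with ascending order the newcomer lands after the rows already paged through.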

Because the loop never produced a "Success" cycle, _QueryWindowStartTime never advanced — every subsequent poll re-attempted the same frozen window. That is why Palo Alto's SRE team saw a fixed creation_time date repeated on every request.
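The frozen start time follows from a simple cursor rule. A minimal sketch, assuming a hypothetical `run_cycles` helper and assuming the start advances by exactly one 5-minute window after each successful cycle; the real runtime keeps this cursor server-side and may compute the next start differently (for example from the previous window's end).

```python
WINDOW_MS = 5 * 60 * 1000  # 5-minute poll window in Unix milliseconds


def run_cycles(start_ms: int, outcomes: list) -> list:
    """Return the window start used by each cycle, given per-cycle success flags.

    Assumption: the start moves forward by one window only after a successful
    cycle; an unfinished or failed cycle retries the same start next time.
    """
    starts = []
    for succeeded in outcomes:
        starts.append(start_ms)
        if succeeded:
            start_ms += WINDOW_MS
    return starts


# A cycle stuck in the paging loop never reports success, so the start never moves.
print(run_cycles(1_746_403_200_000, [False, False, False]))
```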

Why only Alerts was affected: Endpoints has no paging block, and Agent / Incidents / Management use the same paging block but their data volume per 5-minute window is virtually always below pageSize = 100, so the runtime exits on the very first page and the bug never triggers. The Alerts stream is the only high-volume event-style stream in this connector.

Reference documentation

Fix

Two surgical changes, applied only to the CortexXdrAlerts RestApiPoller resource in both solutions:

  1. Added 'search_from': 0, 'search_to': 100 to the Alerts queryParametersTemplate request body. The literal values are the seed values the runtime uses for page 1 and match PAN's documented defaults: search_from = 0 (zero-based) and search_to = 100, which is also PAN's hard 100-row page-size cap. On subsequent pages the runtime now has real JSON nodes to overwrite via offsetParaName / offsetEndParaName, so pagination advances correctly and exits on a short final page.
  2. Changed sort.keyword for creation_time from desc to asc. PAN explicitly allows the override, and offset pagination is only deterministic when the sort is ascending on the same field used in the filters: new alerts created mid-poll then append at the end of the result set instead of shifting every existing offset down by one.
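For reference, the patched Alerts body can be expressed as a Python dict, together with the offsets it would carry on successive pages. The dict mirrors the two changes listed above (window placeholders are shown as strings here; in the real template they are substituted unquoted). `page_offsets` is a hypothetical helper that assumes each page advances both offsets by `pageSize`, following the offsetParaName / offsetEndParaName overwrite pattern.

```python
import copy

PAGE_SIZE = 100

# Patched request body: seed offsets added, sort switched to ascending.
PATCHED_BODY = {
    "request_data": {
        "filters": [
            {"field": "creation_time", "operator": "gte", "value": "{_QueryWindowStartTime}"},
            {"field": "creation_time", "operator": "lte", "value": "{_QueryWindowEndTime}"},
        ],
        "search_from": 0,
        "search_to": 100,
        "sort": {"field": "creation_time", "keyword": "asc"},
    }
}


def page_offsets(body: dict, pages: int) -> list:
    """Return (search_from, search_to) for the first `pages` requests."""
    body = copy.deepcopy(body)  # leave the template itself untouched
    rd = body["request_data"]
    result = []
    for _ in range(pages):
        result.append((rd["search_from"], rd["search_to"]))
        rd["search_from"] += PAGE_SIZE
        rd["search_to"] += PAGE_SIZE
    return result


print(page_offsets(PATCHED_BODY, 3))
```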

Expected steady-state call rate for the Alerts stream drops from ~120–200 req/min (loop) to floor(N / 100) + 1 POSTs per 5-minute window, where N is the number of alerts in the window (the last request is the short or empty page that ends the cycle). That is typically 1–3 POSTs per cycle, i.e. ≈ 0.2–0.6 req/min, well inside PAN's documented limits.
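The arithmetic can be checked in a few lines. `requests_per_window` is a hypothetical helper that assumes the paging rule described earlier (a cycle ends on the first page with fewer than `pageSize` rows), so a window holding exactly 100 alerts costs two requests: one full page and one empty page.

```python
PAGE_SIZE = 100
WINDOW_MINUTES = 5


def requests_per_window(n_alerts: int, page_size: int = PAGE_SIZE) -> int:
    """Every full page, plus the final short (possibly empty) page."""
    return n_alerts // page_size + 1


def requests_per_minute(n_alerts: int) -> float:
    """Average request rate when each 5-minute window holds n_alerts alerts."""
    return requests_per_window(n_alerts) / WINDOW_MINUTES


for n in (0, 40, 100, 250):
    print(n, requests_per_window(n), requests_per_minute(n))
```

For example, 250 alerts in a window need 3 requests (0.6 req/min), and 40 alerts need 1 request (0.2 req/min).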

Alignment with the Palo Alto contract

| PAN "Get All Alerts" doc clause | Patched template | Aligned? |
|---|---|---|
| POST /public_api/v1/alerts/get_alerts, body root request_data | unchanged | |
| request_data.filters (AND-joined) | unchanged (creation_time gte AND creation_time lte) | |
| request_data.search_from: integer, zero-based, default 0 | seeded 0 | |
| request_data.search_to: integer, default 100, maximum result set size 100 | seeded 100 | ✅ matches PAN's hard cap exactly |
| (search_to − search_from) ≤ 100 | first page 100−0, then 200−100, 300−200, … | always 100 |
| request_data.sort: asc/desc both valid; default desc | overridden to asc | documented capability |
| Filter value is Unix milliseconds for time fields | queryTimeFormat: UnixTimestampInMills | |
| Response shape reply.alerts[] | eventsJsonPaths: ["$.reply.alerts"] | |
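The table's constraints can be turned into a pre-flight check on a rendered request_data object. `check_request_data` is a hypothetical helper that encodes only the rules listed in the table (zero-based search_from, at most 100 rows per request, asc/desc sort, with PAN's documented defaults when a key is absent); it is not a PAN-provided validator.

```python
MAX_PAGE = 100  # PAN's documented maximum result-set size per request


def check_request_data(rd: dict) -> list:
    """Return a list of contract violations for one request_data object."""
    problems = []
    start = rd.get("search_from", 0)        # documented default
    end = rd.get("search_to", MAX_PAGE)     # documented default
    if start < 0:
        problems.append("search_from must be >= 0")
    if end - start > MAX_PAGE:
        problems.append("search_to - search_from must be <= 100")
    keyword = rd.get("sort", {}).get("keyword", "desc")  # documented default
    if keyword not in ("asc", "desc"):
        problems.append("sort.keyword must be 'asc' or 'desc'")
    return problems


print(check_request_data({"search_from": 200, "search_to": 300,
                          "sort": {"field": "creation_time", "keyword": "asc"}}))
```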

Files changed (10)

Palo Alto Cortex XDR CCP solution (3.0.4 → 3.0.5)

  1. Solutions/Palo Alto Cortex XDR CCP/Data Connectors/CortexXDR_ccp/PollingConfig.json: Alerts request body + sort.
  2. Solutions/Palo Alto Cortex XDR CCP/Package/mainTemplate.json: same fix on the embedded CortexXdrAlerts resource; _solutionVersion (line 48) and contentPackages.properties.version (line 2675) bumped to 3.0.5.
  3. Solutions/Palo Alto Cortex XDR CCP/Data/Solution_CortexXDR.json: Version bumped to 3.0.5.
  4. Solutions/Palo Alto Cortex XDR CCP/ReleaseNotes.md: new 3.0.5 row.
  5. Solutions/Palo Alto Cortex XDR CCP/Package/3.0.5.zip: regenerated solution package.

Legacy Cortex XDR solution (3.0.1 → 3.0.2)

  1. Solutions/Cortex XDR/Data Connectors/CortexXDR_ccp/PollingConfig.json: Alerts request body + sort.
  2. Solutions/Cortex XDR/Package/mainTemplate.json: same fix on the embedded CortexXdrAlerts resource; _solutionVersion (line 48) and contentPackages.properties.version (line 3747) bumped to 3.0.2.
  3. Solutions/Cortex XDR/Data/Solution_CortexXDR.json: Version bumped to 3.0.2, which also fixes a pre-existing drift (the file was at 3.0.0 while mainTemplate was at 3.0.1).
  4. Solutions/Cortex XDR/ReleaseNotes.md: new 3.0.2 row.
  5. Solutions/Cortex XDR/Package/3.0.2.zip: regenerated solution package.

Regression analysis

  • Other four streams (Agent / Endpoints / Incidents / Management): unchanged. Endpoints has no paging block, and the others' data volume per window is virtually always below pageSize, so they exit on the first page and the same bug has not triggered for them. Touching them would only expand the regression surface for no benefit.
  • DataConnectorDefinition, DCR, custom tables, analytic rules, workbooks, parsers: unchanged in this PR.
  • Schema / TimeGenerated mapping for PaloAltoCortexXDR_Alerts_CL: unchanged. Switching sort from desc to asc changes only the order in which rows are paged within a window, not which rows are ingested.
  • Auth flow / DCE / DCR routing: unchanged.
  • ContentHub upgrade path for existing customers: on upgrade, ARM does an in-place PUT of the CortexXdrAlerts<GUID> resource. The server-managed nextQueryWindowStartTime cursor is not in the request body schema and is preserved, so existing customers resume ingestion from their current cursor (no data loss, no duplicate ingestion).

Validation

End-to-end resolution confirmed on the ICM customer's workspace after deploying the fix:

  • PaloAltoCortexXDR_Alerts_CL ingesting normally.
  • creation_time gte in successive XDR requests advancing every 5-minute cycle (the previously frozen 2025-05-05 value is no longer stuck).
  • Zero HTTP 429s observed on the customer's API key over a 30-minute observation window.
  • Per-minute POST count to /public_api/v1/alerts/get_alerts from the customer's key dropped from ~120–200 to single-digit req/min.

Notes for reviewers

  • The change to sort.keyword is intentional and required. Using desc with offset paging on a continuously growing dataset is non-deterministic, so even with writable search_from / search_to nodes, rows could shift between pages mid-poll. Both fixes are needed together; either one alone would not fully resolve the symptom.
  • The literal values 'search_from': 0, 'search_to': 100 are seed values for page 1. The CCP runtime overwrites both nodes for subsequent pages using the paging.offsetParaName / paging.offsetEndParaName JSONPaths — no further per-page configuration is needed.
  • Solution version strings are now consistent across mainTemplate.json (variables._solutionVersion and contentPackages.properties.version), Data/Solution_CortexXDR.json (Version), and ReleaseNotes.md. The legacy Cortex XDR solution's Data/Solution_CortexXDR.json had a pre-existing Version: 3.0.0 while its mainTemplate was already at 3.0.1; this PR aligns both to 3.0.2.
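A version-consistency check along these lines could catch the kind of drift described in the last note. It is a sketch under assumptions: it takes already-parsed JSON objects, assumes the versions live at `variables._solutionVersion` in mainTemplate.json, in `properties.version` of a resource whose `type` ends in `contentPackages`, and at the top-level `Version` key of Data/Solution_CortexXDR.json. ReleaseNotes.md is not covered, and `solution_versions` / `is_consistent` are hypothetical names.

```python
def solution_versions(main_template: dict, solution_data: dict) -> dict:
    """Collect every version string that should agree for one solution."""
    versions = {
        "mainTemplate variables._solutionVersion":
            main_template["variables"]["_solutionVersion"],
        "Solution_CortexXDR.json Version": solution_data["Version"],
    }
    for resource in main_template.get("resources", []):
        if resource.get("type", "").endswith("contentPackages"):
            versions["mainTemplate contentPackages properties.version"] = (
                resource["properties"]["version"])
    return versions


def is_consistent(versions: dict) -> bool:
    """True when all collected version strings are identical."""
    return len(set(versions.values())) == 1


main = {
    "variables": {"_solutionVersion": "3.0.2"},
    "resources": [{
        "type": "Microsoft.OperationalInsights/workspaces/providers/contentPackages",
        "properties": {"version": "3.0.2"},
    }],
}
print(is_consistent(solution_versions(main, {"Version": "3.0.0"})))  # pre-existing drift
print(is_consistent(solution_versions(main, {"Version": "3.0.2"})))  # aligned
```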

… loop

Fixes runaway offset-pagination loop on PaloAltoCortexXDR_Alerts_CL that produced HTTP 429 throttling from the Cortex XDR API.

Root cause: the CortexXdrAlerts RestApiPoller declared pagingType=Offset with JSONPaths $.request_data.search_from / $.request_data.search_to, but queryParametersTemplate did not contain those keys, so the CCP runtime's per-page offset writes silently no-op'd and every page request was identical. Combined with sort.keyword=desc on creation_time (non-deterministic offset paging under load) the runtime entered an unbounded paging loop, never advanced _QueryWindowStartTime, and re-emitted the same fixed creation_time on every request.

Fix: add 'search_from':0, 'search_to':100 to the Alerts queryParametersTemplate (matches PAN's documented defaults and 100-row page-size cap) and switch sort.keyword to 'asc' so offset pagination is deterministic. Scope is the CortexXdrAlerts stream only; the other four streams (Agent / Endpoints / Incidents / Management) are byte-for-byte unchanged. Same fix applied to the legacy Cortex XDR solution (3.0.1 -> 3.0.2). Solution version strings bumped in mainTemplate.json (_solutionVersion and contentPackages.properties.version), Data/Solution_CortexXDR.json, and ReleaseNotes.md; package zips regenerated.
v-krishnachi requested review from a team as code owners May 30, 2026 04:27
