[Draft] Add kernel stack cost-per-packet metrics, nodeconfig collector, and A…#3555
midu16 wants to merge 1 commit into prometheus:master
…I-Helpers docs
- Add ebpf-pmd-jitter collector (Linux): in-tree eBPF program (collector/bpf/latency.c) measures kernel stack packet latency (XDP→TC); exposes latency min/max/avg, jitter, histogram, and collector_up/load_error/object_path_configured. Disabled by default; requires built eBPF object and --collector.ebpf-pmd-jitter.object-path.
- Add nodeconfig collector (Linux): runbook-oriented metrics from sysfs and DMI (PCIe NIC link width, slot ok, cores dedicated, memory banks full). Disabled by default.
- Add cmd/kernel_stack_stress_server: TCP server for functional test (variable backlog, rcvbuf, read delay, hold connections).
- Add kernel_stack_af_packet_functional_test.go: Linux-only root functional test (netns) for conntrack drops, listen overflow, TCPRcvQDrop, traffic+NUMA, traffic+pcap; preserves pcaps under /tmp/node_exporter_kernel_stack_pcaps_*.
- Add docs/KERNEL_STACK_AF_PACKET_METRICS.md: full guide correlating metrics with cost per packet and AF_PACKET, optional collectors, examples, and functional test.
karampok
left a comment
imo this should be one PR with two commits (or two PRs):
- one for nodeconfig_linux
- one for the eBPF collector
unless there is a connection that I am missing.
The commit/PR description should explain things like why we need those metrics, for example:
## Summary
<1-3 sentences: what this PR adds/changes and why>
Closes #NNNN (if applicable)
## Motivation
<2-4 sentences: the operational problem this solves,
what users cannot do today, or what issue this addresses>
## Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `node_<subsystem>_<name>` | Gauge/Counter | ... |
## Implementation notes
- Data source (procfs, sysfs, netlink, eBPF, etc.)
- Disabled by default, enable with `--collector.<name>`
- Build tag to exclude: `no<name>`
- Graceful degradation (ErrNoData when source missing)
- Cardinality bound
- Dependency changes (if any)
## Testing
- Unit tests added (`collector/<name>_test.go`)
- Fixture file (`collector/fixtures/...`)
- e2e golden output updated (if applicable)
- Manual validation on real hardware (if applicable)
## Example output
```text
# HELP node_<metric> ...
# TYPE node_<metric> gauge
node_<metric>{label="value"} 42
```
How should I view this file? What is excalidraw (a diagram, I suppose)?
You could either include it as a PNG in the markdown, or maybe it is not needed.
    return &nodeconfigCollector{
        fs: fs,
        logger: logger,
        pcieNICMinLinkWidthDesc: prometheus.NewDesc(
There is already https://github.com/prometheus/node_exporter/blob/master/collector/pcidevice_linux.go. Should those metrics be added there?
        ),
        pcieSlotOkDesc: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, nodeconfigSubsystem, "pcie_slot_ok"),
            "Whether PCIe slot/width is considered correct (1) or not (0). Derived from PCIe: 1 when minimum NIC link width >= 16, 0 otherwise. Absent if no network PCIe devices.",
I think what counts as ok or not ok should not be hardcoded in the metrics. Metrics should only state the value.
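As a rough illustration of that suggestion, the sketch below reports the minimum link width itself and leaves any "ok" threshold to alerting rules. The helper `minLinkWidth`, the sample readings, and the metric name `example_pcie_nic_min_link_width` are hypothetical and not taken from the PR; it assumes the inputs are raw strings such as those read from sysfs `current_link_width` files.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minLinkWidth returns the smallest PCIe link width among the given raw
// strings (hypothetical helper; inputs look like sysfs file contents).
// ok is false when nothing could be parsed, so the caller can omit the metric.
func minLinkWidth(raw []string) (width int, ok bool) {
	for _, s := range raw {
		w, err := strconv.Atoi(strings.TrimSpace(s))
		if err != nil || w <= 0 {
			continue // skip unreadable or unknown widths
		}
		if !ok || w < width {
			width, ok = w, true
		}
	}
	return width, ok
}

func main() {
	// Hypothetical readings for two NICs.
	if w, ok := minLinkWidth([]string{"16\n", "8\n"}); ok {
		// Emit the value itself; a comparison such as "< 16" belongs in an alert rule.
		fmt.Printf("example_pcie_nic_min_link_width %d\n", w)
	}
}
```

With the raw width exposed, operators with x8 NICs can pick their own threshold instead of inheriting a hardcoded 16.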
        ),
        coresDedicatedDesc: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, nodeconfigSubsystem, "cores_dedicated"),
            "Whether CPU cores are dedicated/isolated for workload (e.g. DPDK). 1 if at least one CPU is in /sys/devices/system/cpu/isolated, 0 otherwise.",
There should already be a collector gathering CPU metrics elsewhere; should this be added there?
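For reference, `/sys/devices/system/cpu/isolated` uses the kernel's CPU list format (for example `2-5,8`). The sketch below counts the CPUs in such a string, on the assumption that a count is more informative than a 0/1 flag. `countCPUList` is a hypothetical helper written for this comment, not code from the PR or from the existing CPU collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// countCPUList counts the CPUs named in a kernel cpulist string such as
// "2-5,8". An empty string means no CPUs and yields 0.
func countCPUList(list string) (int, error) {
	list = strings.TrimSpace(list)
	if list == "" {
		return 0, nil
	}
	total := 0
	for _, part := range strings.Split(list, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return 0, fmt.Errorf("invalid CPU %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return 0, fmt.Errorf("invalid CPU range %q: %w", part, err)
			}
		}
		if last < first {
			return 0, fmt.Errorf("descending CPU range %q", part)
		}
		total += last - first + 1
	}
	return total, nil
}

func main() {
	n, err := countCPUList("2-5,8\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(n) // CPUs 2, 3, 4, 5 and 8
}
```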
        ),
        memoryBanksFullDesc: prometheus.NewDesc(
            prometheus.BuildFQName(namespace, nodeconfigSubsystem, "memory_banks_full"),
            "Whether memory channels/banks are fully populated (1) or not (0). Derived from DMI/SMBIOS: 1 when all memory device slots have a populated DIMM, 0 otherwise. Absent if DMI not available.",
This one is brand new; it probably fits into a new collector, but with a different name (nodeconfig is a bit generic).
Do we need to add this main command to the repository? Is it required for metrics collection?
eBPF requires privileges to use, which is against our contributing guidelines.

@midu16 would you say this statement is accurate?

I recommend looking into the ebpf_exporter. This is a general-use eBPF metrics collector that is more suited to this functionality.