11 changes: 11 additions & 0 deletions changelog.d/21389_host_metrics_temperature.feature.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
The `host_metrics` source can now collect hardware temperature readings via a
new `temperature` collector. When enabled, it emits `temperature_celsius`,
`temperature_max_celsius`, and `temperature_critical_celsius` gauges, each
tagged with the `component` label of the sensor it was read from.

The collector is opt-in: add `temperature` to the `collectors` list to enable
it. Components that do not report a given value (for example a missing critical
threshold) are skipped, and environments without temperature sensors simply
produce no metrics.

authors: somaz94
10 changes: 9 additions & 1 deletion src/sources/host_metrics/mod.rs
@@ -40,6 +40,7 @@ mod network;
mod process;
#[cfg(target_os = "linux")]
mod tcp;
mod temperature;

/// Collector types.
#[serde_as]
@@ -78,6 +79,9 @@ pub enum Collector {

/// Metrics related to TCP connections.
TCP,

/// Metrics related to component temperatures.
Temperature,
}

/// Filtering configuration.
@@ -186,7 +190,7 @@ pub fn default_namespace() -> Option<String> {
Some(String::from("host"))
}

const fn example_collectors() -> [&'static str; 9] {
const fn example_collectors() -> [&'static str; 10] {
[
"cgroups",
"cpu",
@@ -197,6 +201,7 @@ const fn example_collectors() -> [&'static str; 9] {
"memory",
"network",
"tcp",
"temperature",
]
}

@@ -420,6 +425,9 @@ impl HostMetrics {
if self.config.has_collector(Collector::TCP) {
self.tcp_metrics(&mut buffer).await;
}
if self.config.has_collector(Collector::Temperature) {
self.temperature_metrics(&mut buffer).await;
}

let metrics = buffer.metrics;
self.events_received.emit(CountByteSize(
66 changes: 66 additions & 0 deletions src/sources/host_metrics/temperature.rs
@@ -0,0 +1,66 @@
use sysinfo::Components;
use vector_lib::metric_tags;

use super::HostMetrics;

const COMPONENT: &str = "component";
const TEMPERATURE_CELSIUS: &str = "temperature_celsius";
const TEMPERATURE_MAX_CELSIUS: &str = "temperature_max_celsius";
const TEMPERATURE_CRITICAL_CELSIUS: &str = "temperature_critical_celsius";

impl HostMetrics {
pub async fn temperature_metrics(&self, output: &mut super::MetricsBuffer) {
output.name = "temperature";
let components = Components::new_with_refreshed_list();

P2: Persist Components before reporting max temperatures

When a Linux sensor does not expose a kernel tempN_highest file, sysinfo::Component::max() is computed by comparing successive refreshes of the same Component. Recreating Components on every temperature_metrics call resets that history, so temperature_max_celsius becomes the current sample on each scrape rather than the highest observed temperature. Keep the Components collection on HostMetrics and refresh it between scrapes, or avoid emitting the computed max when no persistent history is available.
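A minimal sketch of the history problem described above, assuming a hypothetical `MaxTracker` type that is not part of this PR or of sysinfo. It only models the idea: state that lives on the collector survives between scrapes, while state rebuilt on every call does not.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not in the PR): remembers the highest reading seen
/// per sensor so that the maximum survives across scrapes.
#[derive(Default)]
pub struct MaxTracker {
    highest: HashMap<String, f32>,
}

impl MaxTracker {
    /// Records `current` for `sensor` and returns the highest finite value
    /// observed so far, or `None` if no finite reading has been seen yet.
    pub fn observe(&mut self, sensor: &str, current: f32) -> Option<f32> {
        if current.is_finite() {
            let entry = self.highest.entry(sensor.to_owned()).or_insert(current);
            if current > *entry {
                *entry = current;
            }
        }
        self.highest.get(sensor).copied()
    }
}
```

Kept as a field on `HostMetrics`, a tracker like this (or the `Components` collection itself, as suggested above) would still report 55.0 after readings of 55.0 and then 48.0. One rebuilt on every scrape would report 48.0.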

P2: Honor SYSFS_ROOT when scraping temperatures

In containerized host-metrics deployments that mount the host sysfs somewhere like /host/sys and set SYSFS_ROOT, the other Linux collectors are redirected through init_roots(), but this sysinfo::Components call reads the process' normal sysfs path instead. Enabling the new collector in that documented setup will scrape the container's /sys and commonly emit no host temperature metrics even though the host sensors are mounted under SYSFS_ROOT.
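One possible shape for the path handling, as a sketch only. It assumes the collector would resolve sensor paths itself instead of relying on sysinfo's defaults, and `hwmon_dir` is a hypothetical helper, not an existing Vector or sysinfo function.

```rust
use std::path::PathBuf;

/// Hypothetical helper (not in the PR): resolves the hwmon directory under a
/// configurable sysfs root, falling back to `/sys` when the root is unset
/// or empty.
pub fn hwmon_dir(sysfs_root: Option<&str>) -> PathBuf {
    let root = sysfs_root.filter(|r| !r.is_empty()).unwrap_or("/sys");
    PathBuf::from(root).join("class").join("hwmon")
}
```

A caller could pass the environment value through with `hwmon_dir(std::env::var("SYSFS_ROOT").ok().as_deref())`, so a deployment that mounts the host sysfs at `/host/sys` would read `/host/sys/class/hwmon`.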

for component in &components {
let label = component.label();
let tags = || metric_tags!(COMPONENT => label);
Comment on lines +16 to +17

P2: Fall back to component IDs for empty labels

On Linux systems where sysinfo falls back from hwmon to /sys/class/thermal (for example Raspberry Pi-style environments), component.label() is empty while component.id() contains the thermal-zone identifier. Using the empty label as the only component tag makes all temperature series share the same tag set when more than one thermal zone is present, so downstream aggregation can collapse distinct sensors; use the ID as a fallback when the label is empty.
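The fallback could look roughly like the sketch below. `component_tag` is a hypothetical helper, and the `"unknown"` placeholder for sensors with neither a label nor an ID is an assumption, not something the PR or sysinfo specifies.

```rust
/// Hypothetical helper (not in the PR): picks the value for the `component`
/// tag. Prefers the label, falls back to the ID, and uses a placeholder
/// when both are missing or blank.
pub fn component_tag(label: &str, id: Option<&str>) -> String {
    let label = label.trim();
    if !label.is_empty() {
        return label.to_owned();
    }
    match id.map(str::trim).filter(|id| !id.is_empty()) {
        Some(id) => id.to_owned(),
        // Assumed placeholder; skipping the sensor would be another option.
        None => "unknown".to_owned(),
    }
}
```

With this in place, two thermal zones with empty labels would be tagged by their distinct IDs instead of sharing one empty `component` value.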

if let Some(temperature) = component.temperature() {
output.gauge(TEMPERATURE_CELSIUS, temperature as f64, tags());
}
if let Some(max) = component.max() {
output.gauge(TEMPERATURE_MAX_CELSIUS, max as f64, tags());
Comment on lines +18 to +22

P2: Drop NaN temperature readings

On Linux, sysinfo can return Some(f32::NAN) for temperature and max values when a sensor file exists but the read fails, and these branches emit that value as a normal gauge. In those sensor-error cases Vector will forward temperature_celsius/temperature_max_celsius samples with NaN values, which downstream metric sinks such as New Relic explicitly reject, so these readings should be filtered with is_finite() before creating metrics.
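A small sketch of the suggested filter, using a hypothetical `finite_reading` helper. It takes the `Option<f32>` shape that the accessors in this diff return and drops NaN and infinite values before widening to `f64`.

```rust
/// Hypothetical helper (not in the PR): keeps a sensor reading only when it
/// is a finite number, then widens it to `f64` for the gauge value.
pub fn finite_reading(value: Option<f32>) -> Option<f64> {
    value.filter(|v| v.is_finite()).map(f64::from)
}
```

Each branch could then be written along the lines of `if let Some(t) = finite_reading(component.temperature())`, so a failed sensor read produces no sample instead of a NaN gauge.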

}
if let Some(critical) = component.critical() {
output.gauge(TEMPERATURE_CRITICAL_CELSIUS, critical as f64, tags());
}
}
}
}

#[cfg(test)]
mod tests {
use super::{
super::{HostMetrics, HostMetricsConfig, MetricsBuffer, tests::all_gauges},
COMPONENT,
};

#[tokio::test]
async fn generates_temperature_metrics() {
let mut buffer = MetricsBuffer::new(None);
HostMetrics::new(HostMetricsConfig::default())
.temperature_metrics(&mut buffer)
.await;
let metrics = buffer.metrics;

// Temperature sensors are not exposed in many environments (containers,
// virtual machines, CI runners), so the component list can legitimately
// be empty. When metrics are produced, they must all be gauges named
// `temperature*` and carry the `component` tag.
assert!(all_gauges(&metrics));
for metric in &metrics {
assert!(
metric.name().starts_with("temperature"),
"unexpected metric name: {}",
metric.name()
);
assert!(
metric
.tags()
.expect("temperature metric is missing tags")
.contains_key(COMPONENT),
"temperature metric is missing the `component` tag"
);
}
}
}
@@ -80,17 +80,18 @@ generated: components: sources: host_metrics: configuration: {

Only available on Linux.
"""
cpu: "Metrics related to CPU utilization."
disk: "Metrics related to disk I/O utilization."
filesystem: "Metrics related to filesystem space utilization."
host: "Metrics related to the host."
load: "Metrics related to the system load average."
memory: "Metrics related to memory utilization."
network: "Metrics related to network utilization."
process: "Metrics related to Process utilization."
tcp: "Metrics related to TCP connections."
cpu: "Metrics related to CPU utilization."
disk: "Metrics related to disk I/O utilization."
filesystem: "Metrics related to filesystem space utilization."
host: "Metrics related to the host."
load: "Metrics related to the system load average."
memory: "Metrics related to memory utilization."
network: "Metrics related to network utilization."
process: "Metrics related to Process utilization."
tcp: "Metrics related to TCP connections."
temperature: "Metrics related to component temperatures."
}
examples: ["cgroups", "cpu", "disk", "filesystem", "load", "host", "memory", "network", "tcp"]
examples: ["cgroups", "cpu", "disk", "filesystem", "load", "host", "memory", "network", "tcp", "temperature"]
}
}
}
17 changes: 17 additions & 0 deletions website/cue/reference/components/sources/host_metrics.cue
@@ -193,6 +193,11 @@ components: sources: host_metrics: {
}
}

// Host temperature
temperature_celsius: _host & _temperature_gauge & {description: "The current temperature reported by a hardware component, in degrees Celsius."}
temperature_max_celsius: _host & _temperature_gauge & {description: "The highest temperature recorded for a hardware component, in degrees Celsius."}
temperature_critical_celsius: _host & _temperature_gauge & {description: "The temperature at which a hardware component is considered critical, in degrees Celsius."}

// Helpers
_host: {
default_namespace: "host"
@@ -307,5 +312,17 @@ components: sources: host_metrics: {
}
}
}

_temperature_gauge: {
type: "gauge"
tags: _host_metrics_tags & {
collector: examples: ["temperature"]
component: {
description: "The label of the hardware component the temperature was read from."
required: true
examples: ["Core 0", "coretemp Package id 0", "nvme Composite"]
}
}
}
}
}