Duplicate node_hwmon_temp_* values #3673

@champtar

The hwmon collector assumes each hwmon* points to a different device. That might be true most of the time, but the kernel API lets a driver register several hwmon* entries for the same device without any problem.

The initial bug report is against prometheus-node-exporter-lua, but the reporter (@vgropp) has the same issue with node_exporter:
openwrt/packages#28092 (comment)

# HELP node_hwmon_chip_names Annotation metric for human-readable chip names
# TYPE node_hwmon_chip_names gauge
node_hwmon_chip_names{chip="ieee80211_phy0",chip_name="mt7996_phy0_0"} 1
node_hwmon_chip_names{chip="ieee80211_phy0",chip_name="mt7996_phy0_1"} 1
node_hwmon_chip_names{chip="ieee80211_phy0",chip_name="mt7996_phy0_2"} 1
node_hwmon_chip_names{chip="platform_pwm_fan",chip_name="pwmfan"} 1
node_hwmon_chip_names{chip="platform_sfp2",chip_name="sfp2"} 1
node_hwmon_chip_names{chip="thermal_thermal_zone0",chip_name="cpu_thermal"} 1

# HELP node_hwmon_temp_celsius Hardware monitor for temperature (input)
# TYPE node_hwmon_temp_celsius gauge
node_hwmon_temp_celsius{chip="ieee80211_phy0",sensor="temp1"} 55
node_hwmon_temp_celsius{chip="platform_sfp2",sensor="temp1"} 57.09
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp0"} 60.69
node_hwmon_temp_celsius{chip="thermal_thermal_zone0",sensor="temp1"} 60.552

and in the log output:

time=2026-05-26T16:09:18.951Z level=ERROR source=http.go:175 msg="error gathering metrics: 6 error(s) occurred:
* [from Gatherer #2] collected metric \"node_hwmon_temp_crit_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:110}} was collected before with the same name and label values
* [from Gatherer #2] collected metric \"node_hwmon_temp_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:55}} was collected before with the same name and label values
* [from Gatherer #2] collected metric \"node_hwmon_temp_max_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:120}} was collected before with the same name and label values
* [from Gatherer #2] collected metric \"node_hwmon_temp_crit_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:110}} was collected before with the same name and label values
* [from Gatherer #2] collected metric \"node_hwmon_temp_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:57}} was collected before with the same name and label values
* [from Gatherer #2] collected metric \"node_hwmon_temp_max_celsius\" { label:{name:\"chip\" value:\"ieee80211_phy0\"} label:{name:\"sensor\" value:\"temp1\"} gauge:{value:120}} was collected before with the same name and label values"

The mt7996 driver is upstream: https://github.com/torvalds/linux/blob/e43ffb69e0438cddd72aaa30898b4dc446f664f8/drivers/net/wireless/mediatek/mt76/mt7996/init.c#L283. This layout is unusual (I don't see any other bug reports), but after reading the kernel documentation I'm not sure that having 3 hwmon devices each exposing temp1, instead of 1 hwmon device exposing temp1/2/3, can be considered a bug. I will send an email upstream for a second opinion.

Would it make sense to remove the node_hwmon_chip_names metric and always have both chip and chip_name in the labels?