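Problem description
After a hot-removal notification is issued for the OCP NIC, the card is left in place (not physically pulled out). After a 20-minute wait, a forced power-off followed by a power-on is performed from the BMC web UI, and the BMC intermittently reports a "failed to obtain NIC temperature" alarm.
The alarm is configured as follows: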
"Event_TempFail": {
"EventKeyId": "PcieCard.PCIeCardTempFail",
"Condition": 1,
"Reading": "<=/Scanner_NIC_Lm75.Status; <=/Scanner_CardPowerGood.Value |> expr($2 == 0 ? 0 : $1)",
"OperatorId": 5,
"Enabled": true,
"Component": "#/Component_OCPCard",
"AdditionalInfo": "2",
"DescArg2": "#/Component_OCPCard.Name",
"DescArg3": "NICCard Chip"
}
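To make the gating in `Reading` easier to follow, here is a minimal Python sketch of the expression `$2 == 0 ? 0 : $1`. The function name `temp_fail_reading` is hypothetical. How `Condition` and `OperatorId` compare the resulting value is not modelled here.

```python
def temp_fail_reading(lm75_status: int, card_power_good: int) -> int:
    """Mirror expr($2 == 0 ? 0 : $1).

    $1 is Scanner_NIC_Lm75.Status, $2 is Scanner_CardPowerGood.Value.
    """
    # While the card reports no power, the sensor status is masked to 0.
    if card_power_good == 0:
        return 0
    # Once power good is 1, the raw sensor status passes through unchanged.
    return lm75_status


if __name__ == "__main__":
    # (card_power_good, lm75_status) pairs in the order seen in the first trace
    for power_good, status in [(1, 0), (0, 2), (0, 1), (1, 1), (1, 0)]:
        print(power_good, status, "->", temp_fail_reading(status, power_good))
```

The fourth pair is the problematic window: power good has already returned to 1 while the sensor status is still non-zero, so the non-zero status reaches the event.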
Additional information
The NIC power-good value comes from the following CPLD register on the expansion board:
"Scanner_CardPowerGood": {
"Chip": "#/Smc_ExpBoardSMC",
"Offset": 469771520,
"Size": 2,
"Mask": "${Slot} |> expr(($1 == 1) ? 2 : 512)",
"Type": 0,
"Period": 100,
"Debounce": "None",
"Value": 0
}
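As a reading aid, the decimal `Offset` and the slot-dependent `Mask` above can be decoded with a short Python sketch. The helper names are hypothetical. The assumption that a set masked bit is reported as `Value` 1 is only inferred from the 0/1 values in the traces, not from the Scanner implementation.

```python
OFFSET = 469771520  # "Offset" from Scanner_CardPowerGood, in decimal


def power_good_mask(slot: int) -> int:
    """Mirror expr(($1 == 1) ? 2 : 512): slot 1 uses bit 1, any other slot bit 9."""
    return 2 if slot == 1 else 512


def power_good_value(raw_register: int, slot: int) -> int:
    """Assumption: the scanner reports 1 when the masked bit is set, else 0."""
    return 1 if raw_register & power_good_mask(slot) else 0


if __name__ == "__main__":
    print(hex(OFFSET))                            # 0x1c002500
    print(power_good_mask(1).bit_length() - 1)    # bit index 1
    print(power_good_mask(2).bit_length() - 1)    # bit index 9
```

Bit 9 lies in the second byte of the register, which is consistent with `Size` being 2.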
The Status of Scanner_NIC_Lm75 indicates whether the temperature sensor is reachable:
"Chip_NICTempChip": {
"Address": 62,
"AddrWidth": 1,
"OffsetWidth": 1,
"WriteTmout": 100,
"ReadTmout": 100,
"HealthStatus": 0
},
"Scanner_NIC_Lm75": {
"Chip": "#/Chip_NICTempChip",
"Size": 1,
"Offset": 1,
"Mask": 255,
"Period": 1000,
"Type": 0,
"Debounce": "#/MidAvg_NICTemp"
}
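It was previously confirmed with the hardware team that NIC power is cut completely after a hot-removal notification. The condition was therefore configured so that an unreachable lm75 does not raise an alarm while the NIC power state is 0. However, in the scenario described above (hot-removal notification issued for the OCP NIC, card not physically removed, 20-minute wait), after the forced power-off and power-on, Scanner_CardPowerGood changes to 1 first, and only afterwards does the Scanner_NIC_Lm75 Status change to 0 (that is, lm75 becomes reachable over I2C). The alarm is therefore raised during that interval, which lasts about 50 s.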
Event Event_TempFail_010108 Reading: 0 ["expr($2 == 0 ? 0 : $1)"]
$1: bmc.kepler.hwproxy /bmc/kepler/Scanner/Scanner_NIC_Lm75_010108 bmc.kepler.Scanner Status [["2026-03-19 04:52:23","sig_properties_changed",2],["2026-03-19 04:57:25","sig_properties_changed",1],["2026-03-19 05:45:38","sig_properties_changed",0]]
$2: bmc.kepler.hwproxy /bmc/kepler/Scanner/Scanner_CardPowerGood_010108 bmc.kepler.Scanner Value [["2026-03-19 04:31:25","fetch",1],["2026-03-19 04:52:22","sig_properties_changed",0],["2026-03-19 05:45:26","sig_properties_changed",1]]
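If the forced power-off and power-on is performed only 3 minutes after clicking hot-plug, Scanner_CardPowerGood again changes to 1 first and the Scanner_NIC_Lm75 Status changes to 0 afterwards (that is, lm75 becomes reachable over I2C), with a gap of about 10 s.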
Event Event_TempFail_010108 Reading: 0 ["expr($2 == 0 ? 0 : $1)"]
$1: bmc.kepler.hwproxy /bmc/kepler/Scanner/Scanner_NIC_Lm75_010108 bmc.kepler.Scanner Status [["2026-03-20 08:17:21","fetch",0],["2026-03-20 08:31:24","sig_properties_changed",2],["2026-03-20 08:33:26","sig_properties_changed",0]]
$2: bmc.kepler.hwproxy /bmc/kepler/Scanner/Scanner_CardPowerGood_010108 bmc.kepler.Scanner Value [["2026-03-20 08:17:21","fetch",1],["2026-03-20 08:31:23","sig_properties_changed",0],["2026-03-20 08:33:17","sig_properties_changed",1]]
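The delay between power good returning and the sensor becoming reachable can be read off the two traces with a small Python sketch. The helper names are hypothetical. The histories are copied from the `$1` and `$2` lines of the traces in this report.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def last_change_to(history, value):
    """Return the timestamp of the most recent entry whose value equals `value`."""
    for ts, _kind, v in reversed(history):
        if v == value:
            return datetime.strptime(ts, FMT)
    raise ValueError(f"value {value} never appears in history")


def recovery_gap_seconds(status_history, power_history):
    """Seconds from CardPowerGood -> 1 until Lm75 Status -> 0."""
    power_up = last_change_to(power_history, 1)
    reachable = last_change_to(status_history, 0)
    return (reachable - power_up).total_seconds()


# Trace 1: forced power cycle 20 minutes after the hot-removal notification
status_20min = [["2026-03-19 04:52:23", "sig_properties_changed", 2],
                ["2026-03-19 04:57:25", "sig_properties_changed", 1],
                ["2026-03-19 05:45:38", "sig_properties_changed", 0]]
power_20min = [["2026-03-19 04:31:25", "fetch", 1],
               ["2026-03-19 04:52:22", "sig_properties_changed", 0],
               ["2026-03-19 05:45:26", "sig_properties_changed", 1]]

# Trace 2: forced power cycle 3 minutes after the hot-removal notification
status_3min = [["2026-03-20 08:17:21", "fetch", 0],
               ["2026-03-20 08:31:24", "sig_properties_changed", 2],
               ["2026-03-20 08:33:26", "sig_properties_changed", 0]]
power_3min = [["2026-03-20 08:17:21", "fetch", 1],
              ["2026-03-20 08:31:23", "sig_properties_changed", 0],
              ["2026-03-20 08:33:17", "sig_properties_changed", 1]]

if __name__ == "__main__":
    print(recovery_gap_seconds(status_20min, power_20min))  # 12.0
    print(recovery_gap_seconds(status_3min, power_3min))    # 9.0
```

For the two traces quoted here, the property-change timestamps give 12 s and 9 s. The first is shorter than the roughly 50 s reported for the 20-minute case, so that figure may come from a different run or measurement point.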
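Environment
- Software version: OpenUBMC LTS SP1
Steps to reproduce
- On the NIC page, click hot-plug for the OCP card
- Wait 20 minutes
- Perform a forced power-off, then power on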
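Expected result
No alarm is raised.
If the gap were short, the BMC could add debouncing, but a gap of about 50 s needs further analysis by the hardware team.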
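The debouncing idea mentioned above could look roughly like the following Python sketch of a power-on hold-off. It is an illustration only, not the existing openUBMC `Debounce` mechanism. The class `PowerOnHoldOff` and the parameter `hold_off_s` are hypothetical.

```python
class PowerOnHoldOff:
    """Mask the sensor status for `hold_off_s` seconds after power good rises."""

    def __init__(self, hold_off_s: float):
        self.hold_off_s = hold_off_s
        self._power_up_at = None  # time at which power good was first seen as 1

    def reading(self, now: float, lm75_status: int, card_power_good: int) -> int:
        if card_power_good == 0:
            # Card unpowered: same behaviour as expr($2 == 0 ? 0 : $1).
            self._power_up_at = None
            return 0
        if self._power_up_at is None:
            # Rising edge of power good: start the hold-off window.
            self._power_up_at = now
        if now - self._power_up_at < self.hold_off_s:
            # Sensor may not be reachable yet; do not report a failure.
            return 0
        return lm75_status


if __name__ == "__main__":
    gate = PowerOnHoldOff(hold_off_s=15.0)
    print(gate.reading(0.0, 2, 0))   # unpowered -> 0
    print(gate.reading(1.0, 1, 1))   # power just returned -> 0
    print(gate.reading(20.0, 1, 1))  # still failing after hold-off -> 1
```

A hold-off of this kind trades alarm latency for fewer false alarms: a window of 10 to 15 s delays a genuine sensor failure report only slightly, whereas a window long enough to cover 50 s would delay it by nearly a minute.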
