Problem description
While stress-testing an Atlas 300I A2 (32G) card, an "inspection failed" alarm briefly appears and then clears on its own.
Steps to reproduce
- Start the stress test: echo -e "Y\n" | ascend-dmi -p -dur 604800 -it 5 -pm history >> /home/workspace/stress.log &
- An alarm appears on the BMC.
- The alarm clears.
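Expected result
No alarm is raised.
Actual result
Attempted solutions
Checked the logs: the NPU power reading failed. It is unclear why the power reading failed and then recovered.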
2026-02-18 21:58:18.825103 event NOTICE: events.lua(110): System major count change 0 to 1 by [Event_InspectFail_01010509].
2026-02-18 21:58:18.826424 event NOTICE: events.lua(127): System Health change Normal to Major.
2026-02-18 21:58:18.835549 chassis NOTICE: led_policy_service.lua(82): get_event_health change to: Major
2026-02-18 21:58:18.854870 event NOTICE: hardware_event.lua(580): Event_InspectFail_01010509|{"source":{"expressions":["expr((($1 & 4) == 0) && (($2 & 32768) == 0) ? 0 : 1)"],"properties":[{"Service":"bmc.kepler.compute","Property":"FaultState","Interface":"bmc.kepler.Systems.NPUCard","Path":"/bmc/kepler/Systems/1/PCIeDevices/PCIeCards/NPUCards/NPUCard_1_01010509"},{"Service":"bmc.kepler.compute","Property":"PowerWatts","Interface":"bmc.kepler.Systems.NPUCard","Path":"/bmc/kepler/Systems/1/PCIeDevices/PCIeCards/NPUCards/NPUCard_1_01010509"}]},"value":[0,65535],"type":"synchronization"}
2026-02-18 21:58:19.012790 event NOTICE: abstract_event.lua(241): [Event_InspectFail_01010509] generate an event [assert] while reading change to [1].
2026-02-18 21:58:19.020750 redfish NOTICE: alarm.lua(600): received a event[26-0].
2026-02-18 21:58:19.020953 event_policy NOTICE: synchronizer.lua(253): received a event[26-0].
2026-02-18 21:58:19.184263 chassis NOTICE: led_service.lua(445): set SysHealLed state success, current state: 5, Capability: 2
2026-02-18 21:58:27.391932 event NOTICE: events.lua(110): System major count change 1 to 0 by [Event_InspectFail_01010509].
2026-02-18 21:58:27.396067 event NOTICE: events.lua(127): System Health change Major to Normal.
2026-02-18 21:58:27.405413 chassis NOTICE: led_policy_service.lua(82): get_event_health change to: Normal
2026-02-18 21:58:27.423189 event NOTICE: hardware_event.lua(580): Event_InspectFail_01010509|{"source":{"expressions":["expr((($1 & 4) == 0) && (($2 & 32768) == 0) ? 0 : 1)"],"properties":[{"Service":"bmc.kepler.compute","Property":"FaultState","Interface":"bmc.kepler.Systems.NPUCard","Path":"/bmc/kepler/Systems/1/PCIeDevices/PCIeCards/NPUCards/NPUCard_1_01010509"},{"Service":"bmc.kepler.compute","Property":"PowerWatts","Interface":"bmc.kepler.Systems.NPUCard","Path":"/bmc/kepler/Systems/1/PCIeDevices/PCIeCards/NPUCards/NPUCard_1_01010509"}]},"value":[0,3506],"type":"synchronization"}
2026-02-18 21:58:27.524207 event NOTICE: abstract_event.lua(241): [Event_InspectFail_01010509] generate an event [deassert] while reading change to [0].
2026-02-18 21:58:27.533649 event_policy NOTICE: synchronizer.lua(253): received a event[27-0].
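To make it easier to see why the alarm toggles, here is a small Python sketch that re-evaluates the event expression from the hardware_event.lua entries against the two logged "value" arrays. It assumes that $1 and $2 map to FaultState and PowerWatts in the order the properties are listed, and that the expression follows C-style precedence (&& binds tighter than ?:). The function name inspect_fail is made up for illustration and is not part of the BMC code.

```python
def inspect_fail(fault_state: int, power_watts: int) -> int:
    """Re-evaluate: (($1 & 4) == 0) && (($2 & 32768) == 0) ? 0 : 1

    Assumption: $1 is FaultState and $2 is PowerWatts, following the
    order of the "properties" array in the log entry.
    """
    healthy = (fault_state & 4) == 0 and (power_watts & 32768) == 0
    return 0 if healthy else 1


# The two "value" arrays copied from the log entries above.
readings = {
    "21:58:18 (assert)": (0, 65535),
    "21:58:27 (deassert)": (0, 3506),
}

for label, (fault_state, power_watts) in readings.items():
    result = inspect_fail(fault_state, power_watts)
    print(f"{label}: PowerWatts={power_watts} ({hex(power_watts)}) -> {result}")
```

Under these assumptions, the first reading (65535, i.e. 0xFFFF) has bit 15 (32768) set, so the expression evaluates to 1 and the event asserts. The second reading (3506) does not have that bit set, so it evaluates to 0 and the event deasserts. FaultState is 0 in both entries, so only the PowerWatts value drives the alarm here. Whether 0xFFFF is a deliberate "read failed" placeholder written by the BMC is a guess that the log alone does not confirm.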

