Atlas 300I Duo card raises alarm 0x0800007D during system DC Cycle

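When the system is restarted with a DC Cycle, the BMC raises alarm 0x0800007D, "The PCIe card 5 () voltage is too low", which clears after about 2 seconds.
The issue reproduces almost every time.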

The SEL log is as follows:
583 |2025-07-17 02:14:54 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
582 |2025-07-17 02:14:52 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
581 |2025-07-17 02:14:51 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
580 |2025-07-17 02:14:51 |Normal |0x2C000011 |Asserted |The host was restarted by command.
579 |2025-07-17 02:14:36 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
578 |2025-07-17 02:09:29 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
577 |2025-07-17 02:09:28 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
576 |2025-07-17 02:09:27 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
575 |2025-07-17 02:09:27 |Normal |0x2C000011 |Asserted |The host was restarted by command.
574 |2025-07-17 02:09:08 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
573 |2025-07-17 02:04:00 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7563, 7565, 7567, 7569, 7571, 7573, 7575, 7579, 7581, 7583, 7585, 7587, 7589, 7591.
572 |2025-07-17 02:03:58 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
571 |2025-07-17 02:03:58 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7563, 7565, 7567, 7569, 7571, 7573, 7575, 7579, 7581, 7583, 7585, 7587, 7589, 7591.
570 |2025-07-17 02:03:58 |Normal |0x2C000011 |Asserted |The host was restarted by command.
569 |2025-07-17 02:03:43 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
568 |2025-07-17 01:58:37 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
567 |2025-07-17 01:58:36 |Normal |0x2C000011 |Asserted |The host was restarted by command.
566 |2025-07-17 01:58:21 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
565 |2025-07-17 01:53:17 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
564 |2025-07-17 01:53:15 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
563 |2025-07-17 01:53:15 |Normal |0x2C000011 |Asserted |The host was restarted by command.
562 |2025-07-17 01:53:15 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
561 |2025-07-17 01:53:00 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
560 |2025-07-17 01:47:53 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
559 |2025-07-17 01:47:53 |Normal |0x2C000011 |Asserted |The host was restarted by command.
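
To check the "clears after about 2 seconds" observation against the SEL rows, the assert/deassert pairs can be matched up and timed. The following is only a minimal sketch: it assumes the pipe-separated layout shown above, and the function names `parse_sel_line` and `alarm_durations` are made up for illustration, not part of any BMC tool.

```python
from datetime import datetime

# A few rows copied from the SEL dump above (newest first, as the BMC prints them).
SAMPLE = [
    "583 |2025-07-17 02:14:54 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
    "582 |2025-07-17 02:14:52 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
    "578 |2025-07-17 02:09:29 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
    "577 |2025-07-17 02:09:28 |Normal |0x2C000009 |Asserted |ACPI is in the working state.",
    "576 |2025-07-17 02:09:27 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
]


def parse_sel_line(line):
    """Split one SEL row into (index, timestamp, severity, code, state, message)."""
    index, stamp, severity, code, state, message = (
        field.strip() for field in line.split("|", 5)
    )
    when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return int(index), when, severity, code, state, message


def alarm_durations(lines, assert_code="0x0800007D", deassert_code="0x0800007E"):
    """Pair each Asserted row with the following Deasserted row; return seconds."""
    rows = sorted(parse_sel_line(l) for l in lines if l.strip())  # oldest first
    durations, pending = [], None
    for _, when, _, code, state, _ in rows:
        if code == assert_code and state == "Asserted":
            pending = when
        elif code == deassert_code and state == "Deasserted" and pending is not None:
            durations.append((when - pending).total_seconds())
            pending = None
    return durations


if __name__ == "__main__":
    print(alarm_durations(SAMPLE))
```

Rows are sorted by the SEL index rather than by timestamp, because several events in the dump share the same second.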


The maintenance_log_20250717021605.bin log from the Atlas 300I Duo card is as follows:

2025-07-17 1:47:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:47:36 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:47:36 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:47:37 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:47:37 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:47:51 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:53:0 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:0 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:53:0 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:2 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:53:2 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:53:14 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:53:14 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 1:53:14 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 1:53:14 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:14 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 1:53:14 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 1:53:40 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:40 NORMAL:[write_vrd_version_to_log:173] MPS_AI_M version: 7.MPS_CPU_M version: 4.MPS_AI_S version: 7.MPS_CPU_S version: 4.
2025-07-17 1:56:13 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 67.
2025-07-17 1:56:13 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 1:56:16 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 1:56:16 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:67, :0.
2025-07-17 1:56:21 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 1:56:21 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 1:56:24 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 1:56:24 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:57, T_CORE_M:69, T_CORE_S:32767, :0.
2025-07-17 1:58:20 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:20 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:58:20 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:23 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:58:23 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:58:35 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:58:35 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 1:58:35 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 1:58:35 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 1:58:35 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:35 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 1:59:2 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:1:35 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 70.
2025-07-17 2:1:35 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:36 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 69.
2025-07-17 2:1:36 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:37 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 69.
2025-07-17 2:1:37 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:44 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 68.
2025-07-17 2:1:44 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:47 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 68.
2025-07-17 2:1:47 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:48 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 2:1:48 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:68, :0.
2025-07-17 2:1:50 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 67.
2025-07-17 2:1:50 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:51 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 2:9:7 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:7 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 2:9:8 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:9 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 2:9:9 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 2:9:26 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 2:9:26 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 2:9:26 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2_S Volt current value is 0(0.01V).
2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 2:9:26 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:26 NORMAL:[voltage_monitor:584] HVCC_1V2_S Volt current value is 119(0.01V).
2025-07-17 2:9:27 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 2:9:52 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:53 NORMAL:[write_vrd_version_to_log:173] MPS_AI_M version: 7.MPS_CPU_M version: 4.MPS_AI_S version: 7.MPS_CPU_S version: 4.
2025-07-17 2:12:29 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 2:12:29 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:29 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:29 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:66, :0.
2025-07-17 2:12:36 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:36 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:37 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:37 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:37 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 2:12:37 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:68, T_CORE_S:32767, :0.
2025-07-17 2:14:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:36 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 2:14:36 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 2:14:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:36 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 2:14:49 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 2:14:49 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 2:14:49 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 2:14:49 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 2:14:49 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:51 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 2:15:16 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.

According to the log, HVCC_1V2 is read as 0 at the moment the event alarm is raised:
MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
Each alarm is preceded by a pmu reset message in the log.
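
The correlation between pmu reset messages and 0 V readings can be checked mechanically over the whole maintenance log. This is a rough sketch, assuming the line layout shown above; the 5-second window and the helper names are arbitrary choices for illustration.

```python
import re
from datetime import datetime

# Timestamps in this log are not zero-padded (e.g. "2:9:7"), hence \d{1,2}.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{1,2}:\d{1,2}:\d{1,2}) (\w+):\[(\w+):\d+\] (.*)$"
)

# Lines copied from the maintenance log above.
SAMPLE = [
    "2025-07-17 2:9:26 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.",
    "2025-07-17 2:9:26 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.",
    "2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2_S Volt current value is 0(0.01V).",
    "2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).",
    "2025-07-17 2:9:27 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).",
]


def parse(line):
    """Return (timestamp, level, function, text), or None for unrecognised lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, level, func, text = m.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), level, func, text


def zero_readings_after_pmu_reset(lines, window_s=5):
    """List (rail, seconds since last pmu reset) for each 0 V reading in the window."""
    hits, last_reset = [], None
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        when, level, func, text = rec
        if func == "pmu_reset_event_handle":
            last_reset = when
        elif func == "voltage_monitor" and level == "MAJOR" and " is 0(" in text:
            if last_reset is not None:
                delta = (when - last_reset).total_seconds()
                if 0 <= delta <= window_s:
                    hits.append((text.split()[0], delta))
    return hits


if __name__ == "__main__":
    print(zero_readings_after_pmu_reset(SAMPLE))
```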

What is the alarm configuration?

The alarm is configured in the file vpd\vendor\Huawei\Gpu\14140130_19e5d500_02000110.sr, where it is defined as follows:
    "Event_VoltLower": {
        "EventKeyId": "PcieCard.PCIeCardVoltageLower",
        "Reading": "<=/NPUCard_1.FaultState |> expr(($1 & 1024) == 0 ? 0 : 1)",
        "Condition": 1,
        "OperatorId": 5,
        "Component": "#/Component_PCIeCard",
        "DescArg2": "#/PCIeDevice_1.SlotID",
        "DescArg3": "#/PCIeCard_1.BoardName",
        "DescArg4": "#/NPUCard_1.FaultCode"
    },
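
The Reading expression tests a single bit of NPUCard_1.FaultState: 1024 is 1 << 10. As a small illustration, here is a Python mirror of that expression. It assumes C-like integer semantics for `expr`, which the thread does not confirm, and the names `VOLT_LOWER_MASK` and `volt_lower_reading` are invented for this sketch. How the reading is then compared against "Condition": 1 via "OperatorId": 5 is not explained in the thread and is not modelled here.

```python
VOLT_LOWER_MASK = 1024  # 1 << 10, the mask used in the Reading expression


def volt_lower_reading(fault_state: int) -> int:
    """Mirror of expr(($1 & 1024) == 0 ? 0 : 1), assuming C-like semantics."""
    return 0 if (fault_state & VOLT_LOWER_MASK) == 0 else 1


if __name__ == "__main__":
    for fault_state in (0, 512, 1024, 1024 | 512):
        print(hex(fault_state), volt_lower_reading(fault_state))
```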

The alarm.log from the time the alarm was triggered contains the corresponding reading values:

Event_VoltLower_0101010902:[31:10–1][31:12–0][36:30–1][36:33–0][52:46–1][52:48–0][08:51–1][08:53–0][14:12–1][14:15–0]
Event_VoltLower_0101010902:[35:43–1][35:45–0][41:07–1][41:09–0][02:41–1][02:43–0][08:02–1][08:04–0][13:25–1][13:27–0]
Event_VoltLower_0101010902:[34:52–1][34:54–0][51:00–1][51:02–0][56:21–1][56:24–0][07:05–1][07:07–0][12:28–1][12:30–0]
Event_VoltLower_0101010902:[50:21–1][50:24–0][55:42–1][55:44–0][01:06–1][01:09–0][11:50–1][11:53–0][17:13–1][17:19–0]
Event_VoltLower_0101010902:[49:45–1][49:48–0][55:08–1][55:11–0][00:34–1][00:36–0][05:59–1][06:01–0][16:41–1][16:43–0]
Event_VoltLower_0101010902:[27:33–1][27:35–0][32:54–1][32:56–0][38:21–1][38:25–0][43:47–1][43:49–0][49:08–1][49:10–0]
Event_VoltLower_0101010902:[37:31–1][37:33–0][48:28–1][48:30–0][53:55–1][53:57–0][04:39–1][04:41–0][10:01–1][10:03–0]
Event_VoltLower_0101010902:[42:30–1][42:32–0][53:15–1][53:17–0][03:58–1][04:00–0][09:27–1][09:29–0][14:52–1][14:54–0]

Has a solution to this problem been found since? We are now running into the same issue.

+1, we have the same problem.

+1, we have the same problem too.

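There is no solution yet. The current suspicion is that the problem is related to the data reported by the Atlas card itself. What we have tried so far is upgrading the card's MCU firmware to the latest version on the official website. This improved things markedly: only 2 alarms in 99 DC cycles. The problem still exists, though.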

For now, adding an Entity_GPUCard.PowerState check to Event_VoltUpper and Event_VoltLower, so that the alarm is only evaluated in the powered-on state, resolves this kind of problem.

    "Event_VoltUpper": {
        "EventKeyId": "PcieCard.PCIeCardVoltageUpper",
        "Reading": "<=/NPUCard_1.FaultState;<=/Entity_GPUCard.PowerState |> expr((($1 & 512) || (($2) == 0)) == 0 ? 0 : 1)",
        "Condition": 1,
        "OperatorId": 5,
        "Component": "#/Component_PCIeCard",
        "DescArg2": "#/PCIeDevice_1.SlotID",
        "DescArg3": "#/PCIeCard_1.BoardName",
        "DescArg4": "#/NPUCard_1.FaultCode"
    },
    "Event_VoltLower": {
        "EventKeyId": "PcieCard.PCIeCardVoltageLower",
        "Reading": "<=/NPUCard_1.FaultState;<=/Entity_GPUCard.PowerState |> expr((($1 & 1024) || (($2) == 0)) == 0 ? 0 : 1)",
        "Condition": 1,
        "OperatorId": 5,
        "Component": "#/Component_PCIeCard",
        "DescArg2": "#/PCIeDevice_1.SlotID",
        "DescArg3": "#/PCIeCard_1.BoardName",
        "DescArg4": "#/NPUCard_1.FaultCode"
    },

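    "Entity_GPUCard": {
        "Id": 11,
        "Name": "PCIe Card",
        "PowerState": 1,
        "Presence": 1,
        "Instance": 101
    },

In the SR, PowerState is hard-coded to 1, so your expression should not be able to restrict detection to the powered-on state. Also, if PowerState were 0, the expression would always evaluate to 1, which does not seem to match the intended goal.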

If PowerState is 0, the expression evaluates to 0. In our testing, this change has proven effective.
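
The two replies above disagree on what the modified expression yields when PowerState is 0. One way to make the question concrete is to tabulate it under an explicit set of rules. The sketch below assumes C-style operator semantics (`||` yields 1 if either operand is non-zero, otherwise 0). The thread does not confirm that the BMC's `expr` engine follows these rules, so this is a model for discussion, not a statement of how the firmware behaves. The function names are hypothetical.

```python
def volt_lower_reading_gated(fault_state: int, power_state: int) -> int:
    """expr((($1 & 1024) || (($2) == 0)) == 0 ? 0 : 1) under assumed C rules."""
    either = 1 if ((fault_state & 1024) != 0 or power_state == 0) else 0
    return 0 if either == 0 else 1


def truth_table():
    """Map (FaultState, PowerState) to the reading for the four corner cases."""
    return {
        (fault_state, power_state): volt_lower_reading_gated(fault_state, power_state)
        for fault_state in (0, 1024)
        for power_state in (0, 1)
    }


if __name__ == "__main__":
    for (fault_state, power_state), reading in truth_table().items():
        print(f"FaultState={fault_state:4d} PowerState={power_state} -> {reading}")
```

Under these assumed rules, the rows with PowerState equal to 0 evaluate to 1. Since the poster reports a result of 0 and a measured improvement, the engine's actual operator semantics, or the runtime value of PowerState, may differ from this model; running the same four cases on the BMC would settle it.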

One more note: during our testing we found that, besides the 300I Duo, the 300vpro also shows this problem. Should this be treated as a common issue?