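After the system is restarted with a DC cycle, the BMC reports alarm 0x0800007D, "The PCIe card 5 () voltage is too low", which clears after about 2 seconds.
The issue reproduces on almost every cycle.
The SEL log is as follows: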
583 |2025-07-17 02:14:54 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
582 |2025-07-17 02:14:52 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
581 |2025-07-17 02:14:51 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
580 |2025-07-17 02:14:51 |Normal |0x2C000011 |Asserted |The host was restarted by command.
579 |2025-07-17 02:14:36 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
578 |2025-07-17 02:09:29 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
577 |2025-07-17 02:09:28 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
576 |2025-07-17 02:09:27 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
575 |2025-07-17 02:09:27 |Normal |0x2C000011 |Asserted |The host was restarted by command.
574 |2025-07-17 02:09:08 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
573 |2025-07-17 02:04:00 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7563, 7565, 7567, 7569, 7571, 7573, 7575, 7579, 7581, 7583, 7585, 7587, 7589, 7591.
572 |2025-07-17 02:03:58 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
571 |2025-07-17 02:03:58 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7563, 7565, 7567, 7569, 7571, 7573, 7575, 7579, 7581, 7583, 7585, 7587, 7589, 7591.
570 |2025-07-17 02:03:58 |Normal |0x2C000011 |Asserted |The host was restarted by command.
569 |2025-07-17 02:03:43 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
568 |2025-07-17 01:58:37 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
567 |2025-07-17 01:58:36 |Normal |0x2C000011 |Asserted |The host was restarted by command.
566 |2025-07-17 01:58:21 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
565 |2025-07-17 01:53:17 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.
564 |2025-07-17 01:53:15 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
563 |2025-07-17 01:53:15 |Normal |0x2C000011 |Asserted |The host was restarted by command.
562 |2025-07-17 01:53:15 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.
561 |2025-07-17 01:53:00 |Normal |0x2C00000B |Asserted |ACPI is in the soft-off state.
560 |2025-07-17 01:47:53 |Normal |0x2C000009 |Asserted |ACPI is in the working state.
559 |2025-07-17 01:47:53 |Normal |0x2C000011 |Asserted |The host was restarted by command.
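The "clears after about 2 seconds" observation can be checked directly from the SEL entries above by pairing each 0x0800007D assertion with the following 0x0800007E deassertion. The following is only a minimal sketch: the helper names (`parse_sel_line`, `alarm_durations`) are hypothetical, and the pipe-separated field layout is assumed from the entries shown here, not from any BMC specification.

```python
from datetime import datetime

ASSERT_CODE = "0x0800007D"    # event code on the "Asserted" entries above
DEASSERT_CODE = "0x0800007E"  # event code on the "Deasserted" entries above


def parse_sel_line(line):
    """Split one SEL line into fields; return None if it does not fit the layout."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 6:
        return None
    try:
        seq = int(parts[0])
        ts = datetime.strptime(parts[1], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return {"seq": seq, "time": ts, "severity": parts[2],
            "code": parts[3], "state": parts[4], "message": parts[5]}


def alarm_durations(lines):
    """Return (assert_time, seconds_until_deassert) for each alarm pair."""
    records = [r for r in map(parse_sel_line, lines) if r is not None]
    records.sort(key=lambda r: r["seq"])  # the SEL dump is newest-first
    pending = None
    durations = []
    for rec in records:
        if rec["code"] == ASSERT_CODE:
            pending = rec
        elif rec["code"] == DEASSERT_CODE and pending is not None:
            delta = (rec["time"] - pending["time"]).total_seconds()
            durations.append((pending["time"], delta))
            pending = None
    return durations


if __name__ == "__main__":
    sample = [
        "583 |2025-07-17 02:14:54 |Major |0x0800007E |Deasserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
        "582 |2025-07-17 02:14:52 |Major |0x0800007D |Asserted |The PCIe card 5 () voltage is too low. Error code: 7587.",
    ]
    for start, seconds in alarm_durations(sample):
        print(start, seconds)
```

Sorting by the sequence number rather than the timestamp keeps the pairing stable when several entries share the same second, as they do in this log.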
The maintenance_log_20250717021605.bin log from the Atlas 300I Duo card is as follows:
2025-07-17 1:47:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:47:36 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:47:36 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:47:37 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:47:37 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:47:51 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:53:0 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:0 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:53:0 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:2 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:53:2 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:53:14 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:53:14 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 1:53:14 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 1:53:14 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:14 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 1:53:14 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 1:53:40 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:53:40 NORMAL:[write_vrd_version_to_log:173] MPS_AI_M version: 7.MPS_CPU_M version: 4.MPS_AI_S version: 7.MPS_CPU_S version: 4.
2025-07-17 1:56:13 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 67.
2025-07-17 1:56:13 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 1:56:16 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 1:56:16 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:67, :0.
2025-07-17 1:56:21 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 1:56:21 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 1:56:24 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 1:56:24 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:57, T_CORE_M:69, T_CORE_S:32767, :0.
2025-07-17 1:58:20 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:20 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 1:58:20 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:23 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 1:58:23 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 1:58:35 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 1:58:35 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 1:58:35 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 1:58:35 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 1:58:35 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 1:58:35 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 1:59:2 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:1:35 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 70.
2025-07-17 2:1:35 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:36 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 69.
2025-07-17 2:1:36 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:37 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 69.
2025-07-17 2:1:37 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:44 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 68.
2025-07-17 2:1:44 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:70, T_CORE_S:32767, :0.
2025-07-17 2:1:47 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 68.
2025-07-17 2:1:47 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:48 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 2:1:48 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:68, :0.
2025-07-17 2:1:50 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 67.
2025-07-17 2:1:50 MAJOR:[record_board_temp:2425] T_LM75A:48, T_LM75B:59, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:1:51 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 69.
2025-07-17 2:9:7 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:7 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 2:9:8 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:9 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 2:9:9 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 2:9:26 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 2:9:26 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 2:9:26 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2_S Volt current value is 0(0.01V).
2025-07-17 2:9:26 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 2:9:26 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:26 NORMAL:[voltage_monitor:584] HVCC_1V2_S Volt current value is 119(0.01V).
2025-07-17 2:9:27 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 2:9:52 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:9:53 NORMAL:[write_vrd_version_to_log:173] MPS_AI_M version: 7.MPS_CPU_M version: 4.MPS_AI_S version: 7.MPS_CPU_S version: 4.
2025-07-17 2:12:29 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 2:12:29 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:29 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:29 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:66, :0.
2025-07-17 2:12:36 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:36 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:37 NORMAL:[D_chip_temp_monitor:679] AICORE_M_temp temp current value is 68.
2025-07-17 2:12:37 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:32767, T_CORE_S:32767, :0.
2025-07-17 2:12:37 NORMAL:[D_chip_temp_monitor:679] AICORE_S_temp temp current value is 66.
2025-07-17 2:12:37 MAJOR:[record_board_temp:2425] T_LM75A:47, T_LM75B:57, T_CORE_M:68, T_CORE_S:32767, :0.
2025-07-17 2:14:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:36 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 1.
2025-07-17 2:14:36 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0xe3.
2025-07-17 2:14:36 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:36 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xfd.
2025-07-17 2:14:49 MINOR:[accessor_read_mux_signal:2132] mux signal 0 occured 0x73.
2025-07-17 2:14:49 MINOR:[accessor_read_mux_signal:2132] mux signal 1 occured 0xff.
2025-07-17 2:14:49 NORMAL:[pmu_reset_event_handle:316] pmu reset occured 0.
2025-07-17 2:14:49 MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
2025-07-17 2:14:49 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
2025-07-17 2:14:51 NORMAL:[voltage_monitor:584] HVCC_1V2 Volt current value is 119(0.01V).
2025-07-17 2:15:16 NORMAL:[dev_out_reset_event_handle:310] dev_out reset occured.
According to the logs, whenever the alarm event is raised, HVCC_1V2 is measured as 0:
MAJOR:[voltage_monitor:584] HVCC_1V2 Volt current value is 0(0.01V).
Each alarm is preceded by a pmu reset message in the log.
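To check this correlation across the whole maintenance log rather than by eye, a small script can list every 0 reading from voltage_monitor together with the most recent pmu reset line before it. This is a rough sketch under stated assumptions: the line layout is inferred from the excerpt above, the function names are hypothetical, and the meaning of the trailing 0/1 on the pmu reset line is not given in the log, so it is only recorded, not interpreted.

```python
import re
from datetime import datetime

# Timestamps in this log are not zero-padded (e.g. "1:53:0"), so parse them by hand.
LINE_RE = re.compile(
    r"^(\d{4})-(\d{1,2})-(\d{1,2}) (\d{1,2}):(\d{1,2}):(\d{1,2}) "
    r"(\w+):\[(\w+):\d+\] (.*)$"
)
VOLT_RE = re.compile(r"(\S+) Volt current value is (\d+)\(0\.01V\)")
PMU_RE = re.compile(r"pmu reset occured (\d+)")  # spelling as it appears in the log


def parse_line(line):
    """Return (timestamp, level, function, text), or None for unrecognised lines."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    ts = datetime(*(int(g) for g in m.groups()[:6]))
    return ts, m.group(7), m.group(8), m.group(9)


def to_volts(raw):
    """Convert a raw reading in units of 0.01 V to volts."""
    return raw / 100


def zero_volt_readings(lines):
    """List each 0 V reading with the last pmu reset line seen before it."""
    last_pmu = None  # (timestamp, trailing flag) of the latest pmu reset line
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _level, func, text = parsed
        pmu = PMU_RE.search(text)
        if func == "pmu_reset_event_handle" and pmu:
            last_pmu = (ts, int(pmu.group(1)))
            continue
        volt = VOLT_RE.search(text)
        if func == "voltage_monitor" and volt and int(volt.group(2)) == 0:
            found.append({
                "time": ts,
                "rail": volt.group(1),
                "pmu_flag": last_pmu[1] if last_pmu else None,
                "gap_s": (ts - last_pmu[0]).total_seconds() if last_pmu else None,
            })
    return found
```

Running `zero_volt_readings` over the decoded log text gives one record per 0 reading, including the rail name (HVCC_1V2 or HVCC_1V2_S) and the number of seconds since the preceding pmu reset line. `to_volts` converts the raw readings, so the recovered value of 119 corresponds to 1.19 V.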