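Problem description:
While power supplies and fans are being plugged and unplugged, a memory UCE (uncorrectable error) is triggered intermittently and the OS hangs. The reported alarm log is as follows: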
DIMM101(CpuBoard1 DIMM101) triggered an uncorrectable error.
DIMM100(CpuBoard1 DIMM100) triggered an uncorrectable error.
The related app.log entries are as follows:
2026-04-27 06:46:00.408010 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:02.216935 fault_diagnosis NOTICE: fpc_diagnose.lua(566): [IFMM]: memory fault diagnosed, final_type: 0004
2026-04-27 06:46:02.221436 fault_diagnosis NOTICE: soc2_ras.lua(361): fpc_process: need bank_adddc
2026-04-27 06:46:02.223111 fault_diagnosis NOTICE: soc2_ras.lua(368): fpc_process: ADDDC is disable
2026-04-27 06:46:03.410722 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:03.704724 fault_diagnosis NOTICE: fpc_diagnose.lua(566): [IFMM]: memory fault diagnosed, final_type: 0004
2026-04-27 06:46:03.723699 fault_diagnosis NOTICE: soc2_ras.lua(361): fpc_process: need bank_adddc
2026-04-27 06:46:03.726604 fault_diagnosis NOTICE: soc2_ras.lua(368): fpc_process: ADDDC is disable
2026-04-27 06:46:03.871048 pcie_device ERROR: device_service.lua(1117): method_get_device_name fail
2026-04-27 06:46:03.873602 fault_diagnosis ERROR: silk.lua(156): get pcie silk fail: 1
2026-04-27 06:46:04.505892 ras ERROR: fdm_arm_address_translate.c(2611): [system_physical_addr_to_ddr_addr]input null or invalid ddr_addr.
2026-04-27 06:46:04.665898 pcie_device ERROR: device_service.lua(1117): method_get_device_name fail
2026-04-27 06:46:04.667286 fault_diagnosis ERROR: silk.lua(156): get pcie silk fail: 1
2026-04-27 06:46:05.251956 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:08.802588 fault_diagnosis NOTICE: fpc_diagnose.lua(566): [IFMM]: memory fault diagnosed, final_type: 0004
2026-04-27 06:46:08.811155 fault_diagnosis NOTICE: soc2_ras.lua(361): fpc_process: need bank_adddc
2026-04-27 06:46:08.814226 fault_diagnosis NOTICE: soc2_ras.lua(368): fpc_process: ADDDC is disable
2026-04-27 06:46:08.911056 fault_diagnosis ERROR: memory_ras.lua(834): mem params is nil
2026-04-27 06:46:09.724211 fault_diagnosis NOTICE: fpc_diagnose.lua(566): [IFMM]: memory fault diagnosed, final_type: 0001
2026-04-27 06:46:09.728369 fault_diagnosis NOTICE: soc2_ras.lua(207): fpc_process: need pageoffline
2026-04-27 06:46:09.730810 fault_diagnosis NOTICE: soc2_ras.lua(68): fpc_process: pageoffline switch is disable
2026-04-27 06:46:09.970350 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:10.028104 fault_diagnosis NOTICE: fpc_diagnose.lua(566): [IFMM]: memory fault diagnosed, final_type: 0001
2026-04-27 06:46:10.032711 fault_diagnosis NOTICE: soc2_ras.lua(207): fpc_process: need pageoffline
2026-04-27 06:46:10.035540 fault_diagnosis NOTICE: soc2_ras.lua(68): fpc_process: pageoffline switch is disable
2026-04-27 06:46:11.623262 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:12.211067 ras ERROR: fdm_arm_address_translate.c(2611): [system_physical_addr_to_ddr_addr]input null or invalid ddr_addr.
2026-04-27 06:46:16.090080 fault_diagnosis ERROR: memory_ras.lua(834): mem params is nil
2026-04-27 06:46:16.166207 ras ERROR: fdm_arm_address_translate.c(2611): [system_physical_addr_to_ddr_addr]input null or invalid ddr_addr.
2026-04-27 06:46:17.189792 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:18.172522 redfish NOTICE: alarm.lua(600): received a event[124-0].
2026-04-27 06:46:19.290304 unknown_service ERROR: bt.lua(84): bt[0], ioctl(BT_CMD_WRITE) failed: Connection timed out, val: , ret: -1
2026-04-27 06:46:19.894319 ras ERROR: fdm_arm_address_translate.c(2611): [system_physical_addr_to_ddr_addr]input null or invalid ddr_addr.
2026-04-27 06:46:25.392695 event NOTICE: events.lua(114): System critical count change 0 to 1 by [Event_Memory101UCE_010101].
2026-04-27 06:46:25.398168 event NOTICE: events.lua(127): System Health change Minor to Critical.
2026-04-27 06:46:25.448332 event NOTICE: hardware_event.lua(570): Event_Memory101UCE_010101|{"source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Memory/Memory_10_010101","Service":"bmc.kepler.compute","Interface":"bmc.kepler.Systems.Memory","Property":"DiagnosticFault"}]},"value":[1],"type":"synchronization"}
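To make the excerpt above easier to read, here is a minimal Python sketch that counts entries per module and source file and tallies the `final_type` values. The line layout it parses is an assumption inferred only from the lines shown here, and the names `LINE_RE`, `FINAL_TYPE_RE`, and `summarize` are hypothetical helpers, not part of the BMC software. It does not interpret what each `final_type` value means.

```python
import re
from collections import Counter

# Assumed layout, inferred only from the excerpt:
# "<date> <time> <module> <LEVEL>: <file>(<line>): <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<module>\S+) (?P<level>[A-Z]+): "
    r"(?P<src>[\w.]+)\((?P<lineno>\d+)\): (?P<msg>.*)$"
)
FINAL_TYPE_RE = re.compile(r"final_type: (\w+)")


def summarize(lines):
    """Count entries per (module, source file, level) and per final_type."""
    per_source = Counter()
    final_types = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        per_source[(m["module"], m["src"], m["level"])] += 1
        ft = FINAL_TYPE_RE.search(m["msg"])
        if ft:
            final_types[ft.group(1)] += 1
    return per_source, final_types


# Three lines copied from the excerpt, used as sample input.
sample = [
    "2026-04-27 06:46:02.216935 fault_diagnosis NOTICE: fpc_diagnose.lua(566): "
    "[IFMM]: memory fault diagnosed, final_type: 0004",
    "2026-04-27 06:46:09.724211 fault_diagnosis NOTICE: fpc_diagnose.lua(566): "
    "[IFMM]: memory fault diagnosed, final_type: 0001",
    "2026-04-27 06:46:04.505892 ras ERROR: fdm_arm_address_translate.c(2611): "
    "[system_physical_addr_to_ddr_addr]input null or invalid ddr_addr.",
]

per_source, final_types = summarize(sample)
for key, count in per_source.most_common():
    print(key, count)
print(dict(final_types))
```

Running the same function over the full app.log (for example `summarize(open("app.log", encoding="utf-8"))`) would show how often each module logged during the fault window.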
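I would like to ask how RAS error reporting actually works. Are RAS faults sent by the BIOS over BT? And how is the information from the BIOS mapped to a specific DIMM? The log does not print any specific CPU number, channel number, or DIMM number.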