Intermittent failure to read RAID card temperature, Disks Temp, and NIC temperature during DC stability testing

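Problem description

During DC stability testing, the RAID card temperature and the Disks Temp sensor intermittently cannot be read, and the NIC temperature cannot be read either.

Environment

  • Operating system: [e.g. Ubuntu 24.04]

  • Software version: OpenUBMC2512

  • Hardware configuration: [e.g. CPU, memory]

Steps to reproduce

  1. Run the Power Cycle command from the OS

  2. Wait for the system to finish booting into the OS

  3. Query the sensor readings

Expected result

No sensor anomalies

Actual result

On the second power cycle, the RAID card temperature and Disks Temp could not be read, and an alarm was raised for failure to read the NIC temperature.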

We suspect a problem on the link.

The app log contains the following output, including get_pd_list failed ret = 4098:

2026-02-26 11:24:28.036599 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:24:28.036650 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098
2026-02-26 11:24:28.038104 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098
2026-02-26 11:24:38.997641 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:24:53.073144 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:24:53.073194 hardware ERROR: hs_ctrl.c(1308): get_bbu_info failed, ret = 4098
2026-02-26 11:24:58.200837 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 4, err: write failed, fd = 23, Timer expired
2026-02-26 11:25:18.093377 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:25:18.093538 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098
2026-02-26 11:25:18.094814 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098
2026-02-26 11:25:19.219253 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:25:28.402880 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 3, err: write failed, fd = 23, Timer expired [repeated 5 times in 297s from 2026-02-26 11:19:15.480722 to 2026-02-26 11:24:12.840341][flush]
2026-02-26 11:25:32.291580 bmc_network NOTICE: dhcp_process.lua(1255): the field composed by dhcpv6 vendor class is empty string
2026-02-26 11:25:32.295989 bmc_network NOTICE: dhcp_process.lua(1016): eth2: no available prefix info, send_rs_and_parse_ra now
2026-02-26 11:25:43.111432 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:25:43.111484 hardware ERROR: hs_ctrl.c(385): histore_get_ctrl_info get_ctrl_info failed, ret = 4098
2026-02-26 11:25:59.287451 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:26:08.135842 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:26:33.160625 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:26:33.160674 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098
2026-02-26 11:26:33.161477 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098
2026-02-26 11:26:39.347815 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:26:46.284816 bmc_network NOTICE: dhcp_process.lua(1255): the field composed by dhcpv6 vendor class is empty string
2026-02-26 11:26:46.290582 bmc_network NOTICE: dhcp_process.lua(1016): eth2: no available prefix info, send_rs_and_parse_ra now
2026-02-26 11:26:58.186340 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:26:58.186392 hardware ERROR: hs_ctrl.c(1308): get_bbu_info failed, ret = 4098
2026-02-26 11:27:04.201009 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 1, err: write failed, fd = 23, Timer expired [repeated 45 times in 308s from 2026-02-26 11:21:56.760579 to 2026-02-26 11:27:04.201009]
2026-02-26 11:27:19.562302 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:27:23.210398 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:27:23.211368 storage ERROR: controller_object.lua(1127): get_ctrl_ld_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098
2026-02-26 11:27:24.412767 event NOTICE: events.lua(106): System minor count change 0 to 1 by [Event_TempFail_0101010103].
2026-02-26 11:27:24.414505 event NOTICE: events.lua(127): System Health change Normal to Minor.
2026-02-26 11:27:24.434997 event NOTICE: hardware_event.lua(570): Event_TempFail_0101010103|{"value":[1],"source":{"properties":[{"Interface":"bmc.kepler.Systems.NetworkAdapter","Path":"/bmc/kepler/Systems/1/NetworkAdapters/NetworkAdapter_1_0101010103","Service":"bmc.kepler.network_adapter","Property":"TemperatureStatus"}]},"type":"synchronization"}
2026-02-26 11:27:24.548909 event NOTICE: abstract_event.lua(241): [Event_TempFail_0101010103] generate an event [assert] while Reading change to [1].
2026-02-26 11:27:24.555878 redfish NOTICE: alarm.lua(600): received a event[6-0].
2026-02-26 11:27:24.557276 event_policy NOTICE: synchronizer.lua(252): received a event[6-0].
2026-02-26 11:27:48.240541 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:27:48.240588 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098
2026-02-26 11:27:48.241422 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098
2026-02-26 11:27:59.630960 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:27:59.640350 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 2, err: write failed, fd = 23, Timer expired [repeated 12 times in 312s from 2026-02-26 11:22:47.160320 to 2026-02-26 11:27:59.640350]
2026-02-26 11:28:00.148601 bmc_network NOTICE: dhcp_process.lua(1255): the field composed by dhcpv6 vendor class is empty string
2026-02-26 11:28:00.149933 bmc_network NOTICE: dhcp_process.lua(1016): eth2: no available prefix info, send_rs_and_parse_ra now
2026-02-26 11:28:13.260769 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:28:13.260822 hardware ERROR: hs_ctrl.c(385): histore_get_ctrl_info get_ctrl_info failed, ret = 4098
2026-02-26 11:28:38.282263 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:28:38.282318 hardware ERROR: hs_ctrl.c(497): histore_get_ctrl_phy_err_count failed, CtrlId = 0, return 0x1002
2026-02-26 11:28:39.783547 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:29:03.300628 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:29:03.300683 hardware ERROR: hs_ctrl.c(1308): get_bbu_info failed, ret = 4098
2026-02-26 11:29:09.324152 sensor NOTICE: sensor_instance.lua(1170): sensor [ThresholdSensor_CPU1TADVFS_010101] threshold capability can not be supported. [repeated 33 times in 304s from 2026-02-26 11:24:05.627951 to 2026-02-26 11:29:09.324152]
2026-02-26 11:29:09.324435 sensor NOTICE: sensor_instance.lua(1174): sensor [ThresholdSensor_CPU1TADVFS_010101] hysteresis capability can not be supported. [repeated 33 times in 304s from 2026-02-26 11:24:05.628231 to 2026-02-26 11:29:09.324435]
2026-02-26 11:29:14.163404 bmc_network NOTICE: dhcp_process.lua(1255): the field composed by dhcpv6 vendor class is empty string
2026-02-26 11:29:14.163908 bmc_network NOTICE: dhcp_process.lua(1016): eth2: no available prefix info, send_rs_and_parse_ra now
2026-02-26 11:29:19.821470 general_hardware NOTICE: fructl_handler.lua(76): get_power_state: system[1] get power power ON
2026-02-26 11:29:25.320651 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 3, err: write failed, fd = 23, Timer expired
2026-02-26 11:29:28.320621 hardware ERROR: hs_misc.c(678): process_histore_cmd: process_histore_cmd return error 0xffffffff
2026-02-26 11:29:28.320674 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098
2026-02-26 11:29:28.321695 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098

The framework log repeatedly shows hardware: org.freedesktop.DBus.Error.NoReply (Did not receive a reply):

2026-02-26 11:10:17.163508 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:10:42.192742 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:11:07.217134 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:11:32.242057 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:11:57.271235 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:12:22.319395 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:12:47.340984 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:13:12.359893 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:13:37.381646 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:14:02.400865 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:14:27.423956 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:14:52.444198 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:15:17.465403 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:15:42.498017 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:16:07.523149 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:16:32.550662 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:16:57.585729 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:17:22.608165 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:17:47.637705 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:18:12.662965 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:18:37.680945 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:19:02.711165 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:19:27.733222 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:19:52.761220 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:20:17.784662 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:20:42.809641 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:21:07.834126 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:21:32.855939 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:21:57.875472 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:22:22.900779 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:22:47.931385 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:23:12.960325 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:23:37.980885 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:24:03.007825 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:24:28.036950 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:24:53.073477 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:25:18.093376 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:25:43.111768 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:26:08.136177 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:26:33.160967 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:26:58.186878 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:27:23.210744 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:27:48.240897 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:28:13.261200 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:28:38.282647 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:29:03.300973 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:29:28.320969 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2026-02-26 11:29:53.350925 [:0000000f] hardware: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
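
The NoReply entries above are evenly spaced. The following is a minimal Python sketch, not part of the original post, that measures the gap between consecutive entries using the first four timestamps copied from the framework log. The function name `intervals` is purely illustrative.

```python
from datetime import datetime

# Timestamps copied from the first four NoReply lines of the framework log.
stamps = [
    "2026-02-26 11:10:17.163508",
    "2026-02-26 11:10:42.192742",
    "2026-02-26 11:11:07.217134",
    "2026-02-26 11:11:32.242057",
]


def intervals(ts):
    """Return the gaps in seconds between consecutive timestamps."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S.%f") for t in ts]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


gaps = intervals(stamps)
print([round(g, 3) for g in gaps])
```

Each gap is about 25 seconds, and the NoReply timestamps line up with the process_histore_cmd errors in the app log (for example 11:24:28.036). A 25-second period matches the default method-call timeout in libdbus, so one plausible reading is that a single periodic call is timing out each time. That interpretation is an assumption and the log alone does not confirm it.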

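Attempted solutions

Running the Power Cycle command again restored normal operation.

The full log is attached:

R410KV2_2102315PFSD9S1100004_20260226-1213_网卡温度告警.tar.gz.txt (6.1 MB)

The answer below was generated by AI and is for reference only: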

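Summary

| Item | Details |
| --- | --- |
| Nature of the problem | Startup timing mismatch between the BMC and peripherals (RAID card, NIC) causes communication timeouts |
| Main symptoms | Temperature sensors show No Reading, MCTP writes time out, D-Bus calls get no reply |
| Core cause | The BMC issues data collection requests too early |
| Direction of the fix | Delayed collection + debounce mechanism + hardware link optimization |
| Recommended measures | Set FailureDebounceCount=10 and start collection 30 seconds after startup |

Recommendation: add this issue to the test case library as a typical "sensor stability during startup" scenario, and push for the BMC framework layer to manage peripheral initialization synchronization in a unified way.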

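Analysis and solutions for failures to read RAID card, disk, and NIC temperatures during DC stability testing


Problem description

During DC stability testing (DC power cycling, that is, powering the system off and back on), the RAID card temperature, the disk temperature (Disks Temp), and the NIC temperature intermittently cannot be read after a Power Cycle. The symptoms are:

  • The RAID card and disk temperature sensors report No Reading.
  • The system raises the alarm “Failed to obtain data of the PCIe Card temperature”.
  • The application log frequently shows the errors get_pd_list failed ret = 4098 and process_histore_cmd return error 0xffffffff.
  • The D-Bus log contains many org.freedesktop.DBus.Error.NoReply timeout errors.
  • Multiple MCTP writes fail with Timer expired.

Although running Power Cycle again clears the problem, it seriously undermines the reliability of system monitoring and of stability verification.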


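Root cause analysis

Based on the knowledge base, the system logs, and how the software and hardware interact, the root causes are as follows:

1. RAID card and disk temperature reads fail
  • Key log: get_pd_list failed ret = 4098 indicates that the BMC failed to obtain the list of physical disks under the RAID controller through histore_get_ctrl_pd_list.
  • Related entities:
    • The GetControllerInfo method retrieves RAID controller status and configuration.
    • The Raid Card Information Task may be blocked because the request queue is full.
    • A RAID card temperature read failure is a system alarm, usually caused by abnormal communication between the BMC and the RAID card.
  • Root cause:
    • After a Power Cycle, the RAID card takes a long time to start (especially with many disks attached), while the BMC sends queries too early, so I2C/SMBus communication times out.
    • Poor signal quality, excessive load, or an unsuitable topology on the I2C link also increases the probability of communication failure.
    • Supporting documentation: the RAID Card FAQ Guide states that if only the RAID card temperature cannot be read, replacing the RAID card should be tried first; if several sensors are abnormal, the I2C link may be faulty.
2. NIC temperature reads fail
  • Key log: mctp_write_service failed, err: write failed, fd = 23, Timer expired
  • Related entities:
    • The NIC communicates with the BMC over MCTP and relies on the PCIe link for out-of-band management.
    • When the NetworkAdapter.TemperatureStatus property is 1 (failed), the Event_TempFail alarm is triggered.
    • The write_service.lua error shows that the MCTP write timed out, which suggests that the PCIe link was not ready or the firmware did not respond.
  • Root cause:
    • After a Power Cycle, NIC firmware loading and MCTP channel initialization take time, and the BMC sends commands too early, causing timeouts.
    • The openUBMC 182x Board Management Feature Design Specification explicitly mentions the need to prevent false alarms from temperature read failures during in-band hot upgrade or startup, which shows that this kind of problem is a known challenge.
3. D-Bus communication timeouts
  • Key log: DBus.Error.NoReply: Did not receive a reply
  • Impact: BMC services (such as storage, hardware, and sensor) communicate over D-Bus; timeouts cause tasks to pile up and state synchronization to fail.
  • Possible causes:
    • After a Power Cycle, some daemons start slowly or fail to initialize.
    • System resources (CPU, memory) are tight early in startup, so messages are delayed or lost in the queue.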

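Correlation analysis

| Symptom | Related components | Possible trigger |
| --- | --- | --- |
| RAID card/disk temperature "No Reading" | RAID controller, I2C communication, Storage service | I2C timeout → get_ctrl_pd_list fails → sensor has no data |
| NIC temperature alarm | PCIe NIC, MCTP protocol, NetworkAdapter service | MCTP write timeout → temperature status set to failed → Event_TempFail triggered |
| D-Bus timeout | BMC internal service communication | A service is not ready → messages get no reply → cascading communication failures |

Core conclusion: a system startup timing mismatch is the main cause. The BMC starts polling sensor data before the RAID card and the NIC have finished initializing, which produces a large number of communication failures.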


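Solutions

1. Delay the start of sensor collection
  • Add a sensor collection delay to the BMC startup flow so that polling starts only after the peripherals are fully initialized.
  • Suggested strategy:
    • RAID/disk collection: wait at least 30 seconds after system power-on before starting the Raid Card Information Task.
    • NIC temperature collection: start reading the temperature only after the MCTP link is detected as Link UP.
2. Add retry and debounce mechanisms for communication failures
  • Use FailureDebounceCount (default 10) and SuccessDebounceCount (default 10) as defined in the knowledge base.
  • Strategy:
    • Raise the "temperature read failure" alarm only after 10 consecutive failed reads.
    • Clear the alarm only after 10 consecutive successful reads.
  • Debounce reference: in the CSR Hardware Monitoring Debounce Mechanism document, contbin_H10L5 is a debounce dedicated to RAID card temperature read failures.
3. Check hardware links and configuration
  • Check the I2C link:
    • Confirm that the I2C lines between the RAID card and the BMC are not too long, subject to interference, or overloaded.
    • Capture traffic with a logic analyzer to check for SCL/SDA conflicts.
  • Check the PCIe link:
    • Make sure the NIC PCIe link negotiates successfully (x8/x16) with no speed downgrade or training failure.
    • Check whether the NIC supports MCTP and NCSI (verify through the bmc.kepler.Systems.NetworkAdapter properties).
4. Upgrade drivers and firmware
  • RAID card driver: update the hs_ctrl.c and hs_pd.c driver versions in the storage module to fix possible defects in command timeout handling.
  • NIC firmware: make sure the NIC supports MCTP out-of-band management and has temperature reporting enabled.
5. Optimize task scheduling
  • Prevent multiple Pre-Collection Log Task instances from blocking the Raid Card Information Task.
  • Improve request queue management so that piled-up tasks do not cause timeouts.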
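
The debounce strategy in item 2 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the openUBMC implementation: the class `Debouncer` and its method `update` are hypothetical, and only the two thresholds (10 and 10) come from the answer above.

```python
class Debouncer:
    """Change the alarm state only after N consecutive contradicting samples."""

    def __init__(self, failure_count=10, success_count=10):
        self.failure_count = failure_count  # mirrors FailureDebounceCount
        self.success_count = success_count  # mirrors SuccessDebounceCount
        self.alarm = False                  # False means "no alarm raised"
        self._streak = 0                    # consecutive contradicting samples

    def update(self, read_ok: bool) -> bool:
        """Feed one read result and return the current alarm state."""
        wants_alarm = not read_ok
        if wants_alarm == self.alarm:
            # The sample agrees with the current state, so the streak resets.
            self._streak = 0
            return self.alarm
        self._streak += 1
        needed = self.failure_count if wants_alarm else self.success_count
        if self._streak >= needed:
            self.alarm = wants_alarm
            self._streak = 0
        return self.alarm


d = Debouncer()
states = [d.update(False) for _ in range(10)]
print(states[8], states[9])  # False True
```

With these thresholds, nine failed reads in a row leave the alarm clear and the tenth raises it. A single successful read in between resets the count, which is what keeps a short startup glitch from producing an alarm.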

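Verification

  1. Adjust the reproduction steps
    • Run Power Cycle.
    • Wait 60 seconds, then run ipmitool sdr elist to check the sensor status.
  2. Check the logs
    • Check whether app.log still contains get_pd_list failed or mctp_write_service failed.
    • Check the systemd log to confirm that all BMC services started normally.
  3. Repeat the test in a loop
    • Run Power Cycle 50 times in a row and observe whether the reproduction rate drops significantly.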
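
The log check in step 2 can be scripted. Below is a minimal Python sketch, assuming the log is available as an iterable of text lines; the function name `count_failures` is hypothetical, and the sample lines are copied from the app log in the original post.

```python
# Substrings named in the verification step above.
PATTERNS = ("get_pd_list failed", "mctp_write_service failed")


def count_failures(lines):
    """Count log lines containing each failure substring."""
    counts = {p: 0 for p in PATTERNS}
    for line in lines:
        for p in PATTERNS:
            if p in line:
                counts[p] += 1
    return counts


# Three lines copied from the app log earlier in this post.
sample = [
    "2026-02-26 11:24:28.036650 hardware ERROR: hs_pd.c(102): histore_get_ctrl_pd_list get_pd_list failed ret = 4098",
    "2026-02-26 11:24:28.038104 storage ERROR: controller_object.lua(947): get_ctrl_pd_list failed and return ./opt/bmc/apps/storage/lualib/sml/init.lua:79: 4098",
    "2026-02-26 11:24:58.200837 unknown_service NOTICE: write_service.lua(33): mctp_write_service failed, count: 4, err: write failed, fd = 23, Timer expired",
]

result = count_failures(sample)
print(result)
```

Two caveats apply. The storage-layer line says get_ctrl_pd_list failed, which does not contain the substring get_pd_list failed, so it is not counted. Lines carrying a "[repeated N times ...]" suffix stand for several events, so these raw line counts are a lower bound.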


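References: RAID Card FAQ Guide, CSR Hardware Monitoring Debounce Mechanism, openUBMC 182x Board Management Feature Design Specification, Southbound NIC Driver Adaptation Guide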

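Does DC mean rebooting the OS?
Error code 4098 corresponds to #define SML_ERR_MCTP_READ_WRITE_FAILED (SML_ERR_CODE_BASE + 0x02) // MCTP communication failure
It looks like something went wrong in the MCTP flow. Please check whether the endpoint is working properly.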
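
As a quick arithmetic check of the error code mapping, here is a small Python sketch. The value of SML_ERR_CODE_BASE is not shown anywhere in this thread; 0x1000 below is only what the quoted #define implies if 4098 really is SML_ERR_MCTP_READ_WRITE_FAILED. The helper `implied_base` is hypothetical.

```python
RET_DECIMAL = 4098       # from "get_pd_list failed ret = 4098"
RET_HEX_IN_LOG = 0x1002  # from "histore_get_ctrl_phy_err_count failed ... return 0x1002"
MCTP_OFFSET = 0x02       # offset in the quoted #define


def implied_base(ret, offset):
    """Return the base value that would make base + offset equal ret."""
    return ret - offset


# The decimal and hexadecimal codes in the app log are the same number.
print(RET_DECIMAL == RET_HEX_IN_LOG)
# Inferred, not confirmed: the base implied by the #define.
print(hex(implied_base(RET_DECIMAL, MCTP_OFFSET)))
```

So the decimal 4098 lines and the 0x1002 line in the app log report the same error, which supports reading all of the hs_pd.c and hs_ctrl.c failures as the same MCTP read/write failure.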

Power off, then power on again.