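Problem description
During AC power-cycle testing of the Atlas 300i A2 64G, the out-of-band (BMC) path intermittently fails to read the NPU temperature sensors, and the NPU page cannot display the serial number, firmware version, and related information.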
Environment
- Software version: [OpenUBMC2509 High Availability]
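Steps to reproduce
- After an AC power cycle of the server, once BMC initialization has completed, the sensors loaded for the NPU intermittently fail to report temperature values, and the processor page cannot retrieve the serial number, firmware version, and related information. Power consumption readings are retrieved normally.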
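Expected result
After an AC power cycle of the server, the NPU sensors display temperature readings normally and the processor page shows the card's basic information.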
Actual result
The NPU temperature sensors report no temperature reading.
No SMBus access errors were found in the compute log:
2026-01-14 10:26:10.353216 compute DEBUG: NPUCard.lua(509): NPUCard 2 update ChipHealthStatus state:0
2026-01-14 10:26:10.353968 compute DEBUG: NPUCard.lua(510): update fault state:0
2026-01-14 10:26:10.354684 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.355538 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.356386 compute DEBUG: std_smbus.lua(156): receiving data: 0F 00 00 04 00 02 00 00 00 02 00 00 00 F7 02 69
2026-01-14 10:26:10.357563 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.358617 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.359567 compute DEBUG: std_smbus.lua(156): receiving data: 0D 00 00 21 00 00 00 00 00 00 00 00 00 69 00 00 00 00
2026-01-14 10:26:10.360859 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.361677 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.362865 compute DEBUG: std_smbus.lua(156): receiving data: 21 00 00 1D 00 3D 00 00 00 14 00 00 00 06 4C 4D 37 35 41 5F 54 45 23 00 4C 4D 37 35 42 5F 54 45 24 76
2026-01-14 10:26:10.363934 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.364575 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.365391 compute DEBUG: std_smbus.lua(156): receiving data: 21 00 00 00 00 55 00 00 00 14 00 00 00 00 08 00 09 00 0A 00 0B 00 0C 00 0F 00 10 00 11 00 15 00 18 2A
2026-01-14 10:26:10.366405 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.367072 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.367875 compute DEBUG: std_smbus.lua(156): receiving data: 21 00 00 02 00 02 00 00 00 14 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 C2
2026-01-14 10:26:10.368760 compute DEBUG: NPUCard.lua(546): update error codes
2026-01-14 10:26:10.370075 compute DEBUG: NPUCard.lua(509): NPUCard 9 update ChipHealthStatus state:0
2026-01-14 10:26:10.370769 compute DEBUG: NPUCard.lua(510): update fault state:0
2026-01-14 10:26:10.371562 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
2026-01-14 10:26:10.372591 compute DEBUG: app_preloader.lua(58): co[thread: 0x11a80a1d2968] exit
2026-01-14 10:26:10.373522 compute DEBUG: std_smbus.lua(156): receiving data: 0F 00 00 04 00 02 00 00 00 02 00 00 00 F6 02 7C
2026-01-14 10:26:10.374669 compute DEBUG: app_preloader.lua(95): create co[thread: 0x11a80a1d2968]
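For triage, the "receiving data" payloads in the log above can be decoded offline. The following is a minimal sketch, not part of the compute service: the function names and the regular expressions are illustrative assumptions, and the frame layout itself is not described in this report, so the code only extracts the raw bytes and any identifier-like ASCII strings embedded in them. Applied to the frame logged at 10:26:10.362865, it recovers the strings LM75A_TE and LM75B_TE.

```python
import re

# Assumption: the payload is the space-separated hex dump that follows
# "receiving data:" in std_smbus.lua debug lines, as seen in the log above.
RECV_RE = re.compile(r"receiving data:\s*((?:[0-9A-Fa-f]{2}\s?)+)")

# Identifier-like ASCII runs (letters, digits, underscore), 4+ bytes long.
NAME_RE = re.compile(rb"[A-Za-z0-9_]{4,}")


def parse_payload(line):
    """Return the payload bytes of a 'receiving data' line, or None."""
    match = RECV_RE.search(line)
    if match is None:
        return None
    return bytes.fromhex(match.group(1))


def embedded_names(payload):
    """Return identifier-like ASCII strings found inside the payload."""
    return [run.decode("ascii") for run in NAME_RE.findall(payload)]


if __name__ == "__main__":
    sample = (
        "2026-01-14 10:26:10.362865 compute DEBUG: std_smbus.lua(156): "
        "receiving data: 21 00 00 1D 00 3D 00 00 00 14 00 00 00 06 "
        "4C 4D 37 35 41 5F 54 45 23 00 4C 4D 37 35 42 5F 54 45 24 76"
    )
    print(embedded_names(parse_payload(sample)))
```

Running the same helper over a full compute log from a failing boot and from a healthy boot makes it easier to compare which responses differ, without assuming anything about the meaning of the individual fields.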
The following error is printed repeatedly in app.log:
2026-01-14 06:32:37.488104 compute ERROR: app_preloader.lua(212): ./opt/bmc/apps/compute/lualib/compute_app.lua:662: app(compute/service/main) count(1219) pcall failed(kepler.class.SetSyncPropertyError: The property Health of the object NPU_1_01010502 is a synchronous property and cannot be set)
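To see which objects and properties are affected and how often, the repeated error can be summarized from app.log with a small script. This is a hedged sketch: the regular expression is derived only from the single error line quoted above, and the function names are hypothetical, so other variants of the message may need a different pattern.

```python
import re
from collections import Counter

# Assumption: every occurrence follows the wording of the line quoted above.
SYNC_ERR_RE = re.compile(
    r"count\((?P<count>\d+)\) pcall failed\("
    r"kepler\.class\.SetSyncPropertyError: "
    r"The property (?P<prop>\S+) of the object (?P<obj>\S+) "
    r"is a synchronous property"
)


def parse_sync_error(line):
    """Extract failure count, property and object from one log line."""
    match = SYNC_ERR_RE.search(line)
    if match is None:
        return None
    return {
        "count": int(match.group("count")),
        "property": match.group("prop"),
        "object": match.group("obj"),
    }


def summarize(lines):
    """Count SetSyncPropertyError occurrences per (object, property)."""
    totals = Counter()
    for line in lines:
        info = parse_sync_error(line)
        if info is not None:
            totals[(info["object"], info["property"])] += 1
    return totals
```

For example, `summarize(open("app.log", encoding="utf-8"))` returns a counter keyed by object and property, which shows whether only `Health` on `NPU_1_01010502` is involved or other NPU objects as well.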
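Attempted solutions
A write delay for the Chip was already added during NPU adaptation. The fault still reproduces with the write delay in place.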