[Help request] NIC intermittently hits MCTP timeouts and temperature reads fail

Problem description

The NIC intermittently hits MCTP timeouts, and reading the NIC temperature or the optical module temperature fails.

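If reading the optical module temperature fails more than 3 times in a row, the optical module's Presence is set to 0, so the web UI stops showing the optical module information. The information reappears once communication recovers. This case occurs intermittently.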

2026-03-14 21:13:43.810086 network_adapter WARNING: init.lua(767): service[bmc.kepler.network_adapter] request timeout: remote service[bmc.kepler.mctpd], path[/bmc/kepler/Systems/1/Mctp/Endpoint/65/2], interface[bmc.kepler.Systems.Mctp.PCIeEndpoint], method[Request], used time[521s]
2026-03-14 21:13:45.609870 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 1760 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:13:47.129308 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 517 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:13:48.251421 network_adapter WARNING: harbor_client.lua(386): Receiver Skynet message queue scheduling delay is 549 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/Systems/1/NetworkAdapters/NetworkAdapter_1_0101010302/Ports/NetworkPort_2_0101010302/OpticalModule, interface=bmc.kepler.Metric, method_name=GetData [repeated 18 times in 315s from 2026-03-14 21:08:33.589930 to 2026-03-14 21:13:48.251421]
2026-03-14 21:14:12.487976 network_adapter NOTICE: network_adapter.lua(1997): PCIeCard2(BCM957414A4142CC) update firmware version to 234.0.150.0 by ncsi
2026-03-14 21:14:13.406014 network_adapter NOTICE: network_adapter.lua(1997): PCIeCard3(BCM957412A4120AC) update firmware version to 234.0.150.0 by ncsi
2026-03-14 21:14:16.008806 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 1017 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:15:08.934731 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout [repeated 27 times in 309s from 2026-03-14 21:09:59.516712 to 2026-03-14 21:15:08.934731]
2026-03-14 21:15:38.961288 network_adapter ERROR: optical_module.lua(319): PCIeCard1(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error [repeated 4 times in 243s from 2026-03-14 21:10:11.396496 to 2026-03-14 21:14:14.024439][flush]
2026-03-14 21:15:38.961491 network_adapter ERROR: optical_module.lua(319): PCIeCard1(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error [repeated 4 times in 256s from 2026-03-14 21:09:58.462459 to 2026-03-14 21:14:14.184955][flush]
2026-03-14 21:15:38.961599 network_adapter NOTICE: optical_module.lua(325): unable to get op info after 3 times, set Presence to 0 (stack traceback:  ...s/network_adapter/lualib/device/class/optical_module.lua:326: in function <...s/network_adapter/lualib/device/class/optical_module.lua:300>  [C]: in function 'pcall'  ./opt/bmc/libmc/lualib/mc/signal.lua:93: in function 'run_slot'  ./opt/bmc/libmc/lualib/mc/signal.lua:113: in function <./opt/bmc/libmc/lualib/mc/signal.lua:103>  [C]: in function 'pcall'  ./opt/bmc/libmc/lualib/mc/signal.lua:293: in function 'emit'  ./opt/bmc/lualib/libmgmt_protocol/transport/scheduler.lua:43: in function ''  /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0>  [builtin#21]: at 0xffffb09f2d28  [C]: in function 'pcall'  ./opt/bmc/libmc/lualib/mc/context.lua:212: in function 'with_context'  ./opt/bmc/libmc/lualib/mc/app_preloader.lua:97: in function ''  /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0>) [repeated 8 times in 256s from 2026-03-14 21:09:58.464853 to 2026-03-14 21:14:14.186262][flush]
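
For readers who want the presence behavior above in a compact form, here is a minimal Python sketch of a consecutive-failure counter. It is only a model of what the log suggests, not the actual optical_module.lua code: the class name, the `on_read` method, and the choice of `>= 3` as the comparison are all assumptions (the log says "after 3 times", and the exact boundary is not confirmed).

```python
class OpticalModulePresence:
    """Hypothetical model of the behavior seen in the log, not the real implementation."""

    def __init__(self, max_failures=3):
        # Assumed threshold, taken from "unable to get op info after 3 times".
        self.max_failures = max_failures
        self.failures = 0        # consecutive failed reads
        self.presence = 1        # 1 = shown in the web UI, 0 = hidden
        self.temperature = None  # last good reading

    def on_read(self, temperature):
        """Record one poll result; pass None when the request timed out."""
        if temperature is None:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.presence = 0
            # The last good temperature is kept while reads are failing.
            return
        # Any successful read resets the counter and restores presence.
        self.failures = 0
        self.presence = 1
        self.temperature = temperature


if __name__ == "__main__":
    module = OpticalModulePresence()
    for reading in [41.0, None, None, None, 43.0]:
        module.on_read(reading)
        print(reading, module.presence, module.temperature)
```

In this model a short burst of timeouts only leaves a stale temperature, while a longer burst hides the module until the next successful read, which matches the two symptoms described in this post.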

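If the failures do not persist that long, the temperature keeps its last good value, but the following messages flood the log. This case reproduces almost every time.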

2026-03-17 21:37:04.177956 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:24.182987 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:35.250016 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:35.253244 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error
2026-03-17 21:37:55.290603 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:05.669149 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:05.671441 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error
2026-03-17 21:38:21.192924 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:21.228375 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error
2026-03-17 21:38:24.773828 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:24.778880 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error
2026-03-17 21:38:56.809959 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:39:00.371169 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:39:27.351330 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout

Environment information

  • Software version: OpenUBMC LTS SP1

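In the affected environment, three PCIe NICs are installed on a BC83PRUO riser, and only slot 3 reports temperature read failures. We swapped the cards in slots 1 and 3 once, and it was still the card in slot 3 that failed to report its temperature.
In another environment, the two cards below are installed on a BC83PRUO riser, and the NIC in slot 3 also logs temperature read failures from time to time. We therefore suspect that slot 3 is special in some way, but we have no idea how to narrow this down further, so we are asking for advice here.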

Slot | Vendor Id  | Device Id  | Sub Vendor Id   | Sub Device Id   | Bus Number   | Device Number  | Function Number  | Card Desc
3    | 0x8086     | 0x57b0     | 0x8086          | 0x0002          | 0x16         | 0x00           | 0x00             | E610XT2M5
2    | 0x1000     | 0x0017     | 0x1000          | 0x9440          | 0x16         | 0x04           | 0x00             | PCIe RAID

2026-02-03 02:50:30.410256 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-02-03 02:50:30.412569 network_adapter ERROR: network_adapter.lua(2022): PCIeCard3(E610XT2M5) update chip temp by ncsi on_error
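
To check whether the failures really concentrate on one card, the on_error lines can be tallied per PCIeCard name. The Python sketch below assumes the log format is exactly as quoted in this post; the function name `count_temp_failures` is made up for illustration, and treating `[repeated N times ...]` as N occurrences is an assumption about how the logger aggregates lines.

```python
import re
from collections import Counter

# Matches network_adapter lines such as:
#   ... PCIeCard3(E610XT2M5) update chip temp by ncsi on_error
#   ... PCIeCard1(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error
ON_ERROR = re.compile(
    r"(PCIeCard\d+)\(([^)]+)\).*update (?:optical|chip) temp by ncsi on_error"
)
# Aggregated lines carry a suffix like "[repeated 4 times in 243s from ...]".
REPEATED = re.compile(r"\[repeated (\d+) times")


def count_temp_failures(lines):
    """Return a Counter mapping card name (e.g. 'PCIeCard3') to failure count."""
    counts = Counter()
    for line in lines:
        match = ON_ERROR.search(line)
        if not match:
            continue  # skip mctpd timeouts and unrelated messages
        repeated = REPEATED.search(line)
        # Assumption: a "repeated N times" line stands for N occurrences.
        counts[match.group(1)] += int(repeated.group(1)) if repeated else 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_failures.py < app.log  (the file name is a placeholder)
    for card, total in count_temp_failures(sys.stdin).most_common():
        print(card, total)
```

Running this over the logs from both environments would show how the failures are distributed across cards, which may help confirm or rule out the slot 3 suspicion.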