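Problem description
The NIC intermittently hits MCTP timeouts, and reading the NIC temperature or the optical module temperature then fails.
If reading the optical module temperature fails more than 3 times in a row, the optical module's Presence is set to 0, so the optical module information disappears from the web UI; it is shown again once communication recovers. This occurs intermittently.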
2026-03-14 21:13:43.810086 network_adapter WARNING: init.lua(767): service[bmc.kepler.network_adapter] request timeout: remote service[bmc.kepler.mctpd], path[/bmc/kepler/Systems/1/Mctp/Endpoint/65/2], interface[bmc.kepler.Systems.Mctp.PCIeEndpoint], method[Request], used time[521s]
2026-03-14 21:13:45.609870 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 1760 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:13:47.129308 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 517 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:13:48.251421 network_adapter WARNING: harbor_client.lua(386): Receiver Skynet message queue scheduling delay is 549 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/Systems/1/NetworkAdapters/NetworkAdapter_1_0101010302/Ports/NetworkPort_2_0101010302/OpticalModule, interface=bmc.kepler.Metric, method_name=GetData [repeated 18 times in 315s from 2026-03-14 21:08:33.589930 to 2026-03-14 21:13:48.251421]
2026-03-14 21:14:12.487976 network_adapter NOTICE: network_adapter.lua(1997): PCIeCard2(BCM957414A4142CC) update firmware version to 234.0.150.0 by ncsi
2026-03-14 21:14:13.406014 network_adapter NOTICE: network_adapter.lua(1997): PCIeCard3(BCM957412A4120AC) update firmware version to 234.0.150.0 by ncsi
2026-03-14 21:14:16.008806 network_adapter WARNING: init.lua(1129): Requestor Skynet message queue scheduling delay is 1017 ms (threshold: 500 ms), service_name=bmc.kepler.network_adapter, path=/bmc/kepler/persistence, interface=bmc.kepler.persistence, method_name=Save
2026-03-14 21:15:08.934731 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout [repeated 27 times in 309s from 2026-03-14 21:09:59.516712 to 2026-03-14 21:15:08.934731]
2026-03-14 21:15:38.961288 network_adapter ERROR: optical_module.lua(319): PCIeCard1(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error [repeated 4 times in 243s from 2026-03-14 21:10:11.396496 to 2026-03-14 21:14:14.024439][flush]
2026-03-14 21:15:38.961491 network_adapter ERROR: optical_module.lua(319): PCIeCard1(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error [repeated 4 times in 256s from 2026-03-14 21:09:58.462459 to 2026-03-14 21:14:14.184955][flush]
2026-03-14 21:15:38.961599 network_adapter NOTICE: optical_module.lua(325): unable to get op info after 3 times, set Presence to 0 (stack traceback: ...s/network_adapter/lualib/device/class/optical_module.lua:326: in function <...s/network_adapter/lualib/device/class/optical_module.lua:300> [C]: in function 'pcall' ./opt/bmc/libmc/lualib/mc/signal.lua:93: in function 'run_slot' ./opt/bmc/libmc/lualib/mc/signal.lua:113: in function <./opt/bmc/libmc/lualib/mc/signal.lua:103> [C]: in function 'pcall' ./opt/bmc/libmc/lualib/mc/signal.lua:293: in function 'emit' ./opt/bmc/lualib/libmgmt_protocol/transport/scheduler.lua:43: in function '' /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0> [builtin#21]: at 0xffffb09f2d28 [C]: in function 'pcall' ./opt/bmc/libmc/lualib/mc/context.lua:212: in function 'with_context' ./opt/bmc/libmc/lualib/mc/app_preloader.lua:97: in function '' /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0>) [repeated 8 times in 256s from 2026-03-14 21:09:58.464853 to 2026-03-14 21:14:14.186262][flush]
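For readers who want to reason about the symptom, here is a minimal Python sketch of the behavior the logs above suggest: a counter of consecutive failed reads that clears Presence once it reaches a threshold, and restores it on the next successful read. This is not the actual `optical_module.lua` code. The class name, field names, and the exact comparison against the threshold are assumptions for illustration; only the value 3 is taken from the "after 3 times" log text.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: threshold taken from the log text "unable to get op info after 3 times".
MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class OpticalModuleState:
    """Hypothetical model of one optical module as seen by the BMC."""
    presence: int = 1
    temperature: Optional[float] = None
    consecutive_failures: int = 0

    def on_temperature(self, value: float) -> None:
        # A successful read resets the counter and restores Presence.
        self.consecutive_failures = 0
        self.presence = 1
        self.temperature = value

    def on_error(self) -> None:
        # A failed read keeps the last good temperature; Presence is only
        # cleared once the failures have been consecutive for long enough.
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.presence = 0
```

In this model a short run of failures leaves the last good temperature in place, while a longer run hides the module until one read succeeds again.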
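If the failures last only a few attempts, the temperature keeps its last normal value, but the following logs flood the log file. This one reproduces almost every time.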
2026-03-17 21:37:04.177956 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:24.182987 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:35.250016 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:37:35.253244 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error
2026-03-17 21:37:55.290603 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:05.669149 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:05.671441 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error
2026-03-17 21:38:21.192924 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:21.228375 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_1 update optical temp by ncsi on_error
2026-03-17 21:38:24.773828 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:38:24.778880 network_adapter ERROR: optical_module.lua(319): PCIeCard3(BCM957508-P2100G) optical_module_0 update optical temp by ncsi on_error
2026-03-17 21:38:56.809959 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:39:00.371169 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-03-17 21:39:27.351330 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
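Environment information
- Software version: OpenUBMC LTS SP1
In the affected environment, three PCIe NICs are installed on a BC83PRUO riser, and only slot 3 reports temperature read failures. We swapped the cards in slots 1 and 3 once, and in both cases it was the card in slot 3 that failed to report its temperature.
In another environment, the following two cards are installed on a BC83PRUO riser, and the NIC in slot 3 also logs temperature read failures from time to time. We therefore suspect that slot 3 is special in some way, but we have no idea how to narrow it down further, so we are asking for advice here.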
Slot | Vendor Id | Device Id | Sub Vendor Id | Sub Device Id | Bus Number | Device Number | Function Number | Card Desc
3 | 0x8086 | 0x57b0 | 0x8086 | 0x0002 | 0x16 | 0x00 | 0x00 | E610XT2M5
2 | 0x1000 | 0x0017 | 0x1000 | 0x9440 | 0x16 | 0x04 | 0x00 | PCIe RAID
2026-02-03 02:50:30.410256 mctpd ERROR: mctp_engine.lua(376): [System1]mctp_engine: request timeout
2026-02-03 02:50:30.412569 network_adapter ERROR: network_adapter.lua(2022): PCIeCard3(E610XT2M5) update chip temp by ncsi on_error
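To check whether the failures really cluster on one slot, the `on_error` lines can be tallied per card from a collected log. The following Python sketch does that. It assumes the log line layout shown in the excerpts above, that the `PCIeCardN` label identifies the slot, and that a trailing `[repeated N times ...]` marker means the line stands for N occurrences; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches network_adapter "on_error" lines and captures the PCIeCardN label,
# plus the optional "[repeated N times" aggregation marker.
ON_ERROR_RE = re.compile(
    r"network_adapter ERROR: .*?(PCIeCard\d+)\([^)]*\).*?on_error"
    r"(?: \[repeated (\d+) times)?"
)


def count_on_error_by_card(lines):
    """Return a Counter mapping 'PCIeCardN' to the number of on_error events."""
    counts = Counter()
    for line in lines:
        match = ON_ERROR_RE.search(line)
        if match:
            # Assumption: a line without a "repeated" marker counts once.
            counts[match.group(1)] += int(match.group(2) or 1)
    return counts
```

Usage: read the log file line by line, for example `count_on_error_by_card(open(path, encoding="utf-8"))` with `path` pointing at the collected log, and compare the per-card totals before and after swapping cards between slots.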