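BMC version: 5.05.12.15
Symptom: during repeated reboot testing, the NIC slot number is intermittently reported as 255
Log analysis:
- After the hardware process was restarted, the pcie_device component obtained the pcie_device object too early and entered the sync_pcie_addr_info_to_device flow.
- At that point pcieaddrinfo had not yet finished its code path, so slotid still held its default value of 255. That value was synced to pcie_device, which caused the problem.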
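The race described above can be illustrated with a minimal sketch, written in Python for brevity rather than the Lua used by the BMC components. The classes `PcieAddrInfo` and `PcieDevice` and the function `sync_slot_id` are hypothetical stand-ins, not the real objects or the real sync_pcie_addr_info_to_device implementation. The only detail taken from the analysis is that slotid defaults to 255 until pcieaddrinfo has finished initializing.

```python
from dataclasses import dataclass
from typing import Optional

# Default slot id named in the log analysis; treated here as "not yet populated".
DEFAULT_SLOT_ID = 255


@dataclass
class PcieAddrInfo:
    """Hypothetical stand-in for the pcieaddrinfo object."""
    slot_id: int = DEFAULT_SLOT_ID


@dataclass
class PcieDevice:
    """Hypothetical stand-in for the pcie_device object."""
    slot_id: Optional[int] = None


def sync_slot_id(addr_info: PcieAddrInfo, device: PcieDevice) -> bool:
    """Copy the slot id only once it no longer holds the default value.

    Returns True if the value was synced, False if the sync was skipped
    and should be retried later (for example on a property-change signal).
    """
    if addr_info.slot_id == DEFAULT_SLOT_ID:
        return False
    device.slot_id = addr_info.slot_id
    return True


info = PcieAddrInfo()
dev = PcieDevice()
early = sync_slot_id(info, dev)  # pcieaddrinfo not ready yet, sync skipped
info.slot_id = 3                 # pcieaddrinfo finishes its initialization
late = sync_slot_id(info, dev)   # slot id is now valid and gets copied
```

This assumes 255 is never a legitimate slot number on the platform. If it can be, an explicit "initialized" flag would be a safer readiness signal than the value itself.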
Framework log:
2025-04-04 02:57:06.954189 hwproxy NOTICE: init.lua(324): [power_mgmt_plugins] run cmd[start_power_monitor]
2025-04-04 02:57:06.954541 hwproxy NOTICE: init.lua(226): set [power] read interval 10 successfully
2025-04-04 02:57:07.968315 hwproxy NOTICE: init.lua(324): [power_mgmt_plugins] run cmd[start_power_monitor]
2025-04-04 02:57:07.968681 hwproxy NOTICE: init.lua(226): set [power] read interval 10 successfully
2025-04-04 02:59:23.881456 [:00000014] hardware: lua call [15 to :14 : 2519532 msgsz = 18] error : ./opt/bmc/skynet/lualib/skynet.lua:901: ./opt/bmc/skynet/lualib/skynet.lua:330: .../bmc/apps/network_adapter/lualib/hardware_config/CX5.lua:86: attempt to perform bitwise operation on a nil value (local 'v')
stack traceback:
    .../bmc/apps/network_adapter/lualib/hardware_config/CX5.lua:86: in local 'bsd_to_string'
    .../bmc/apps/network_adapter/lualib/hardware_config/CX5.lua:93: in field 'parser'
    ./opt/bmc/lualib/libmgmt_protocol/transport/scheduler.lua:60: in upvalue 'func'
    ./opt/bmc/skynet/lualib/skynet.lua:804: in local 'cb'
    ./opt/bmc/libmc/lualib/mc/context.lua:146: in function 'mc.context.with_context'
    (...tail calls...)
    ./opt/bmc/skynet/lualib/skynet.lua:280: in function <./opt/bmc/skynet/lualib/skynet.lua:252>
stack traceback:
    [C]: in function 'assert'
    ./opt/bmc/skynet/lualib/skynet.lua:901: in function 'skynet.manager.dispatch_message'
2025-04-04 03:00:18.245890 framework ERROR: sml_ctrl.c(454): is_logical_drive_epd:Get LD target id failed, return 4609
2025-04-04 04:28:20.447758 maca NOTICE: base.lua(317): component[network_adapter] exited total 1 times, total 1 times for all modules
2025-04-04 04:28:20.722988 maca NOTICE: base.lua(317): component[general_hardware] exited total 1 times, total 2 times for all modules
2025-04-04 04:28:20.952902 maca NOTICE: base.lua(317): component[storage] exited total 1 times, total 3 times for all modules
2025-04-04 04:28:21.058167 maca NOTICE: base.lua(317): component[mctpd] exited total 1 times, total 4 times for all modules
2025-04-04 04:28:21.162611 maca NOTICE: base.lua(317): component[lsw] exited total 1 times, total 5 times for all modules
2025-04-04 04:28:21.255855 maca NOTICE: base.lua(317): component[thermal_mgmt] exited total 1 times, total 6 times for all modules
2025-04-04 04:28:21.350056 maca NOTICE: base.lua(317): component[chassis] exited total 1 times, total 7 times for all modules
2025-04-04 04:28:21.450087 maca NOTICE: base.lua(317): component[power_mgmt] exited total 1 times, total 8 times for all modules
2025-04-04 04:28:21.544329 maca NOTICE: base.lua(317): component[manufacture] exited total 1 times, total 9 times for all modules
2025-04-04 04:28:21.644697 maca NOTICE: base.lua(317): component[compute] exited total 1 times, total 10 times for all modules
2025-04-04 04:28:21.768435 maca NOTICE: base.lua(317): component[pcie_device] exited total 1 times, total 11 times for all modules
2025-04-04 04:28:21.915804 maca NOTICE: base.lua(317): component[bmc_soc] exited total 1 times, total 12 times for all modules
2025-04-04 04:28:22.032830 maca NOTICE: base.lua(317): component[bios] exited total 1 times, total 13 times for all modules
2025-04-04 04:28:22.774125 [:00000002] hardware: LAUNCH snlua bootstrap
2025-04-04 04:28:22.825549 [:00000003] hardware: LAUNCH snlua launcher
2025-04-04 04:28:22.845106 [:00000004] hardware: LAUNCH snlua cdummy
2025-04-04 04:28:22.857536 [:00000005] hardware: LAUNCH harbor 0 4
2025-04-04 04:28:22.865222 [:00000006] hardware: LAUNCH snlua datacenterd
2025-04-04 04:28:22.875281 [:00000007] hardware: LAUNCH snlua service_mgr
2025-04-04 04:28:22.905720 [:00000008] hardware: LAUNCH snlua hica/subsys/hardware/service/main
2025-04-04 04:28:22.926035 [:00000009] hardware: LAUNCH snlua sd_bus
2025-04-04 04:28:23.471814 [:0000000a] hardware: LAUNCH snlua harbor
2025-04-04 04:28:23.534683 [:00000002] hardware: KILL self
2025-04-04 04:28:23.542737 [:0000000b] hardware: LAUNCH snlua pcie_device/service/main
2025-04-04 04:28:23.543062 [:0000000c] hardware: LAUNCH snlua bios/service/main
2025-04-04 04:28:23.543208 [:0000000d] hardware: LAUNCH snlua compute/service/main
2025-04-04 04:28:23.543322 [:0000000e] hardware: LAUNCH snlua bmc_soc/service/main
2025-04-04 04:28:23.543431 [:0000000f] hardware: LAUNCH snlua manufacture/service/main
2025-04-04 04:28:23.543536 [:00000010] hardware: LAUNCH snlua thermal_mgmt/service/main
2025-04-04 04:28:24.238124 [:00000012] hardware: LAUNCH snlua general_hardware/service/main
2025-04-04 04:28:24.750841 [:00000013] hardware: LAUNCH snlua power_mgmt/service/main
2025-04-04 04:28:24.944060 [:00000014] hardware: LAUNCH snlua network_adapter/service/main
2025-04-04 04:28:24.944733 [:00000015] hardware: LAUNCH snlua mctpd/service/main
2025-04-04 04:28:24.945273 [:00000016] hardware: LAUNCH snlua lsw/service/main
2025-04-04 04:28:24.945803 [:00000017] hardware: LAUNCH snlua storage/service/main
2025-04-04 04:28:25.061628 [:00000018] hardware: LAUNCH snlua chassis/service/main
2025-04-04 04:28:25.213006 [:00000019] hardware: LAUNCH snlua mctpd/service/write_service
2025-04-04 04:28:26.047537 [:0000001a] hardware: LAUNCH snlua storage/service/smld
2025-04-04 04:28:26.278903 maca NOTICE: base.lua(374): monitor component mctpd added, service: bmc.kepler.mctpd
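Questions:
1. What could cause the hardware error log above? For example, is the data returned over NCSI being used without validation?
2. The hardware exception occurred at around 03:00, so why did maca only detect multiple hardware components exiting abnormally at 04:28?
3. What is maca's check mechanism: does it restart the process only after a single component's health status has been detected as abnormal 3 times in a row, or after some number of abnormal health checks across all components in the process?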
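To make question 1 concrete, here is a minimal sketch of the kind of validation meant there, in Python rather than the Lua of CX5.lua. The function name `bdf_to_string` and the two-byte payload layout (bus byte, then a device/function byte in the standard PCI encoding) are assumptions for illustration only. The real response format and the real `bsd_to_string` in CX5.lua are not shown in the logs. The point is only that the length is checked before any bitwise operation, so a short or missing response yields "no value" instead of an exception.

```python
from typing import Optional


def bdf_to_string(payload: Optional[bytes]) -> Optional[str]:
    """Decode a hypothetical 2-byte bus/devfn payload into 'bb:dd.f'.

    Returns None when the payload is missing or too short, instead of
    attempting bitwise operations on absent data.
    """
    if payload is None or len(payload) < 2:
        return None
    bus, devfn = payload[0], payload[1]
    device = (devfn >> 3) & 0x1F  # upper 5 bits: device number
    function = devfn & 0x07       # lower 3 bits: function number
    return f"{bus:02x}:{device:02x}.{function:x}"


ok = bdf_to_string(bytes([0x3B, 0x00]))  # well-formed response
empty = bdf_to_string(b"")               # truncated response is rejected
missing = bdf_to_string(None)            # absent response is rejected
```

Whether the nil `v` at CX5.lua:86 actually comes from a short or empty response is not confirmed by the logs. This only shows where such a guard would sit if it does.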
LogDump.zip (3.0 MB)
app_revision.txt (3.8 KB)