AC测试,概率出现BMC重启的问题

S920X20做整机AC测试,BMC版本是2025.12,跑了数圈以后,出现BMC重启的异常告警。跑了2天,共出现2次,非偶然现象,如果持续跑,大概率还会出现。

The iBMC1036 |2026-04-17 05:18:34 |Normal |0x1A00006F |Asserted |The openUBMC is reset, with the cause code of 0x10.

555 |2026-04-16 11:04:30 |Normal |0x1A00006F |Asserted |The openUBMC is reset, with the cause code of 0x10.

通过排查app.log日志发现:

2026-04-17 05:15:47.281590 killall busybox_x
2026-04-17 05:15:47.341601 Remove cdev_veth_drv successfully.
2026-04-17 05:15:47.445149 Remove bmc_tpcm_drv successfully.
2026-04-17 05:15:48.322566 Remove bmc_veth_drv successfully.
2026-04-17 05:15:51.085586 Remove bmc_cdev_drv successfully.
2026-04-17 05:15:51.310982 Remove bmc_edma_drv successfully.
rmmod: can’t unload module ‘bmc_hisport_drv’: Resource temporarily unavailable
rmmod: can’t unload module ‘bmc_hisport_drv’: Resource temporarily unavailable

….(重复N次)

rmmod: can’t unload module ‘bmc_hisport_drv’: Resource temporarily unavailable
2026-04-17 05:16:50.242275 Remove bmc_hisport_drv successfully.
bios ERROR: bios file not exist
1970-01-01 00:00:17.328023 metric_analyzer NOTICE: File size is equal to given size
1970-01-01 00:00:20.620811 file_transfer NOTICE: get_dbus_connection:1905 > Dbus (session bus) connected

…(省略重启日志)

2026-04-17 05:18:35.665326 event NOTICE: software_event.lua(158): [BMC_BMC.BMCRestartWithCause_[“0x10”]___] generate an event [assert].(上报重启异常事件)

通过排查kbox_info日志发现,存在kbox restart事件,可能是因为skynet 进程不断崩溃并产生 core dump,导致看门狗超时触发硬件重启:

[20260417051230][I2C-03-RT<-11|-11>] M<0x5c|0x1|0x0|1|10> L<1|3|0> C<1|0|3> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051232][I2C-03-RT<-11|-11>] M<0x5d|0x1|0x0|1|10> L<1|3|0> C<1|0|3> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051233][I2C-03-RT<-11|-11>] M<0x5c|0x1|0x0|1|10> L<1|3|0> C<1|0|3> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051234][I2C-03-RT<-11|-11>] M<0x5d|0x1|0x0|1|10> L<1|2|0> C<1|0|2> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051234][I2C-03-RT<-11|-11>] M<0x5d|0x1|0x0|1|10> L<1|3|0> C<1|0|3> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051235][I2C-03-RT<-11|-11>] M<0x5c|0x1|0x0|1|10> L<1|2|0> C<1|0|2> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051238]coredump:do_coredump 1029,task skynet 6263 5225:end
[20260417051238][I2C-03-RT<-11|-11>] M<0x5c|0x1|0x0|1|10> L<1|2|0> C<1|0|2> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051239][I2C-03-RT<-11|-11>] M<0x5c|0x1|0x0|1|10> L<1|3|0> C<1|0|3> E<0x801|0x0|0x0|0x1000> I<0x0|0x510|0x0>
[20260417051244]EXT4-fs (mmcblk0gp1): mounted filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e ro without journal. Opts: noload
[20260417051244]EXT4-fs (mmcblk0gp1): unmounting filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e.
[20260417051246]76,2026-04-17 05:12:46,6435,msg_sec_module_cmd_cpy_to_user,523,msg copy_to_user user data error 8
[20260417051247]77,2026-04-17 05:12:46,6435,msg_scm3_ioctl,613,msg cmd(0xc0047301) handle error!
[20260417051309]coredump:do_coredump 767,task skynet 6607 6424:start
[20260417051309]78,2026-04-17 05:13:09,6512,ipmb_cmd_check_data_readable,1285,ipmb check read was interrupted, ret = -4
[20260417051309]coredump:coredump_wait 543,task skynet 6607 6424:star wait
[20260417051309][idump/coredump] tgid=6424 task[6607 skynet], out of the stack, dump size:5f5e000
[20260417051318][I2C-08-RT<-5|-5>] M<0x21|0x1|0x0|1|10> L<1|1|0> C<1|0|1> E<0x801|0x0|0x0|0x1> I<0x0|0x710|0x0>
[20260417051326]coredump:do_coredump 1029,task skynet 6607 6424:end
[20260417051334]EXT4-fs (mmcblk0gp1): mounted filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e ro without journal. Opts: noload
[20260417051334]EXT4-fs (mmcblk0gp1): unmounting filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e.
[20260417051336]79,2026-04-17 05:13:36,6759,msg_sec_module_cmd_cpy_to_user,523,msg copy_to_user user data error 8
[20260417051336]80,2026-04-17 05:13:36,6759,msg_scm3_ioctl,613,msg cmd(0xc0047301) handle error!
[20260417051355]coredump:do_coredump 767,task skynet 6871 6737:start
[20260417051355]81,2026-04-17 05:13:55,6827,ipmb_cmd_check_data_readable,1285,ipmb check read was interrupted, ret = -4
[20260417051355]coredump:coredump_wait 543,task skynet 6871 6737:star wait
[20260417051355][idump/coredump] tgid=6737 task[6871 skynet], out of the stack, dump size:5f5e000
[20260417051412]coredump:do_coredump 1029,task skynet 6871 6737:end
[20260417051418][I2C-08-RT<-5|-5>] M<0x21|0x1|0x0|1|10> L<1|1|0> C<1|0|1> E<0x801|0x0|0x0|0x1> I<0x0|0x710|0x0>
[20260417051419]82,2026-04-17 05:14:19,7017,msg_sec_module_cmd_cpy_to_user,523,msg copy_to_user user data error 8
[20260417051419]83,2026-04-17 05:14:19,7017,msg_scm3_ioctl,613,msg cmd(0xc0047301) handle error!
[20260417051419]EXT4-fs (mmcblk0gp1): mounted filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e ro without journal. Opts: noload
[20260417051420]EXT4-fs (mmcblk0gp1): unmounting filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e.
[20260417051440]coredump:do_coredump 767,task skynet 7136 7008:start
[20260417051440]coredump:coredump_wait 543,task skynet 7136 7008:star wait
[20260417051440]84,2026-04-17 05:14:40,7099,ipmb_cmd_check_data_readable,1285,ipmb check read was interrupted, ret = -4
[20260417051440][idump/coredump] tgid=7008 task[7136 skynet], out of the stack, dump size:5f5e000
[20260417051458]coredump:do_coredump 1029,task skynet 7136 7008:end
[20260417051505]85,2026-04-17 05:15:05,7298,msg_sec_module_cmd_cpy_to_user,523,msg copy_to_user user data error 8
[20260417051505]86,2026-04-17 05:15:05,7298,msg_scm3_ioctl,613,msg cmd(0xc0047301) handle error!
[20260417051506]EXT4-fs (mmcblk0gp1): mounted filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e ro without journal. Opts: noload
[20260417051506]EXT4-fs (mmcblk0gp1): unmounting filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e.
[20260417051518][I2C-08-RT<-5|-5>] M<0x21|0x1|0x0|1|10> L<1|1|0> C<1|0|1> E<0x801|0x0|0x0|0x1> I<0x0|0x710|0x0>
[20260417051527]coredump:do_coredump 767,task skynet 7421 7266:start
[20260417051527]coredump:coredump_wait 543,task skynet 7421 7266:star wait
[20260417051527]87,2026-04-17 05:15:27,7376,ipmb_cmd_check_data_readable,1285,ipmb check read was interrupted, ret = -4
[20260417051527][idump/coredump] tgid=7266 task[7421 skynet], out of the stack, dump size:5f5e000
[20260417051544]coredump:do_coredump 1029,task skynet 7421 7266:end
[20260417051547]88,2026-04-17 05:15:47,5678,gpio_cmd_get_int_event,1321,gpio_get_int_event errs(-4)
[20260417051547]rsyslogd (6301) used greatest stack depth: 10576 bytes left
[20260417051553]89,2026-04-17 05:15:53,7609,msg_sec_module_cmd_cpy_to_user,523,msg copy_to_user user data error 8
[20260417051553]90,2026-04-17 05:15:53,7609,msg_scm3_ioctl,613,msg cmd(0xc0047301) handle error!
[20260417051554]EXT4-fs (mmcblk0gp1): mounted filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e ro without journal. Opts: noload
[20260417051554]EXT4-fs (mmcblk0gp1): unmounting filesystem 5669ad5d-31d8-4027-8a79-916d6e85c70e.
[20260417051601]coredump:do_coredump 767,task skynet 5015 5015:start
[20260417051601]coredump:coredump_wait 543,task skynet 5015 5015:star wait
[20260417051601][idump/coredump] tgid=5015 task[5015 skynet], out of the stack, dump size:5f5e000
[20260417051607]coredump:do_coredump 767,task skynet 7563 7563:start
[20260417051607]coredump:coredump_wait 543,task skynet 7563 7563:star wait
[20260417051607][idump/coredump] tgid=7563 task[7563 skynet], out of the stack, dump size:5f5e000
[20260417051617]coredump:do_coredump 1029,task skynet 5015 5015:end
[20260417051619][I2C-08-RT<-5|-5>] M<0x21|0x1|0x0|1|10> L<1|1|0> C<1|0|1> E<0x801|0x0|0x0|0x1> I<0x0|0x710|0x0>
[20260417051624]coredump:do_coredump 1029,task skynet 7563 7563:end
[20260417051634]coredump:do_coredump 767,task skynet 5208 5208:start
[20260417051634]coredump:coredump_wait 543,task skynet 5208 5208:star wait
[20260417051634][idump/coredump] tgid=5208 task[5208 skynet], out of the stack, dump size:5f5e000
[20260417051649]coredump:do_coredump 1029,task skynet 5208 5208:end
[20260417051650]usb device-0: detach driver
[20260417051650]usb device-0: gadget disconnect
[20260417051650]usb device-0: reset configuration. gadget state <0>
[20260417051650]usb device-0: gadget unbind
[20260417051650]dwc3 8800000.dwc3: request ffff0f05703a6540 was not queued to ep1in
[20260417051650]dwc3 8800000.dwc3: request ffff0f05703a60c0 was not queued to ep2in
[20260417051650]hisport_cdev: 1342, listen thread end, ret=0
[20260417051652]EXT4-fs (mmcblk0gp3): unmounting filesystem 8109ea91-1457-4c48-81fd-ae0e36fee24d.
[20260417051652]EXT4-fs (loop0): unmounting filesystem 3fe8b642-0f76-409a-9d36-5dc2c195ac65.
[20260417051652]systemd-journald[1273]: Received client request to relinquish /var/log/journal/db12d62a826d4fa9ab8262511bf7b714 access.
[20260417051653]EXT4-fs (mmcblk0p9): unmounting filesystem cc4cc981-2a99-4410-b84b-098c954d89b1.
[20260417051654]systemd-shutdown[1]: Syncing filesystems and block devices.
[20260417051654]systemd-shutdown[1]: Sending SIGTERM to remaining processes…
[20260417051654]systemd-journald[1273]: Received SIGTERM from PID 1 (systemd-shutdow).
[20260417051655]systemd-shutdown[1]: Sending SIGKILL to remaining processes…
[20260417051655]systemd-shutdown[1]: Unmounting file systems.
20260417051655[8798]: Remounting ‘/data/trust’ read-only in with options ‘(null)’.
[20260417051655]EXT4-fs (mmcblk0p7): re-mounted 557ef19a-2dd2-4151-aae0-6964c112a9f6 ro. Opts: (null)
20260417051655[8799]: Unmounting ‘/data/trust’.
[20260417051655]EXT4-fs (mmcblk0p7): unmounting filesystem 557ef19a-2dd2-4151-aae0-6964c112a9f6.
20260417051655[8799]: Unmounted ‘/data/trust’.
20260417051655[8800]: Remounting ‘/data’ read-only in with options ‘(null)’.
20260417051655[8800]: Failed to remount ‘/data’ read-only: Device or resource busy
20260417051655[8801]: Unmounting ‘/data’.
20260417051655[8801]: Failed to unmount /data: Device or resource busy
20260417051655[8802]: Remounting ‘/data’ read-only in with options ‘(null)’.
20260417051655[8802]: Failed to remount ‘/data’ read-only: Device or resource busy
20260417051655[8803]: Unmounting ‘/data’.
20260417051655[8803]: Failed to unmount /data: Device or resource busy
[20260417051655]systemd-shutdown[1]: Not all file systems unmounted, 1 left.
[20260417051655]systemd-shutdown[1]: Deactivating swaps.
[20260417051655]systemd-shutdown[1]: All swaps deactivated.
[20260417051655]systemd-shutdown[1]: Detaching loop devices.
[20260417051655]systemd-shutdown[1]: Detaching loopback /dev/loop0.
[20260417051655]systemd-shutdown[1]: All loop devices detached.
[20260417051655]systemd-shutdown[1]: Stopping MD devices.
[20260417051655]systemd-shutdown[1]: All MD devices stopped.
[20260417051655]systemd-shutdown[1]: Detaching DM devices.
[20260417051655]systemd-shutdown[1]: Detaching DM /dev/dm-0 (253:0).
[20260417051655]device-mapper: ioctl: systemd-shutdow[1]: dm-0 (mmcblksp) is removed successfully
[20260417051655]systemd-shutdown[1]: All DM devices detached.
[20260417051655]systemd-shutdown[1]: Unmounting file systems.
20260417051655[8808]: Remounting ‘/data’ read-only in with options ‘(null)’.
[20260417051655]EXT4-fs (mmcblk0p1): re-mounted 35d433be-5b11-4450-a419-b03ff6d9daba ro. Opts: (null)
20260417051655[8809]: Unmounting ‘/data’.
[20260417051655]EXT4-fs (mmcblk0p1): unmounting filesystem 35d433be-5b11-4450-a419-b03ff6d9daba.
20260417051655[8809]: Unmounted ‘/data’.
[20260417051655]systemd-shutdown[1]: All filesystems unmounted.
[20260417051655]systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[20260417051655]systemd-shutdown[1]: Syncing filesystems and block devices.
[20260417051655]systemd-shutdown[1]: Rebooting.
[20260417051655]kvm: exiting hardware virtualization
[20260417051655]Kernel Offset: 0x3cbfe1600000
[20260417051655]Module Offset: 0x3cbf6ce8d000
[20260417051655]Linear Region Offset: 0xf0540000000
[20260417051655]xhci-hcd xhci-hcd.1.auto: Host halt failed, -110
[20260417051655]xhci-hcd xhci-hcd.0.auto: Host halt failed, -110
[20260417051656]91,2026-04-17 05:16:56,1,hi_mci_wait_card_idle,179,Busy; trying anyway
[20260417051657]reboot: Restarting system
[20260417051657]KBOX: [machine_restart:238]
------------------kbox restart:[ 0.586646]----------------
KBOX: load OK
tasklock: sched_ctrl created.
registered taskstats version 1
Loading compiled-in X.509 certificates
page_owner is disabled

答案内容由AI生成,仅供参考:

结论

本次S920X20在AC测试中出现的BMC重启问题,根本原因是openUBMC的skynet核心进程因IPMB通信异常导致频繁崩溃,进而触发看门狗超时,引发硬件级重启,原因码为0x10

虽然AC循环是诱因(引入电气扰动),但本质是软件健壮性不足,未能妥善处理底层通信失败,最终演变为系统性故障。

点击此处查看详细分析解答

S920X20 AC测试中BMC重启问题分析报告

问题概述

在对S920X20型号服务器进行长时间AC(Alternating Current)循环测试过程中,出现了BMC(Baseboard Management Controller)异常重启的问题。该问题在持续运行两天的测试中发生了两次,具有可重现性,属于非偶然现象。

  • 设备型号:S920X20(基于Kunpeng 920架构)
  • BMC版本:openUBMC(日志中显示为2025.12版本)
  • 问题特征:AC测试运行数圈后,BMC自动重启
  • 告警事件The openUBMC is reset, with the cause code of 0x10

核心日志分析

1. 告警事件日志

在BMC重启时产生了以下关键事件:

The openUBMC is reset, with the cause code of 0x10.

这表明BMC因某种原因被系统强制重置,原因码为 0x10

:warning: 说明:该事件对应知识库中的实体 “The OpenUBMC Is Reset With The Cause Code Of 0x10”,表示一次由特定故障触发的BMC重置操作。


2. app.log 日志分析

2026-04-17 05:15:47左右,BMC开始执行驱动卸载流程:

时间戳 日志内容 分析
05:15:47 killall busybox_x 开始终止服务进程
05:15:47 - 05:15:51 成功卸载多个驱动
bmc_veth_drv, bmc_tpcm_drv, bmc_cdev_drv, bmc_edma_drv
系统正进入关机或重启流程
05:15:50 rmmod: can't unload module ‘bmc_hisport_drv’: Resource temporarily unavailable(重复N次) 关键线索! bmc_hisport_drv 驱动因资源被占用无法卸载,持续阻塞
05:16:50 Remove bmc_hisport_drv successfully 经过约60秒延迟后最终卸载成功

随后出现bios ERROR: bios file not exist以及时间回退到1970-01-01,说明系统已失去电源或正在进行冷启动。

最终上报事件:

2026-04-17 05:18:35.665326 event NOTICE: software_event.lua(158): 
[BMC_BMC.BMCRestartWithCause_["0x10"]___] generate an event [assert].

→ 明确记录了以原因码0x10触发BMC重启


3. kbox_info 日志分析

时间戳:2026-04-17 05:12:38 起,日志中频繁出现coredump记录:

[20260417051238] coredump:do_coredump ... task skynet ...
[20260417051309] coredump:do_coredump ... task skynet ...
[20260417051355] coredump:do_coredump ... task skynet ...
...

核心线索:

  • 进程名称skynet
  • 错误类型
    • ipmb check read was interrupted, ret = -4
    • msg copy_to_user user data error 8
    • msg cmd(0xc0047301) handle error!

:white_check_mark: 结论skynet 进程(openUBMC的底层运行框架)持续崩溃并生成core dump。

进一步推断:

  • skynet作为openUBMC的核心框架,其持续崩溃将导致看门狗(Watchdog)超时未喂狗
  • 触发系统硬件级重启 → 导致BMC整机复位
  • 最终上报cause code 0x10(可能是“看门狗超时重启”或其他内部定义)

根本原因分析(Root Cause)

综合判断:Skynet 进程频繁崩溃 → 触发看门狗超时 → BMC 硬件重启

支持证据链如下:

证据来源 内容 意义
kbox_info.log 多次coredump: task skynet 表明BMC核心框架进程崩溃
kbox_info.log ipmb check read was interrupted IPMB通信异常,可能源自MCU/CPLD通信问题
kbox_info.log msg copy_to_user user data error 内核态向用户态拷贝数据失败,典型内存/地址错误
app.log bmc_hisport_drv 卸载失败长达1分钟 驱动被占用,说明有进程未正常退出(如skynet崩溃后残留)
最终事件 BMCRestartWithCause_["0x10"] 崩溃累积导致系统级BMC重启

崩溃链路推测(Crash Chain)

graph TD
    A[AC Power Cycle Stress] --> B[IPMB/I2C通信抖动]
    B --> C[skynet进程中IPMB消息处理异常]
    C --> D[ipmb_cmd_check_data_readable 返回 -4]
    D --> E[msg copy_to_user 失败]
    E --> F[skynet 进程崩溃 + core dump]
    F --> G[看门狗未及时喂狗]
    G --> H[硬件触发BMC重启]
    H --> I[驱动卸载延迟 & bmc_hisport_drv资源占用]
    I --> J[系统重启缓慢完成]

问题定位与解决建议

:white_check_mark: 确认事实

项目 状态
是否为软件问题? :white_check_mark:(BMC固件层)
是否与硬件相关? 间接相关(AC循环引入瞬态干扰)
关键模块 skynet 框架、msg_scm3 驱动、ipmb 通信
关键驱动 bmc_hisport_drv(被长期占用)

:wrench: 解决建议

1. 定位core dump根源

尽快递交core-skynet-*.dump文件,使用GDB进行解析:

/data/gdb_1711 /opt/bmc/skynet/skynet core-skynet-xxx
(gdb) bt
(gdb) info registers

重点查看崩溃是否发生在msg_sec_module_cmd_cpy_to_useripmb_cmd_check_data_readable函数路径上。

参考知识库第4篇《场外解 coredump 的方法》。


2. 加强IPMB错误处理机制

当前ipmb check read was interrupted, ret = -4(EINTR)未做容错处理,应增加重试机制,避免直接抛出异常导致主线程崩溃。

建议在ipmb消息处理层加入:

if (ret == -4) {
   retry_count++;
   if (retry_count < MAX_RETRY) continue;
}

3. 修复内核模块msg_scm3问题

日志中反复出现:

msg copy_to_user user data error 8
msg cmd(0xc0047301) handle error!

→ 说明msg_scm3_ioctl在执行copy_to_user时失败,返回错误码8(通常为EFAULT,即无效地址)

建议

  • 检查传入参数合法性
  • 增加空指针和边界检查
  • 使用access_ok()前置验证地址可访问性

4. 优化驱动卸载逻辑

bmc_hisport_drv被占用近60秒,表明其线程或资源未被及时释放。

建议

  • 检查是否有listen thread未被signal中断
  • 在模块exit函数中增加超时机制强制清理
  • 调整rmmod等待策略,避免系统挂起

5. 升级openUBMC版本

当前版本为“2025.12”,可能存在已知漏洞。建议升级至更稳定版本(如知识库提到的25.12 LTS或后续版本),并确认该问题是否已在修复列表中。

参考知识库中BMC Framework的描述:“在适应9500 HBA卡代码后,AC测试中触发了框架崩溃”——表明类似场景已有先例。


结论

本次S920X20在AC测试中出现的BMC重启问题,根本原因是openUBMC的skynet核心进程因IPMB通信异常导致频繁崩溃,进而触发看门狗超时,引发硬件级重启,原因码为0x10

虽然AC循环是诱因(引入电气扰动),但本质是软件健壮性不足,未能妥善处理底层通信失败,最终演变为系统性故障。


后续建议

  1. :white_check_mark: 提取并分析core dump文件,定位精确崩溃函数
  2. :white_check_mark: 在测试环境中模拟IPMB中断场景,验证修复方案
  3. :white_check_mark: 提交bug至openUBMC社区,推动框架层修复
  4. :white_check_mark: 在AC测试前部署日志监控脚本,实时捕获core dump生成
  5. :white_check_mark: 将“coredump检测”纳入自动化测试必检项(见知识库《openUBMC通用观察点分享》)

部分日志.zip (575.3 KB)

core-skynet-11-framework-20260402054633.zip (4.2 MB)

core-skynet-11-security-20260402054633.zip (4.4 MB)

其他coredump文件.zip (493.7 KB)

能发完整的一键收集日志吗