Fast hot-plugging of an NVMe drive causes the storage component to hang

Background

When an NVMe drive is rapidly hot-plugged, the storage component intermittently hangs.

Steps to reproduce

1. Rapidly hot-plug a direct-attached NVMe drive (3 seconds or longer).
2. Check whether the NVMe drive's insertion/removal log entries appear.

Observed behavior

The insertion log entry only appears about 4 minutes later.

PS

Rapidly toggling the NVMe presence signal with mdbctl setprop set Scanner_Drive4PresentAccessor_010102 bmc.kepler.Scanner Value 1/0 to simulate plug/unplug also reproduces the problem easily.
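The toggle above can be scripted. A minimal sketch, assuming the mdbctl command quoted in this post is available on the target BMC; the cycle count and interval are illustrative, and DRY_RUN=1 just prints the commands instead of executing them:

```shell
#!/bin/sh
# Sketch: simulate fast NVMe hot-plug by toggling the presence signal
# with the mdbctl command quoted above. Cycle count and interval are
# illustrative; set DRY_RUN=1 to print the commands instead of running.
PROP=Scanner_Drive4PresentAccessor_010102
SERVICE=bmc.kepler.Scanner

set_presence() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "mdbctl setprop set $PROP $SERVICE Value $1"
    else
        mdbctl setprop set "$PROP" "$SERVICE" Value "$1"
    fi
}

simulate_hotplug() {            # $1 = cycles, $2 = interval in seconds
    n=0
    while [ "$n" -lt "$1" ]; do
        set_presence 0          # drive "removed"
        sleep "$2"
        set_presence 1          # drive "inserted"
        sleep "$2"
        n=$((n + 1))
    done
}
```

With a short interval (e.g. simulate_hotplug 10 1) this drives the same fast insert/remove sequence seen in the hardware test.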

Initial analysis

2026-04-15 15:10:19.189497 event NOTICE: hardware_event.lua(570): Event_Hdd5Insert_010102|{"value":[0],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:10:19.467001 event NOTICE: hardware_event.lua(570): Event_Hdd5Removed_010102|{"value":[0],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:10:19.481646 storage NOTICE: object_manage.lua(228): start to remove objects, path: /bmc/kepler/ObjectGroup/0101020101
2026-04-15 15:10:19.482202 storage NOTICE: object_manage.lua(234): start to DeleteObject Nvme_1_0101020101
2026-04-15 15:10:19.485545 storage NOTICE: drive_collection.lua(141): Nvme 4 is removed
2026-04-15 15:10:19.652084 storage NOTICE: pd_identify_service.lua(129): Disk4 del
2026-04-15 15:10:19.654204 storage NOTICE: drive_object.lua(615): Disk4 is_nvme turn false
2026-04-15 15:10:19.849572 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class Nvme
2026-04-15 15:10:19.850329 storage NOTICE: object_manage.lua(234): start to DeleteObject ReplicaDrive_1_0101020101
2026-04-15 15:10:20.023073 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class ReplicaDrive
2026-04-15 15:10:20.023734 storage NOTICE: object_manage.lua(252): remove objects completely, path: /bmc/kepler/ObjectGroup/0101020101
2026-04-15 15:10:20.024127 storage NOTICE: object_manage.lua(228): start to remove objects, path: /bmc/kepler/ObjectGroup/01010201
2026-04-15 15:10:20.024435 storage NOTICE: object_manage.lua(234): start to DeleteObject VirtualVPDConnect_01010201
2026-04-15 15:10:20.025787 storage NOTICE: vpd_connector_collection.lua(39): vpd connector 4 del
2026-04-15 15:10:20.034959 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class VirtualVPDConnect
2026-04-15 15:10:20.037864 storage NOTICE: object_manage.lua(252): remove objects completely, path: /bmc/kepler/ObjectGroup/01010201
2026-04-15 15:10:25.530214 event NOTICE: hardware_event.lua(570): Event_Hdd5Insert_010102|{"value":[1],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:10:25.847278 storage NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/01010201, life cycle id: 1, count: 1, uptime: 444217 s
2026-04-15 15:10:25.884932 storage NOTICE: vpd_connector_collection.lua(35): vpd connector 4 add
2026-04-15 15:10:25.887158 storage NOTICE: object_manage.lua(332): add objects callback, path: /bmc/kepler/ObjectGroup/01010201, life cycle id: 1, count: 1, took 40ms, uptime: 444217 s
2026-04-15 15:10:25.890712 storage NOTICE: object_manage.lua(342): add objects completely, path: /bmc/kepler/ObjectGroup/01010201, life cycle id: 1, took 10ms, uptime: 444217 s
2026-04-15 15:10:25.913806 storage NOTICE: vpd_connector.lua(289): begin update_connector4
2026-04-15 15:10:25.956315 event NOTICE: hardware_event.lua(570): Event_Hdd5Removed_010102|{"value":[1],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:10:26.082109 storage NOTICE: pd_identify_service.lua(125): Disk4 add
2026-04-15 15:10:26.170214 storage NOTICE: vpd_connector.lua(229): connector4 classcode is 1, header_sum is 239, target sum is 239
2026-04-15 15:10:26.170593 storage NOTICE: vpd_connector.lua(267): connector get protocol is 0
2026-04-15 15:10:26.550948 storage NOTICE: object_manage.lua(312): start to add objects, path: /bmc/kepler/ObjectGroup/0101020101, life cycle id: 1, count: 2, uptime: 444217 s
2026-04-15 15:10:26.898927 storage NOTICE: drive_collection.lua(117): Nvme 4 is inserted
2026-04-15 15:10:26.902435 storage NOTICE: drive_collection.lua(134): NVMe :4 update system_id:1
2026-04-15 15:10:26.902866 storage NOTICE: drive_object.lua(571): Start update NVME static info, id:4 identify_pd:false.
2026-04-15 15:10:26.904001 storage NOTICE: drive_object.lua(621): Start update NVME info, id:4 identify_pd:false.
2026-04-15 15:10:26.904431 storage NOTICE: drive_object.lua(622): Disk4 is_nvme turn true
2026-04-15 15:10:27.171557 storage NOTICE: nvme_object.lua(272): NVMe4 load nvme-mi
2026-04-15 15:10:28.676774 storage NOTICE: nvme_object.lua(494): get disk4 NVMe MI 1.0
2026-04-15 15:10:29.046550 storage NOTICE: nvme_object.lua(494): get disk4 NVMe MI 1.0
2026-04-15 15:10:29.726975 storage ERROR: nvme_object.lua(555): get_multirecord_item_offset_len: Read chip failed
2026-04-15 15:10:29.727533 storage ERROR: nvme_object.lua(648): Read chip failed
2026-04-15 15:10:30.576438 event NOTICE: hardware_event.lua(570): Event_Hdd5Insert_010102|{"value":[0],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:10:30.769701 event NOTICE: hardware_event.lua(570): Event_Hdd5Removed_010102|{"value":[0],"type":"synchronization","source":{"properties":[{"Path":"/bmc/kepler/Systems/1/Storage/Drives/Drive_5_010102","Service":"bmc.kepler.storage","Interface":"bmc.kepler.Systems.Storage.Drive","Property":"Presence"}],"Default":255}}
2026-04-15 15:11:23.584565 maca ERROR: init.lua(62): [storage]HealthCheck failed, error: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include:the remote application did not send a reply, the messagebus security policy blocked the reply, the reply timeout expired,or the network connection was broken.
2026-04-15 15:11:23.650313 maca ERROR: init.lua(98): component[storage] is in an abnormal status 1 times
2026-04-15 15:12:53.744736 maca ERROR: init.lua(62): [storage]HealthCheck failed, error: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include:the remote application did not send a reply, the messagebus security policy blocked the reply, the reply timeout expired,or the network connection was broken.
2026-04-15 15:12:53.838197 maca ERROR: init.lua(98): component[storage] is in an abnormal status 2 times
2026-04-15 15:14:23.924241 maca ERROR: init.lua(62): [storage]HealthCheck failed, error: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include:the remote application did not send a reply, the messagebus security policy blocked the reply, the reply timeout expired,or the network connection was broken.
2026-04-15 15:14:24.026783 maca ERROR: init.lua(91): component[storage] is in an abnormal status 3 times, and it will attempt to restart for recovery

As the logs show, the storage component hung after 15:10:30 and only recovered once maca restarted it at 15:14:24.

2026-04-15 15:10:30.460207 [:00000010] hardware: lua call [0 to :10 : 1572206 msgsz = 0] error : /opt/bmc/skynet/lualib/skynet.lua:0: org.freedesktop.DBus.Error.UnknownInterface: Unknown interface bmc.kepler.Connector stack traceback:  [C]: in function 'error'  ./opt/bmc/libmc/lualib/mc/app_preloader.lua:74: in function ''  /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0> stack traceback:  [C]: in function 'error'  /opt/bmc/lualib/compat53/module.lua:176: in function ''  /opt/bmc/skynet/lualib/skynet.lua: in function </opt/bmc/skynet/lualib/skynet.lua:0>
2026-04-15 15:11:05.656191 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:11:35.671773 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:12:05.696600 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:12:35.731128 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:13:05.758759 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:13:35.807505 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:14:05.827521 [:00000000] hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)
2026-04-15 15:14:24.890290 [:00000012] interface: May overload, message queue length = 1994
2026-04-15 15:14:25.596657 [:00000002] hardware: LAUNCH snlua bootstrap
2026-04-15 15:14:25.680125 [:00000003] hardware: LAUNCH snlua launcher
2026-04-15 15:14:25.721082 [:00000004] hardware: LAUNCH snlua cdummy
2026-04-15 15:14:25.768661 [:00000005] hardware: LAUNCH harbor 0 4
2026-04-15 15:14:25.772348 [:00000006] hardware: LAUNCH snlua datacenterd
2026-04-15 15:14:25.813286 [:00000007] hardware: LAUNCH snlua service_mgr
2026-04-15 15:14:25.857467 [:00000008] hardware: LAUNCH snlua hica/subsys/hardware/service/main
2026-04-15 15:14:25.899262 [:00000009] hardware: LAUNCH snlua sd_bus
2026-04-15 15:14:26.590463 [:0000000a] hardware: LAUNCH snlua harbor
2026-04-15 15:14:26.701815 [:00000002] hardware: KILL self
2026-04-15 15:14:26.703892 [:0000000b] hardware: LAUNCH snlua general_hardware/service/main
2026-04-15 15:14:26.705095 [:0000000c] hardware: LAUNCH snlua network_adapter/service/main
2026-04-15 15:14:26.716270 [:0000000d] hardware: LAUNCH snlua mctpd/service/main
2026-04-15 15:14:26.726051 [:0000000e] hardware: LAUNCH snlua lsw/service/main
2026-04-15 15:14:26.742477 [:0000000f] hardware: LAUNCH snlua storage/service/main
2026-04-15 15:14:26.746193 [:00000010] hardware: LAUNCH snlua chassis/service/main
2026-04-15 15:14:28.832292 [:00000012] hardware: LAUNCH snlua mctpd/service/write_service
2026-04-15 15:14:29.024503 [:00000013] hardware: LAUNCH snlua mctpd/service/smbus_write_read_service

framework.log shows "hardware: A message from [ :00000008 ] to [ :0000000f ] maybe in an endless loop (version = 6244063)", confirming that the component was indeed stuck in a message loop.

Question

How should this kind of problem be further analyzed and localized?

The answer below was generated by AI and is for reference only:

:memo: Summary

Key conclusion:

The most likely root cause of the storage component hanging during fast NVMe hot-plug is:

On device removal, objects such as VirtualVPDConnect fail to clean up their background tasks, so when the newly inserted device initializes it hits resource conflicts or blocked communication and spins in an endless loop, ultimately leaving the D-Bus service unresponsive.

Recommended follow-up actions:

  1. :white_check_mark: Force-stop all associated tasks in the dtor (see the yelmh_kno07 proposal)
  2. :white_check_mark: Add timeouts and retry limits to every low-level chip read operation
  3. :white_check_mark: Collect gdb and strace data at the moment of the hang to confirm the blocking point
  4. :white_check_mark: File an issue at GitCode - openUBMC/storage to track the fix

Keywords: Read chip failed, del_object failed, NoReply, VirtualVPDConnect, dtor, task not stopped


Analysis report: storage component hang caused by fast NVMe hot-plug

:pushpin: Problem overview

During fast hot-plug of an NVMe drive, the storage component intermittently hangs. According to the logs, the last normal activity was at 2026-04-15 15:10:30; storage then stopped responding until the maca monitor restarted it about 4 minutes later (15:14:24).

Key symptoms:

  • After the hot-plug events, storage fails to process the newly inserted device in time
  • Once hung, the component no longer answers D-Bus requests (NoReply errors)
  • Recovery requires the external health-check mechanism (maca) to force a restart

:magnifying_glass_tilted_left: Log analysis and key clues

1. :white_check_mark: The normal plug/unplug flow started

The logs confirm that the system detected the hardware change and kicked off the device-management flow:

2026-04-15 15:10:25.530214 event NOTICE: Event_Hdd5Insert_010102 → insert event detected
2026-04-15 15:10:25.847278 storage NOTICE: start to add objects → object creation started
2026-04-15 15:10:26.898927 storage NOTICE: Nvme 4 is inserted → recognized as an NVMe device
2026-04-15 15:10:27.171557 storage NOTICE: NVMe4 load nvme-mi → NVMe-MI protocol stack loaded

This shows the system had entered device initialization.


2. :warning: Locating the key failure point

Consecutive errors occurred during device initialization:

2026-04-15 15:10:29.726975 storage ERROR: nvme_object.lua(555): get_multirecord_item_offset_len: Read chip failed
2026-04-15 15:10:29.727533 storage ERROR: nvme_object.lua(648): Read chip failed

This indicates that reading the VPD (Vital Product Data) via the NVMe-MI protocol over I²C failed.

Combined with the knowledge-base information:

  • The bmc.dev.NVMe.ProductInfo interface relies on low-level chip communication to obtain product information (model, serial number, etc.)
  • The GetNvmeDriveRawData and GetNvmeInfo methods retrieve raw data and detailed information
  • If the low-level communication fails, this can lead to blocking waits or unbounded retries

3. :red_exclamation_mark: Object deletion failures create potential resource conflicts

During device removal, several "unknown class" deletion errors appeared:

2026-04-15 15:10:19.849572 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class Nvme
2026-04-15 15:10:20.023073 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class ReplicaDrive
2026-04-15 15:10:20.034959 storage ERROR: factory.lua(113): c_factory:del_object failed: unkonw class VirtualVPDConnect

These are key objects managed by storage:

  • Nvme_Object: manages NVMe devices
  • ReplicaDrive: syncs serial-number information through the resource-collaboration interface (see Issue #9)
  • VirtualVPDConnect: reads protocol information from the NVMe drive (see search result 1)

Inference: under fast hot-plug, the old objects have not been fully cleaned up when the new insertion tries to register objects with the same names; this can cause inconsistent internal state or lock contention, ultimately leaving a task blocked or spinning in an endless loop.


4. :broken_heart: D-Bus unresponsiveness hangs the component

The final symptom is the maca health-check failure:

2026-04-15 15:11:23.584565 maca ERROR: [storage]HealthCheck failed, error: org.freedesktop.DBus.Error.NoReply

This shows the storage component could not answer any D-Bus request. Possible causes include:

  • The main event loop is blocked
  • A Lua coroutine is stuck in an endless wait
  • A resource leak exhausts memory or breaks scheduling

:puzzle_piece: Root-cause hypotheses

Putting the analysis together, the likely root causes of the storage hang are:

| Cause | Description |
| --- | --- |
| Blocked communication | Read chip failed shows the low-level link is broken; without a timeout or retry cap, a thread/coroutine can block forever |
| Object lifecycle defect | Under fast hot-plug, old objects (e.g. VirtualVPDConnect) are not destructed properly and conflict with newly created ones |
| Task not terminated | As in similar cases, the update_vpd_connector task is not stopped in the dtor, so later accesses touch a freed object |
| Unstable MCTP link | NVMe-MI over MCTP needs a reliable link; frequent plug/unplug can break endpoints (see document chunk 5) |

:hammer_and_wrench: Analysis and debugging suggestions

1. :white_check_mark: Enable more detailed debug logging

Add logging on the critical paths in nvme_object.lua and vpd_connector.lua:

DEBUG("Entering get_multirecord_item_offset_len for slot %d", self.Slot)
DEBUG("Calling vpd_nvme_mi_get_multirecord_item_offset with retries=%d", max_retries)

Focus on:

  • The call paths of the vpd_nvme_mi_get_* family of functions
  • Whether execution enters a dead loop or a long wait

2. :white_check_mark: Check object destructor logic

Verify that the dtor methods of the following objects are actually invoked:

function c_virtual_vpd_connect:dtor()
    local task_name = string.format("update_vpd_connector%s", self.Slot)
    local task = c_tasks.get_instance():get_task(task_name)
    if task then
        task:stop()  -- make sure the task is terminated
    end
    -- other cleanup logic
end

Per the known fix (document chunk 1), such tasks must be stopped explicitly in the destructor.


3. :white_check_mark: Add communication timeouts and degradation

Modify the Read chip path to add:

  • A maximum retry count (e.g. 3)
  • Timeout control (e.g. each operation under 1 s)
  • On failure, a fallback to default values or skipping of non-critical fields

This prevents a single device's communication failure from blocking the whole component.
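The retry-plus-timeout idea can be sketched at the shell level with coreutils timeout; the 3-retry / 1-second limits are the values suggested above, not taken from the storage code:

```shell
#!/bin/sh
# Sketch: bounded retry with a per-attempt timeout. Wraps any read
# command; once the retry budget is exhausted it gives up instead of
# blocking forever, so the caller can fall back to default values.
read_with_limit() {
    retries=3
    while [ "$retries" -gt 0 ]; do
        if timeout 1 "$@"; then
            return 0                 # read succeeded
        fi
        retries=$((retries - 1))
    done
    echo "read failed after all retries, falling back to defaults" >&2
    return 1
}
```

For example, read_with_limit i2cget -y 0 0x50 0x00 would bound a raw chip read to at most three 1-second attempts.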


4. :white_check_mark: Capture the hang site with gdb and strace

If the problem can be reproduced, run the following while the storage process is hung:

# Method 1: inspect the call stack
gdb -p $(pgrep storage)
(gdb) bt

# Method 2: trace system calls
strace -p $(pgrep storage) -f -o /tmp/storage_trace.log

Focus on:

  • Whether it is stuck in a read() / ioctl() / i2c_transfer() system call
  • Whether it is parked in a Lua coroutine wait

5. :white_check_mark: Check MCTP endpoint lifecycle management

Check the logs for lines such as:

creat disk6 endpoint successfully
mctp_engine: request timeout

(from document chunk 5)

Make sure that:

  • MCTP endpoints are destroyed correctly once the device is removed
  • A newly inserted device rebuilds its endpoint instead of reusing stale state
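Endpoint churn can be audited by counting the two message patterns quoted above per log file. The patterns are copied from the snippet; a destroy-side message string is not shown there, so only creates and timeouts are counted:

```shell
#!/bin/sh
# Count MCTP endpoint creations and request timeouts in a log file,
# using the two message patterns quoted above. A create count that
# keeps growing without matching teardown is a hint of endpoint reuse
# or leakage across plug/unplug cycles.
count_endpoint_events() {        # $1 = log file
    echo "endpoint creates: $(grep -c 'endpoint successfully' "$1")"
    echo "request timeouts: $(grep -c 'request timeout' "$1")"
}
```

Run it against the storage debug log collected around the hot-plug window.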

6. :white_check_mark: Read /proc/<pid>/stack for the kernel-side backtrace

cat /proc/$(pgrep storage)/stack

This tells you whether the process is blocked in a kernel driver layer (e.g. on an I2C bus lock).
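A complementary quick check, using only the standard Linux /proc interface, is the scheduling state: a process stuck on a kernel lock such as an I2C bus mutex typically shows state D (uninterruptible sleep):

```shell
#!/bin/sh
# Print the scheduling state of a process from /proc/<pid>/status.
# "D" = uninterruptible sleep (usually blocked inside a kernel driver),
# "S" = interruptible sleep, "R" = running.
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status"
}
```

Run proc_state $(pgrep storage) repeatedly while the component is hung; a persistent D state points at the kernel driver layer rather than a Lua-level loop.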


:white_check_mark: Known related issues and references

| Related issue | Link | Relevance |
| --- | --- | --- |
| Drive info unavailable after NVMe plug/unplug | https://gitcode.com/openUBMC/storage/issues/134 | :star: Highly relevant, same hot-plug failure class |
| VPD read failure loses drive info | [Issue] NVMe drives intermittently fail to report power-on hours and other info | Moderate, same communication-failure scenario |
| VirtualVPDConnect task not stopped | document chunk 1 solution | :star: Directly applicable |

:white_check_mark: Suggested short-term workarounds

| Measure | Description |
| --- | --- |
| Lengthen the plug/unplug interval | More than 10 s recommended, to avoid state corruption |
| Disable non-essential tasks | Do not start the periodic task before vpd_connector initialization completes |
| Tighten the watchdog | Shorten maca's restart-decision time to reduce the outage window |


Check which components :00000008 and :0000000f are, look at which resource-tree methods are called between them, and check whether any of those method implementations can loop forever.
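The address-to-component mapping can be pulled straight from the LAUNCH lines in framework.log: in the excerpt above, :00000008 maps to hica/subsys/hardware/service/main and :0000000f to storage/service/main. A small helper (the log file path is whatever you collected):

```shell
#!/bin/sh
# Resolve a skynet service address (e.g. 0000000f) to the service that
# was LAUNCHed under it, using the LAUNCH lines in framework.log.
map_address() {                  # $1 = log file, $2 = hex address
    grep "\[:$2\].*LAUNCH" "$1" | tail -n 1
}
```

For example, map_address framework.log 0000000f prints the line launching storage/service/main.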

This is the same issue as "hwproxy component hang on openUBMC 2512" (openUBMC forum, help & discussion board); it is fixed by updating libmc4lua to a version later than 1.100.79, i.e. the LTS SP1 or 2603 release.