hk1089
(Hk1089)
1
Problem description
The BMC Web interface reports a minor alarm with event code 0x1A000043 and the event description: The data written to the NAND flash in last 15 days exceeds 12G.
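Environment information
Steps to reproduce
- The BMC Web interface reports a minor alarm with event code 0x1A000043 and the event description: The data written to the NAND flash in last 15 days exceeds 12G.
- In the one-click collected logs, open dump_info\AppDump\bmc_soc\nandflash_info.txt. If the "Total data written in 15 days" entry actually exceeds 12G, the alarm is genuine and not a false alarm.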
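The check in the second step can be scripted. The following is only a minimal sketch: the layout of nandflash_info.txt is not shown in this thread, so the assumption here is that the file contains a line with the text "Total data written in 15 days" followed by a number and an optional K/M/G/T unit. The function names and the sample values in the usage note are hypothetical.

```python
import re

THRESHOLD_G = 12.0  # threshold quoted in the event description

KEY = "Total data written in 15 days"

# Multipliers to convert a value into G (binary multiples are an assumption).
_UNIT_TO_G = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def written_g(text):
    """Return the value of the KEY entry in G, or None if the entry is absent."""
    for line in text.splitlines():
        if KEY not in line:
            continue
        # Only look at the text after the key, so the "15" in the key is skipped.
        tail = line.split(KEY, 1)[1]
        m = re.search(r"(\d+(?:\.\d+)?)\s*([KMGT])?", tail, re.IGNORECASE)
        if m:
            value = float(m.group(1))
            # Assumption: a bare number without a unit is already in G.
            unit = (m.group(2) or "G").upper()
            return value * _UNIT_TO_G[unit]
    return None


def exceeds_threshold(text, threshold_g=THRESHOLD_G):
    """True if the reported 15-day write volume is above the threshold."""
    value = written_g(text)
    return value is not None and value > threshold_g
```

For example, passing the contents of the file read with open(...).read() to exceeds_threshold returns True when the entry is above 12G. If the real file uses a different layout or unit, the regular expression needs to be adjusted.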
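Expected result
The NAND flash should not see this much write traffic; in other words, this alarm should not be raised.
Actual result
Excessive NAND flash writes (data written to NAND flash in last 15 days exceeds 12G)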
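Solutions attempted
1. Checked the one-click collected logs: the total log volume over the past half month does not exceed 500M.
2. Checked the number of uploaded firmware packages and estimated their total size, which is also far below 12G.
3. From the internal wiki, noted that other causes are possible, such as repeated reads and writes of the same files, or frequent database writes combined with write amplification, but there is no means of tracking them.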
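One possible way to narrow down item 3 is to compare the per-process write counters that Linux exposes in /proc/<pid>/io. The sketch below assumes the BMC runs a Linux kernel with task I/O accounting enabled and that a Python interpreter is available on the device (or that the files are copied off and parsed elsewhere); neither is confirmed in this thread. The write_bytes counter reflects data the process caused to be sent to storage, so it does not include write amplification inside the flash layer. All function names are hypothetical.

```python
import os


def parse_write_bytes(io_text):
    """Extract the write_bytes counter from the text of /proc/<pid>/io."""
    for line in io_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "write_bytes":
            return int(value.strip())
    return None


def snapshot(proc_root="/proc"):
    """Map pid -> (command name, write_bytes) for every readable process."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "io")) as f:
                written = parse_write_bytes(f.read())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited or access was denied
        if written is not None:
            result[int(entry)] = (name, written)
    return result


def top_writers(before, after, limit=10):
    """Rank processes by bytes written between two snapshots."""
    deltas = []
    for pid, (name, written) in after.items():
        previous = before.get(pid, (name, 0))[1]
        deltas.append((written - previous, pid, name))
    deltas.sort(reverse=True)
    return deltas[:limit]
```

Taking one snapshot(), waiting for a representative interval (for example ten minutes), taking a second snapshot() and passing both to top_writers gives a ranked list of (bytes, pid, name) tuples. Reading /proc/<pid>/io for other users' processes normally requires root.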
linyao
(Linyao)
5
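Looking through the logs, every log file contains debug-style output like the following, printed once every few minutes.
It may be worth checking whether this flood of debug logging is the cause.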
2026-02-07 08:44:36.038675 compute ERROR: NPUCard.lua(661): time_out ========== 9
2026-02-07 08:44:36.040206 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:44:41.029839 compute ERROR: NPUCard.lua(661): time_out ========== 8
2026-02-07 08:44:41.030448 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:44:46.031416 compute ERROR: NPUCard.lua(661): time_out ========== 7
2026-02-07 08:44:46.032093 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:44:46.847976 bios NOTICE: bios_object_mutihost.lua(842): [bios]wait password ack timeout: reset password(0) status.
2026-02-07 08:44:46.878110 bios NOTICE: bios_object_mutihost.lua(842): [bios]wait password ack timeout: reset password(1) status.
2026-02-07 08:44:51.033438 compute ERROR: NPUCard.lua(661): time_out ========== 6
2026-02-07 08:44:51.034867 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:44:56.033728 compute ERROR: NPUCard.lua(661): time_out ========== 5
2026-02-07 08:44:56.034320 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:45:01.031061 compute ERROR: NPUCard.lua(661): time_out ========== 4
2026-02-07 08:45:01.031595 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:45:06.034851 compute ERROR: NPUCard.lua(661): time_out ========== 3
2026-02-07 08:45:06.035345 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:45:11.030327 compute ERROR: NPUCard.lua(661): time_out ========== 2
2026-02-07 08:45:11.032033 compute ERROR: NPUCard.lua(662): McuFirmwareVersion = 1 MemoryCapacityKiB = 1 Elabel = 0 DeviceTemperature = 1 PcbVersion = 1 Capabilities = 1 BoardID = 1 PowerWatts = 1 FirmwareVersion = 1 McuTime = 1
2026-02-07 08:45:11.984740 bios NOTICE: bdf_service.lua(266): write_one_frame: pcie bdf reported, set bios boot stage completed.
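To judge whether these prints are large enough to matter, the collected log files can be aggregated by source location. The following is a rough sketch that assumes every line follows the layout seen above (timestamp, module, level, then file(line) and the message); the function names are hypothetical and lines that do not match are simply skipped.

```python
import re
from collections import Counter

# timestamp, module, level, source file, source line
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (\S+) (\w+): ([^\s()]+)\((\d+)\): "
)


def bytes_by_source(lines):
    """Count lines and bytes per (module, file, line number)."""
    counts = Counter()
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        m = LINE_RE.match(line)
        if not m:
            continue
        key = (m.group(2), m.group(4), int(m.group(5)))
        counts[key] += 1
        sizes[key] += len(line.encode("utf-8")) + 1  # +1 for the newline
    return counts, sizes


def top_sources(lines, limit=10):
    """Return the sources that produced the most log bytes, largest first."""
    counts, sizes = bytes_by_source(lines)
    return [(key, counts[key], size) for key, size in sizes.most_common(limit)]
```

Feeding the lines of a log file into top_sources shows which print statements account for most of the volume. Dividing the byte totals by the time span the file covers gives a write rate that can be compared against the 12G per 15 days threshold.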
hk1089
(Hk1089)
8