Abnormal reading on the NPU card Chip 1 temperature sensor

The device was power cycled continuously. On the 105th cycle, the PCIe5 Chip 1 temperature sensor was found to have no reading.
The PCIe5 device is an Atlas 300I Duo NPU card.

The file vpd\vendor\Huawei\Gpu\14140130_19e5d500_02000110.sr defines the NPU chip temperature sensors. According to this definition, the temperature is read from NPUCard_1.Core1TemperatureCelsius.

    "ThresholdSensor_GPUAICore1Temp": {
        "OwnerId": 32,
        "OwnerLun": 0,
        "EntityId": "<=/Entity_GPUCard.Id",
        "EntityInstance": "<=/Entity_GPUCard.Instance",
        "Initialization": 127,
        "Capabilities": 104,
        "SensorType": 1,
        "ReadingType": 1,
        "SensorName": "PCIe${Slot} Chip1 Temp",
        "AssertMask": 0,
        "DeassertMask": 0,
        "ReadingMask": 2056,
        "UpperNoncritical": 105,
        "PositiveHysteresis": 2,
        "NegativeHysteresis": 2,
        "Unit": 128,
        "BaseUnit": 1,
        "ModifierUnit": 0,
        "Linearization": 0,
        "M": 1,
        "RBExp": 0,
        "Analog": 1,
        "NominalReading": 25,
        "NormalMaximum": 0,
        "NormalMinimum": 0,
        "MaximumReading": 127,
        "MinimumReading": 128,
        "Reading": "<=/NPUCard_1.Core1TemperatureCelsius",
        "ReadingStatus": "<=/NPUCard_1.Core1TemperatureCelsius |> expr($1 >= 255 ? 1 : 0)"
    },
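
The ReadingStatus line in the definition above can be read as a small ternary expression. The following is a minimal Python sketch of that expression only, assuming `$1` stands for the value of NPUCard_1.Core1TemperatureCelsius. The function name is hypothetical, and the meaning of a status of 1 (presumably "reading unavailable") is an assumption, not something the definition file states.

```python
# Sketch of the expression: expr($1 >= 255 ? 1 : 0)
# Assumption: $1 is the current value of NPUCard_1.Core1TemperatureCelsius.

def reading_status(core_temp: int) -> int:
    """Return 1 when the source value is 255 or higher, otherwise 0."""
    return 1 if core_temp >= 255 else 0


if __name__ == "__main__":
    # 75 is the Core1TemperatureCelsius value reported in this thread.
    print(reading_status(75))   # status for a normal temperature value
    print(reading_status(255))  # status for the 255 boundary value
```

With the value 75 seen in this report, the expression evaluates to 0, so the status field by itself does not explain a missing reading.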

Checking the NPUCard object shows that both Core0TemperatureCelsius and Core1TemperatureCelsius have readings.

% lsobj NPUCard
NPUCard_1_0101010902

% lsprop NPUCard_1_0101010902
bmc.kepler.Inventory.Hardware
AssetName="PCIeCard5"
AssetTag="N/A"
AssetType="PCIe NPU Card"
FirmwareVersion="7.1.0.6.220"
ManufactureDate="N/A"
Manufacturer="Huawei"
Model="Atlas_300I_Duo"
PCBVersion=""
PartNumber="03029WRV"
SerialNumber="2106030737ZEPC013116"
Slot="5"
UUID=""
bmc.kepler.Object.Properties
ClassName="NPUCard"
ObjectIdentifier=[1,"1","","0101010902"]
ObjectName="NPUCard_1_0101010902"
TraceEnabled=false
bmc.kepler.Systems.NPUCard
BoardID=177
ChipFaultDescription=""
ChipHealthStatus=0
Core0TemperatureCelsius=77
Core1TemperatureCelsius=75
FaultCode="Error code: NA"
FaultState=0
FirmwareVersion="7.1.0.6.220"
InletTemperatureCelsius=68
McuFirmwareVersion="24.7.8"
MemoryCapacityMiB=87974
Name="Atlas 300I Duo Inference Card"
OutletTemperatureCelsius=57
PcbVersion=".E"
PowerState=0
PowerWatts=519
SerialNumber="2106030737ZEPC013116"
SlotNumber=5
Private
CardDescription="Atlas 300I Duo Inference Card PCI-E 1*16x(FHFL)"
CardPartNumber="03029WRV"
DeviceName="PCIe Card 5 (Atlas 300I Duo)"
LockChip="Chip_Dmini_0101010902"
RefChip="Chip_Dmini_0101010902"
RefEeprom="Chip_Dmini_Elabel_0101010902"
RefFrudata="FruData_NPUCard_0101010902"

Checking the sensor object shows that Reading is also present (75), but ReadingDisplay is -1.000, which is abnormal.

% lsprop ThresholdSensor_GPUAICore1Temp_0101010902
bmc.kepler.Object.Properties
ClassName="ThresholdSensor"
ObjectIdentifier=[1,"1","","0101010902"]
ObjectName="ThresholdSensor_GPUAICore1Temp_0101010902"
TraceEnabled=false
bmc.kepler.Systems.ThresholdSensor
Capabilities=104
EntityId=11
EntityInstance=101
LowerCritical=0
LowerNoncritical=0
LowerNonrecoverable=0
NegativeHysteresis=2
OriginalReading=0
OwnerLun=0
PositiveHysteresis=2
Reading=75
ReadingMask=2056
ReadingStatus=0
SensorIdentifier=""
SensorName="PCIe5 Chip1 Temp"
SensorNumber=162
SensorType=1
UpperCritical=0
UpperNoncritical=105
UpperNonrecoverable=0
bmc.kepler.Systems.ThresholdSensorDisplay
AssertStatus=0
Health="OK"
LowerCriticalDisplay="0.000"
LowerNoncriticalDisplay="0.000"
LowerNonrecoverableDisplay="0.000"
NegativeHysteresisDisplay="2.000"
PositiveHysteresisDisplay="2.000"
ReadingDisplay="-1.000"
Status="Enabled"
UnitDisplay="degrees C"
UpperCriticalDisplay="0.000"
UpperNoncriticalDisplay="105.000"
UpperNonrecoverableDisplay="0.000"
Private
Accuracy=0
Analog=1
AssertMask=0
B=0
BA=0
BaseUnit=1
BelongsToSystem=false
DeassertMask=0
Initialization=127
Linearization=0
M=1
MT=0
MaximumReading=127
MinimumReading=128
ModifierUnit=0
NominalReading=25
NormalMaximum=0
NormalMinimum=0
OwnerId=32
RBExp=0
ReadingType=1
Unit=128
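
As a cross-check on the values in this dump, the sketch below applies the standard IPMI linear conversion, y = (M * x + B * 10^Bexp) * 10^Rexp, to the sensor's factors (M=1, B=0, RBExp=0, Unit=128). It assumes that ReadingDisplay is derived from the raw reading with this formula and that bits 7:6 of Unit select the analog data format as in an IPMI full sensor record. How the BMC actually computes ReadingDisplay is not confirmed here, and the function names are hypothetical.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def ipmi_linear(raw: int, m: int, b: int, rb_exp: int, unit: int) -> float:
    """Convert a raw 8-bit reading using the IPMI linear formula.

    Assumption: `unit` is the Sensor Units 1 byte, whose bits 7:6 give the
    analog data format (0 = unsigned, 2 = two's complement).
    """
    analog_format = (unit >> 6) & 0x3
    if analog_format == 2:
        raw = sign_extend(raw, 8)
    elif analog_format != 0:
        raise ValueError("analog data format not handled in this sketch")
    r_exp = sign_extend(rb_exp >> 4, 4)  # high nibble: result exponent
    b_exp = sign_extend(rb_exp, 4)       # low nibble: B exponent
    return float((m * raw + b * 10 ** b_exp) * 10 ** r_exp)


if __name__ == "__main__":
    # Factors taken from the dump above: M=1, B=0, RBExp=0, Unit=128.
    print(f"{ipmi_linear(75, 1, 0, 0, 128):.3f}")
    print(f"{ipmi_linear(255, 1, 0, 0, 128):.3f}")
```

Under these assumptions a raw reading of 75 converts to 75.000, while a raw value of 255 (0xFF) converts to -1.000 in two's complement. That matches the displayed value, but it is only an observation and not a confirmed root cause.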

Tried setting ReadingDisplay of PCIe5 Chip1 to 80: the PCIe5 Chip1 Temp sensor still shows no value in the web UI.
Tried setting ReadingDisplay of PCIe5 Chip0 to 80: the PCIe5 Chip0 Temp sensor value in the web UI is updated.
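
To catch this state automatically during a long power cycle run, saved lsprop output could be checked for a ReadingDisplay that disagrees with Reading. The sketch below is one hedged way to do that. It assumes the key=value layout shown in this thread, and that M=1 and B=0 so the two fields should be numerically equal. The function names and the tolerance are hypothetical.

```python
# Quote characters to strip: straight quotes plus the curly quotes that
# forum software sometimes substitutes in pasted console output.
QUOTES = '"\u201c\u201d'


def parse_lsprop(text: str) -> dict:
    """Parse key=value lines from lsprop output; other lines are skipped."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            props[key] = value.strip(QUOTES)
    return props


def display_mismatch(props: dict, tolerance: float = 0.5) -> bool:
    """Return True when ReadingDisplay differs from Reading by more than tolerance."""
    reading = float(props["Reading"])
    display = float(props["ReadingDisplay"])
    return abs(display - reading) > tolerance


if __name__ == "__main__":
    sample = 'Reading=75\nReadingStatus=0\nReadingDisplay="-1.000"\n'
    print(display_mismatch(parse_lsprop(sample)))
```

Because keys from different interfaces land in one dictionary, a later duplicate key overwrites an earlier one. That is harmless for the sensor object shown here but worth keeping in mind for other objects.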

Related issue: "[Issue] Sensor intermittently shows no reading" (sensor repository on GitCode)