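I could not find a matching alarm among the existing events, so I defined two custom PCIe card events: a fatal error and an HBM over-temperature alarm. Following the community documentation, I added the following to event_def.json: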
EventDefinition:
{ "EventCode": "0x08000027", "ReportChannel": 65535, "OldEventCode": "", "EventType": 0, "LifeCycleId": 1, "DeassertFlag": 1, "EventKeyId": "PCIeCard.PCIeCardFatalErr", "SeverityId": 3, "ActionId": 0, "EventName": "PCIeCardFatalErr" }, { "EventCode": "0x08000029", "ReportChannel": 65535, "OldEventCode": "", "EventType": 0, "LifeCycleId": 1, "DeassertFlag": 1, "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp", "SeverityId": 2, "ActionId": 0, "EventName": "PCIeCardHBMOverTemp" }
EventDescription:
{ "Suggestion": { "En": "1. After restarting the OS, confirm whether the fault is eliminated.@#AB;2. Contact the system administrator to reset the card.@#AB;3.Check hardware connections - PCIe slots, cable connections.", "Zh": "1、重启OS后确认故障是否消除。@#AB;2、联系系统管理员重置卡。@#AB;3、检查硬件连接 - PCIe插槽、线缆连接。" }, "EventKeyId": "PCIeCard.PCIeCardFatalErr", "Description": { "En": "The %1 %2 %3 triggered an fatal error. Error code: %4.", "Zh": "%1 PCIe卡%2(%3)触发了致命错误。故障码:%4。" }, "Influence": { "En": "The PCIe card may run unstably, and the system may stop responding.", "Zh": "可能导致PCIe卡运行不稳定,系统停止响应。" }, "Cause": { "En": "1. The PCIe card is faulty.@#AB;2. The mainboard is faulty.", "Zh": "1、PCIe卡故障。@#AB;2、主板故障" } }, { "Suggestion": { "En": "1. Check the equipment room temperature.@#AB;2. Check for fan alarms.@#AB;3. Check for air inlet or outlet blockage.@#AB;4. Replace the pciecard where the pcie slot is located.", "Zh": "1、检查机房环境温度是否已超出设备运行环境要求。@#AB;2、检查服务器是否同时存在风扇告警。@#AB;3、检查服务器进风口/出风口是否有异物堵塞。@#AB;4、更换PCIe槽位所在的板卡。" }, "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp", "Description": { "En": "The %1 %2 %3 HBM current temperature exceeds 95°C (Frequency reduction has been triggered). Error code: %4.", "Zh": "%1 PCIe卡%2(%3) HBM 的当前温度高于95℃(已触发降频)。故障码:%4。" }, "Influence": { "En": "The PCIe card may run at a reduced frequency and unstable, and the system may slow responding.", "Zh": "可能导致PCIe卡降频运行,不稳定,系统响应缓慢。" }, "Cause": { "En": "1. The ambient temperature is too high.@#AB;2. The air inlet or outlet is blocked.@#AB;3. The fan module is faulty.@#AB;4. The HBM is faulty.", "Zh": "1、环境温度过高。@#AB;2、进风口/出风口堵塞。@#AB;3、风扇模块故障。@#AB;4、PCIe card HBM故障。" } }
After adding the above, I added the following to eventDefList.txt:
PCIeCard.PCIeCardHBMOverTemp
PCIeCard.PCIeCardFatalErr
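As a sanity check on the two files above, the sketch below reports any key listed in eventDefList.txt that lacks a matching entry in event_def.json. It assumes that event_def.json has top-level `EventDefinition` and `EventDescription` arrays and that eventDefList.txt holds one EventKeyId per line; the function name `missing_keys` is hypothetical and is not part of any BMC tooling.

```python
import json
import sys


def missing_keys(def_doc, list_text):
    """Return listed EventKeyIds that lack a definition or a description.

    Assumption: def_doc has top-level "EventDefinition" and
    "EventDescription" arrays whose items carry an "EventKeyId" field.
    """
    defined = {e["EventKeyId"] for e in def_doc.get("EventDefinition", [])}
    described = {e["EventKeyId"] for e in def_doc.get("EventDescription", [])}
    # One key per non-empty line (assumed layout of eventDefList.txt).
    listed = [ln.strip() for ln in list_text.splitlines() if ln.strip()]
    return [k for k in listed if k not in defined or k not in described]


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python check_list.py event_def.json eventDefList.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        doc = json.load(f)
    with open(sys.argv[2], encoding="utf-8") as f:
        text = f.read()
    print(missing_keys(doc, text))
```

An empty result only shows that the keys line up between the two files. It says nothing about whether the running service has actually loaded them.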
Then I added the CSR configuration for the events:
"Event_PCIeCardFE": { "EventKeyId": "PCIeCard.PCIeCardFatalErr", "Condition": 1, "Reading": "<=/PCIeDevice_1.FatalError", "OperatorId": 5, "Enabled": true, "AdditionalInfo": "2", "DescArg2": "#/Component_PCIeCard.Name", "DescArg4": "NA", "Component": "#/Component_PCIeCard", "LedFaultCode": "q$$" }, "Event_PCIeCardHBMOT": { "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp", "Condition": 1, "Reading": "<=/GPU_1.HBMTempWarning", "OperatorId": 5, "Enabled": true, "AdditionalInfo": "2", "DescArg2": "#/Component_PCIeCard.Name", "DescArg4": "NA", "Component": "#/Component_PCIeCard" }
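The first event above, Event_PCIeCardFE, can be triggered, but after I changed its EventCode the reported event code is still the previously configured one, 0x800000B1 (the value I originally set in event_def.json), not the new 0x08000027. The second event is not triggered at all, even though its trigger condition is known to have been met: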
![]()
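Questions: 1. Event 1: after I modified EventCode, why is the reported event code still the old one?
2. Event 2 is not triggered. Is that also related to EventCode?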
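What I have tried so far: 1. Swapped the Reading of the two events; still only 0x800000B1 is triggered --> this rules out the trigger condition not being met.
2. Changed the EventCode back to the previous 0x800000B1 and set event 2 to 0x800000B3; event 2 was still not triggered --> not sure whether this is related to the event code.
3. Checked event_def.json under /opt/bmc/conf and confirmed that the file contains my changes --> this rules out the file being left out of the built package.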
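Step 3 can be made more precise than a visual check. The sketch below lists the EventCode entries found for each EventKeyId and flags codes that are numerically equal, for example `0x08000027` and `0x8000027`. It assumes the file has a top-level `EventDefinition` array; the helper names `codes_by_key` and `duplicated_codes` are hypothetical. It only inspects the file contents and cannot show what the running event service has cached.

```python
import json
import sys
from collections import defaultdict


def codes_by_key(def_doc):
    """Map each EventKeyId to every EventCode string defined for it."""
    result = defaultdict(list)
    for entry in def_doc.get("EventDefinition", []):
        result[entry["EventKeyId"]].append(entry["EventCode"])
    return dict(result)


def duplicated_codes(def_doc):
    """Return numerically equal event codes shared by several keys."""
    seen = defaultdict(list)
    for entry in def_doc.get("EventDefinition", []):
        # Compare as integers so leading zeros do not hide a clash.
        seen[int(entry["EventCode"], 16)].append(entry["EventKeyId"])
    return {hex(code): keys for code, keys in seen.items() if len(keys) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_codes.py /opt/bmc/conf/event_def.json
    with open(sys.argv[1], encoding="utf-8") as f:
        doc = json.load(f)
    for key, codes in codes_by_key(doc).items():
        if key.startswith("PCIeCard."):
            print(key, codes)
    print("duplicates:", duplicated_codes(doc))
```

If a key shows more than one code, or the new code clashes with an existing definition, that would be worth ruling out before looking at the CSR side.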
