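Custom event triggering problem

I could not find a matching alarm among the existing events, so I defined two custom events for the PCIe card: a fatal error event and an HBM over-temperature alarm. Following the community documentation, I added the following to event_def.json: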

EventDefinition:

    {
        "EventCode": "0x08000027",
        "ReportChannel": 65535,
        "OldEventCode": "",
        "EventType": 0,
        "LifeCycleId": 1,
        "DeassertFlag": 1,
        "EventKeyId": "PCIeCard.PCIeCardFatalErr",
        "SeverityId": 3,
        "ActionId": 0,
        "EventName": "PCIeCardFatalErr"
    },
    {
        "EventCode": "0x08000029",
        "ReportChannel": 65535,
        "OldEventCode": "",
        "EventType": 0,
        "LifeCycleId": 1,
        "DeassertFlag": 1,
        "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp",
        "SeverityId": 2,
        "ActionId": 0,
        "EventName": "PCIeCardHBMOverTemp"
    }

EventDescription:

    {
        "Suggestion": {
            "En": "1. After restarting the OS, confirm whether the fault is eliminated.@#AB;2. Contact the system administrator to reset the card.@#AB;3.Check hardware connections - PCIe slots, cable connections.",
            "Zh": "1、重启OS后确认故障是否消除。@#AB;2、联系系统管理员重置卡。@#AB;3、检查硬件连接 - PCIe插槽、线缆连接。"
        },
        "EventKeyId": "PCIeCard.PCIeCardFatalErr",
        "Description": {
            "En": "The %1 %2 %3 triggered an fatal error. Error code: %4.",
            "Zh": "%1 PCIe卡%2(%3)触发了致命错误。故障码:%4。"
        },
        "Influence": {
            "En": "The PCIe card may run unstably, and the system may stop responding.",
            "Zh": "可能导致PCIe卡运行不稳定,系统停止响应。"
        },
        "Cause": {
            "En": "1. The PCIe card is faulty.@#AB;2. The mainboard is faulty.",
            "Zh": "1、PCIe卡故障。@#AB;2、主板故障"
        }
    },
    {
        "Suggestion": {
            "En": "1. Check the equipment room temperature.@#AB;2. Check for fan alarms.@#AB;3. Check for air inlet or outlet blockage.@#AB;4. Replace the pciecard where the pcie slot is located.",
            "Zh": "1、检查机房环境温度是否已超出设备运行环境要求。@#AB;2、检查服务器是否同时存在风扇告警。@#AB;3、检查服务器进风口/出风口是否有异物堵塞。@#AB;4、更换PCIe槽位所在的板卡。"
        },
        "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp",
        "Description": {
            "En": "The %1 %2 %3 HBM current temperature exceeds 95°C (Frequency reduction has been triggered). Error code: %4.",
            "Zh": "%1 PCIe卡%2(%3) HBM 的当前温度高于95℃(已触发降频)。故障码:%4。"
        },
        "Influence": {
            "En": "The PCIe card may run at a reduced frequency and unstable, and the system may slow responding.",
            "Zh": "可能导致PCIe卡降频运行,不稳定,系统响应缓慢。"
        },
        "Cause": {
            "En": "1. The ambient temperature is too high.@#AB;2. The air inlet or outlet is blocked.@#AB;3. The fan module is faulty.@#AB;4. The HBM is faulty.",
            "Zh": "1、环境温度过高。@#AB;2、进风口/出风口堵塞。@#AB;3、风扇模块故障。@#AB;4、PCIe card HBM故障。"
        }
    }

After adding the entries above, I added the following to eventDefList.txt:

PCIeCard.PCIeCardHBMOverTemp
PCIeCard.PCIeCardFatalErr
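
Because the same EventKeyId has to appear in three places (EventDefinition, EventDescription, and eventDefList.txt), a small consistency check can rule out typos before building a package. The following is only a sketch: it assumes that event_def.json has top-level `EventDefinition` and `EventDescription` arrays, as the labels above suggest, and `check_event_def` is a hypothetical helper, not part of openUBMC.

```python
def check_event_def(event_def, listed_keys):
    """Return a list of consistency problems (empty if none are found).

    event_def   -- parsed event_def.json (assumed layout, see lead-in)
    listed_keys -- EventKeyId lines read from eventDefList.txt
    """
    problems = []
    definitions = event_def.get("EventDefinition", [])
    descriptions = event_def.get("EventDescription", [])

    defined = [d["EventKeyId"] for d in definitions]
    described = {d["EventKeyId"] for d in descriptions}

    # Each EventCode should belong to exactly one definition.
    seen_codes = {}
    for d in definitions:
        code = d["EventCode"].lower()
        if code in seen_codes:
            problems.append(
                f"duplicate EventCode {d['EventCode']}: "
                f"{seen_codes[code]} and {d['EventKeyId']}"
            )
        seen_codes[code] = d["EventKeyId"]

    # Every defined key needs a description and a list entry.
    for key in defined:
        if key not in described:
            problems.append(f"{key}: no EventDescription entry")
        if key not in listed_keys:
            problems.append(f"{key}: missing from eventDefList.txt")

    # Every listed key should have a definition.
    for key in listed_keys:
        if key not in defined:
            problems.append(f"{key}: listed but not defined")
    return problems


sample = {
    "EventDefinition": [
        {"EventCode": "0x08000027", "EventKeyId": "PCIeCard.PCIeCardFatalErr"},
        {"EventCode": "0x08000029", "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp"},
    ],
    "EventDescription": [
        {"EventKeyId": "PCIeCard.PCIeCardFatalErr"},
        {"EventKeyId": "PCIeCard.PCIeCardHBMOverTemp"},
    ],
}
listed = ["PCIeCard.PCIeCardHBMOverTemp", "PCIeCard.PCIeCardFatalErr"]
print(check_event_def(sample, listed))  # prints []
```

For the two events in this post the check reports nothing, which matches the final outcome: the entries themselves were consistent.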

Then I added the CSR configuration for the events:

    "Event_PCIeCardFE": {
        "EventKeyId": "PCIeCard.PCIeCardFatalErr",
        "Condition": 1,
        "Reading": "<=/PCIeDevice_1.FatalError",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "DescArg4": "NA",
        "Component": "#/Component_PCIeCard",
        "LedFaultCode": "q$$"
    },
    "Event_PCIeCardHBMOT": {
        "EventKeyId": "PCIeCard.PCIeCardHBMOverTemp",
        "Condition": 1,
        "Reading": "<=/GPU_1.HBMTempWarning",
        "OperatorId": 5,
        "Enabled": true,
        "AdditionalInfo": "2",
        "DescArg2": "#/Component_PCIeCard.Name",
        "DescArg4": "NA",
        "Component": "#/Component_PCIeCard"
    }
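
The CSR objects refer to events only through `EventKeyId`, so a misspelled key would go unnoticed. A minimal sketch of a cross-check follows. It assumes the event objects sit in one flat mapping and are named with an `Event_` prefix, as in the fragment above; the real CSR file layout may differ, and `find_unknown_event_keys` is a hypothetical helper.

```python
def find_unknown_event_keys(csr_objects, known_keys):
    """Map CSR object name -> EventKeyId for keys that are not defined.

    csr_objects -- dict of CSR objects (assumed flat, see lead-in)
    known_keys  -- set of EventKeyId values from event_def.json
    """
    unknown = {}
    for name, obj in csr_objects.items():
        # Only objects that follow the Event_ naming seen in this post.
        if not name.startswith("Event_"):
            continue
        key = obj.get("EventKeyId")
        if key not in known_keys:
            unknown[name] = key
    return unknown


csr = {
    "Event_PCIeCardFE": {"EventKeyId": "PCIeCard.PCIeCardFatalErr"},
    "Event_PCIeCardHBMOT": {"EventKeyId": "PCIeCard.PCIeCardHBMOverTemp"},
}
known = {"PCIeCard.PCIeCardFatalErr", "PCIeCard.PCIeCardHBMOverTemp"}
print(find_unknown_event_keys(csr, known))  # prints {}
```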

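The first event above, Event_PCIeCardFE, can be triggered. However, after I changed its EventCode, the event is still reported with the previously configured code, 0x800000B1 (the value I originally put in event_def.json), not the new 0x08000027. The second event is not triggered at all, even though its trigger condition is known to be met:

[image]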

Questions:

1. Event 1: why is the event code still the old one after I changed EventCode?

2. Event 2 is not triggered. Is that also related to EventCode?

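What I have tried so far:

1. Swapped the Reading values of the two events; still only 0x800000B1 is triggered. ---> This rules out the trigger condition not being met.
2. Changed the EventCode back to 0x800000B1 and set event 2 to 0x800000B3; event 2 was still not triggered. ---> Not sure whether this is related to the event code.
3. Checked event_def.json under /opt/bmc/conf and confirmed that the file had been modified. ---> This rules out the file not being built into the package.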

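After re-reading "Event Customization | Documentation Center | openUBMC" and looking at the openUBMC commits that add events, I found that the version number of event_def.json has to be updated. After updating it, the events can be triggered.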

The EventCode mismatch I saw earlier happened because I had reverted my change to the version information :sweat_smile:
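
Since the root cause was an edit to event_def.json without a matching version update, a pre-build check could catch this case. The sketch below compares the old and new file contents and flags a change that leaves the version untouched. The field name `Version` and the sample values are assumptions made for illustration; use whatever version field the real file carries. `needs_version_bump` is a hypothetical helper.

```python
def needs_version_bump(old, new, version_key="Version"):
    """Return True when the content changed but the version field did not.

    old, new    -- parsed event_def.json before and after the edit
    version_key -- name of the version field (assumed, see lead-in)
    """
    # Compare everything except the version field itself.
    old_body = {k: v for k, v in old.items() if k != version_key}
    new_body = {k: v for k, v in new.items() if k != version_key}
    content_changed = old_body != new_body
    version_changed = old.get(version_key) != new.get(version_key)
    return content_changed and not version_changed


old = {"Version": "1.00", "EventDefinition": [{"EventCode": "0x800000B1"}]}
edited = {"Version": "1.00", "EventDefinition": [{"EventCode": "0x08000027"}]}
bumped = dict(edited, Version="1.01")

print(needs_version_bump(old, edited))  # prints True
print(needs_version_bump(old, bumped))  # prints False
```

In practice the two dictionaries would come from `json.load` on the committed file and the working copy.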