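[Reviewed] Resource collaboration interfaces, Redfish interfaces, and IPMI commands supporting xPU frequency and voltage scaling for training scenarios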

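Background

Real-time monitoring of resource utilization on AI servers during several large-model training runs showed that, in the long steady-state training phase, the NPUs stay under high load while the CPUs run at low load. The BMC therefore needs to provide northbound interfaces, including Redfish, so that an upper-layer network management system (NMS) or the customer can set and query the energy-saving mode and energy-saving parameters. Based on the mode and parameters, the BMC passes frequency-scaling parameters to BMA/NPM to save energy during large-model training.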

Related issue

[Requirement]: Support xPU frequency and voltage scaling for training scenarios - openUBMC/mdb_interface - GitCode

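Review points

1. New resource collaboration interface properties indicating whether scenario-based energy saving is supported and the energy-saving mode
2. New resource collaboration interface method for getting the scenario-based energy-saving enablement status
3. New Redfish resource for querying the scenario-based energy-saving mode
4. New Redfish resource for setting the scenario-based energy-saving mode
5. New Redfish resource for querying the scenario-based energy-saving enablement status
6. New IPMI command for querying OS energy-efficiency information
7. New IPMI command for setting OS energy-efficiency information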

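Detailed description

1. New resource collaboration interface properties IsPowerModeSupported and PowerMode

Resource path: /bmc/kepler/Chassis/:ChassisId/EnergySavingScene (existing)
Resource interface: bmc.kepler.Chassis.EnergySavingScene (existing)
Change type: new property
Use case: the upper-layer NMS or customer sets/queries the energy-saving mode
Details: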

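| Property | Change type | Signature | Access & privilege | Persistence | Change notification | Property source | Description | Constraints |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IsPowerModeSupported | New property | b | Read: ReadOnly | Not persisted | false | CSR | Whether the system currently supports scenario-based energy saving. false: not supported (default); true: supported | |
| PowerMode | New property | s | Write: PowerMgmt; Read: ReadOnly | Persisted across power loss | false | User setting | The current energy-saving mode of the system. Values: BalancedPerformance (default), OSControlled, EfficiencyFavorPerformance, EfficiencyFavorPower, MaximumPerformance, PowerSaving, Static, OEM | |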
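
A caller may want to check a PowerMode value before writing it. The following is a minimal sketch, assuming the eight values listed for PowerMode are the complete accepted set. The names `POWER_MODES`, `DEFAULT_POWER_MODE`, and `validate_power_mode` are hypothetical and are not part of the interface.

```python
# Hypothetical helper: these names are illustrative, not part of the interface.
POWER_MODES = frozenset({
    "BalancedPerformance",
    "OSControlled",
    "EfficiencyFavorPerformance",
    "EfficiencyFavorPower",
    "MaximumPerformance",
    "PowerSaving",
    "Static",
    "OEM",
})

# The property description names BalancedPerformance as the default.
DEFAULT_POWER_MODE = "BalancedPerformance"


def validate_power_mode(value: str) -> str:
    """Return value unchanged if it is a listed mode, otherwise raise ValueError."""
    if value not in POWER_MODES:
        raise ValueError(f"unsupported PowerMode: {value!r}")
    return value
```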

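2. New resource collaboration interface method GetEnergySavingStatus

Resource path: /bmc/kepler/Chassis/:ChassisId/EnergySavingScene (existing)
Resource interface: bmc.kepler.Chassis.EnergySavingScene (existing)
Change type: new method
Use case: the upper-layer NMS or customer queries the scenario-based energy-saving enablement status
Details: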

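| Method | Change type | Request signature | Request parameters | Response signature | Response parameters | Access privilege | Description | Constraints |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GetEnergySavingStatus | New method | NA | NA | s | EnergySavingStatus | ReadOnly | Gets the scenario-based energy-saving enablement status. EnergySavingStatus values: Activated (in effect), Inactivated (not in effect), Unknown (unknown) | |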
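
On the caller side, the three status strings can be modeled as an enumeration. This is only a sketch: the class and function names are hypothetical, and mapping an unrecognized string to Unknown is an assumption that the review does not state.

```python
from enum import Enum


class EnergySavingStatus(Enum):
    """The three values returned by GetEnergySavingStatus."""
    ACTIVATED = "Activated"
    INACTIVATED = "Inactivated"
    UNKNOWN = "Unknown"


def parse_energy_saving_status(raw: str) -> EnergySavingStatus:
    """Map the returned string to the enum.

    Assumption: any string outside the documented set is treated as UNKNOWN.
    """
    try:
        return EnergySavingStatus(raw)
    except ValueError:
        return EnergySavingStatus.UNKNOWN
```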

3. New Redfish resource for querying the scenario-based energy-saving mode (standard resource)

uri:https://device_ip/redfish/v1/Systems/{system_id}
Change type: new property
Operation type: GET
Use case: the upper-layer NMS or customer queries the energy-saving mode
Details:

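| Property | Type | Description | Values | Default | Privilege | Changes frequently and needs change events suppressed | Constraints |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PowerMode | string, null | The current energy-saving mode of the system | BalancedPerformance (default), OSControlled, EfficiencyFavorPerformance, EfficiencyFavorPower, MaximumPerformance, PowerSaving, Static, OEM | BalancedPerformance when scenario-based energy saving is supported; null when it is not supported | ReadOnly | | |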

Schema description
Standard resource; the schema must be upgraded to v1_22_0.
redfish.dmtf.org/schemas/v1/ComputerSystem.v1_22_0.json

"PowerMode": {
    "anyOf": [
        {
            "$ref": "#/definitions/PowerMode"
        },
        {
            "type": "null"
        }
    ],
    "description": "The power mode setting of the computer system.",
    "longDescription": "This property shall contain the computer system power mode setting.",
    "readonly": false,
    "versionAdded": "v1_15_0"
}
"PowerMode": {
    "enum": [
        "MaximumPerformance",
        "BalancedPerformance",
        "PowerSaving",
        "Static",
        "OSControlled",
        "OEM",
        "EfficiencyFavorPower",
        "EfficiencyFavorPerformance"
    ],
    "enumDescriptions": {
        "BalancedPerformance": "The system performs at the highest speeds while utilization is high and performs at reduced speeds when the utilization is low.",
        "EfficiencyFavorPerformance": "The system performs at reduced speeds at all utilizations to save power while attempting to maintain performance.  This mode differs from `EfficiencyFavorPower` in that more performance is retained but less power is saved.",
        "EfficiencyFavorPower": "The system performs at reduced speeds at all utilizations to save power at the cost of performance.  This mode differs from `PowerSaving` in that more performance is retained and less power is saved.  This mode differs from `EfficiencyFavorPerformance` in that less performance is retained but more power is saved.",
        "MaximumPerformance": "The system performs at the highest speeds possible.",
        "OEM": "The system power mode is OEM-defined.",
        "OSControlled": "The system power mode is controlled by the operating system.",
        "PowerSaving": "The system performs at reduced speeds to save power.",
        "Static": "The system power mode is static."
    },
    "enumLongDescriptions": {
        "BalancedPerformance": "This value shall indicate the system performs at the highest speeds possible when the utilization is high and performs at reduced speeds when the utilization is low to save power.  This mode is a compromise between `MaximumPerformance` and `PowerSaving`.",
        "EfficiencyFavorPerformance": "This value shall indicate the system performs at reduced speeds at all utilizations to save power while attempting to maintain performance.  This mode differs from `EfficiencyFavorPower` in that more performance is retained but less power is saved. This mode differs from 'MaximumPerformance' in that power is saved at the cost of some performance.  This mode differs from 'BalancedPerformance' in that power saving occurs at all utilizations.",
        "EfficiencyFavorPower": "This value shall indicate the system performs at reduced speeds at all utilizations to save power at the cost of performance.  This mode differs from `PowerSaving` in that more performance is retained and less power is saved.  This mode differs from `EfficiencyFavorPerformance` in that less performance is retained but more power is saved. This mode differs from 'BalancedPerformance' in that power saving occurs at all utilizations.",
        "MaximumPerformance": "This value shall indicate the system performs at the highest speeds possible.  This mode should be used when performance is the top priority.",
        "OEM": "This value shall indicate the system performs at an OEM-defined power mode.",
        "OSControlled": "This value shall indicate the system performs at an operating system-controlled power mode.",
        "PowerSaving": "This value shall indicate the system performs at reduced speeds to save power.  This mode should be used when power saving is the top priority.",
        "Static": "This value shall indicate the system performs at a static base speed."
    },
    "enumVersionAdded": {
        "EfficiencyFavorPerformance": "v1_22_0",
        "EfficiencyFavorPower": "v1_22_0"
    },
    "type": "string"
}

Example

Request headers: X-Auth-Token: auth_value
Request body: none
Response example:
{
    "@odata.context": "/redfish/v1/$metadata#ComputerSystem.ComputerSystem",
    "@odata.id":"/redfish/v1/Systems/1",
    "@odata.type":"#ComputerSystem.v1_22_0.ComputerSystem",
    "Id":"1",
    "Name": "ComputerSystem",
    ...
    "PowerMode": "BalancedPerformance",
    ...
}
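
A client-side query along these lines can be sketched with the Python standard library. The function names are hypothetical, and `device_ip`, `system_id`, and the token are placeholders as in the URI above. The request is only constructed here, not sent.

```python
import json
import urllib.request
from typing import Optional


def build_get_request(device_ip: str, system_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for the ComputerSystem resource."""
    url = f"https://{device_ip}/redfish/v1/Systems/{system_id}"
    return urllib.request.Request(url, headers={"X-Auth-Token": token}, method="GET")


def extract_power_mode(body: str) -> Optional[str]:
    """Return PowerMode from a response body.

    None means the property is null (scenario-based energy saving not
    supported) or absent from the body.
    """
    return json.loads(body).get("PowerMode")
```

Sending the request would use `urllib.request.urlopen` with a TLS context suitable for the BMC certificate, which is outside the scope of this sketch.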

4. New Redfish resource for setting the scenario-based energy-saving mode (standard resource)

uri:https://device_ip/redfish/v1/Systems/{system_id}
Change type: new property
Operation type: PATCH
Use case: the upper-layer NMS or customer sets the energy-saving mode
Details:

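| Property | Type | Description | Values | Default | Privilege | Changes frequently and needs change events suppressed | Constraints |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PowerMode | string, null | The current energy-saving mode of the system | BalancedPerformance (default), OSControlled, EfficiencyFavorPerformance, EfficiencyFavorPower, MaximumPerformance, PowerSaving, Static, OEM | BalancedPerformance when scenario-based energy saving is supported; null when it is not supported | PowerMgmt | | |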

Schema description
Standard resource; the schema must be upgraded to v1_22_0.
redfish.dmtf.org/schemas/v1/ComputerSystem.v1_22_0.json

"PowerMode": {
    "anyOf": [
        {
            "$ref": "#/definitions/PowerMode"
        },
        {
            "type": "null"
        }
    ],
    "description": "The power mode setting of the computer system.",
    "longDescription": "This property shall contain the computer system power mode setting.",
    "readonly": false,
    "versionAdded": "v1_15_0"
}
"PowerMode": {
    "enum": [
        "MaximumPerformance",
        "BalancedPerformance",
        "PowerSaving",
        "Static",
        "OSControlled",
        "OEM",
        "EfficiencyFavorPower",
        "EfficiencyFavorPerformance"
    ],
    "enumDescriptions": {
        "BalancedPerformance": "The system performs at the highest speeds while utilization is high and performs at reduced speeds when the utilization is low.",
        "EfficiencyFavorPerformance": "The system performs at reduced speeds at all utilizations to save power while attempting to maintain performance.  This mode differs from `EfficiencyFavorPower` in that more performance is retained but less power is saved.",
        "EfficiencyFavorPower": "The system performs at reduced speeds at all utilizations to save power at the cost of performance.  This mode differs from `PowerSaving` in that more performance is retained and less power is saved.  This mode differs from `EfficiencyFavorPerformance` in that less performance is retained but more power is saved.",
        "MaximumPerformance": "The system performs at the highest speeds possible.",
        "OEM": "The system power mode is OEM-defined.",
        "OSControlled": "The system power mode is controlled by the operating system.",
        "PowerSaving": "The system performs at reduced speeds to save power.",
        "Static": "The system power mode is static."
    },
    "enumLongDescriptions": {
        "BalancedPerformance": "This value shall indicate the system performs at the highest speeds possible when the utilization is high and performs at reduced speeds when the utilization is low to save power.  This mode is a compromise between `MaximumPerformance` and `PowerSaving`.",
        "EfficiencyFavorPerformance": "This value shall indicate the system performs at reduced speeds at all utilizations to save power while attempting to maintain performance.  This mode differs from `EfficiencyFavorPower` in that more performance is retained but less power is saved. This mode differs from 'MaximumPerformance' in that power is saved at the cost of some performance.  This mode differs from 'BalancedPerformance' in that power saving occurs at all utilizations.",
        "EfficiencyFavorPower": "This value shall indicate the system performs at reduced speeds at all utilizations to save power at the cost of performance.  This mode differs from `PowerSaving` in that more performance is retained and less power is saved.  This mode differs from `EfficiencyFavorPerformance` in that less performance is retained but more power is saved. This mode differs from 'BalancedPerformance' in that power saving occurs at all utilizations.",
        "MaximumPerformance": "This value shall indicate the system performs at the highest speeds possible.  This mode should be used when performance is the top priority.",
        "OEM": "This value shall indicate the system performs at an OEM-defined power mode.",
        "OSControlled": "This value shall indicate the system performs at an operating system-controlled power mode.",
        "PowerSaving": "This value shall indicate the system performs at reduced speeds to save power.  This mode should be used when power saving is the top priority.",
        "Static": "This value shall indicate the system performs at a static base speed."
    },
    "enumVersionAdded": {
        "EfficiencyFavorPerformance": "v1_22_0",
        "EfficiencyFavorPower": "v1_22_0"
    },
    "type": "string"
}

Example

Request headers:
X-Auth-Token: auth_value
Content-Type: header_type
If-Match: ifmatch_value
Request body:
{
    "PowerMode": "BalancedPerformance"
}
Response example:
{
    "@odata.context": "/redfish/v1/$metadata#ComputerSystem.ComputerSystem",
    "@odata.id":"/redfish/v1/Systems/1",
    "@odata.type":"#ComputerSystem.v1_22_0.ComputerSystem",
    "Id":"1",
    "Name": "ComputerSystem",
    ...
    "PowerMode": "BalancedPerformance",
    ...
}
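
The PATCH request above could be assembled as follows. This is a sketch with a hypothetical function name; it assumes `application/json` for the `Content-Type` placeholder and that the `If-Match` value is an ETag taken from a prior GET of the same resource. The request is built but not sent.

```python
import json
import urllib.request


def build_patch_request(device_ip: str, system_id: str, token: str,
                        etag: str, power_mode: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that sets PowerMode."""
    url = f"https://{device_ip}/redfish/v1/Systems/{system_id}"
    body = json.dumps({"PowerMode": power_mode}).encode("utf-8")
    headers = {
        "X-Auth-Token": token,
        "Content-Type": "application/json",  # assumed value for header_type
        "If-Match": etag,                    # assumed to come from a prior GET
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PATCH")
```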

5. New Redfish resource for querying the scenario-based energy-saving enablement status (OEM resource)

uri:https://device_ip/redfish/v1/Managers/{manager_id}/EnergySavingService
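Change type: new property
Operation type: GET
Use case: the upper-layer NMS or customer queries the scenario-based energy-saving enablement status
Details: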

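| Property | Type | Description | Values | Default | Privilege | Changes frequently and needs change events suppressed | Constraints |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EnergySavingStatus | string, null | The current scenario-based energy-saving enablement status of the system | Activated (in effect), Inactivated (not in effect), Unknown (unknown) | When scenario-based energy saving is supported, the status obtained in real time; null when it is not supported | ReadOnly | | |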

Schema description

"EnergySavingStatus": {
    "type": [
        "string",
        "null"
    ],
    "readonly": true,
    "description": "The energy saving status of the computer system.",
    "longDescription": "This property shall contain the computer system energy saving status.",
    "enum": [
        "Activated",
        "Inactivated",
        "Unknown"
    ],
    "enumDescriptions": {
        "Activated": "The system energy saving status is activated.",
        "Inactivated": "The system energy saving status is inactivated.",
        "Unknown": "The system energy saving status is unknown."
    },
    "enumLongDescriptions": {
        "Activated": "This value indicates that the energy-saving system is present and activated.",
        "Inactivated": "This value indicates that the energy-saving system is present but not activated.",
        "Unknown": "This value indicates that an energy-saving system is present, but its activation status cannot be obtained at the moment."
    }
}

Example

Request headers: X-Auth-Token: auth_value
Request body: none
Response example:
{
    "@odata.context": "/redfish/v1/$metadata#EnergySavingService.EnergySavingService",
    "@odata.id": "/redfish/v1/Managers/1/EnergySavingService",
    "@odata.type": "#EnergySavingService.v1_0_0.EnergySavingService",
    "Description": "EnergySavingService Settings",
    "Id": "EnergySavingService",
    "Name": "EnergySavingService",
    "DynamicEnergySavingScene": "Default",
    "EnergySavingStatus": "Unknown",
    "Actions": {
        "#EnergySavingService.SetScene": {
            "target": "/redfish/v1/Managers/1/EnergySavingService/Actions/EnergySavingService.SetScene",
            "@Redfish.ActionInfo": "/redfish/v1/Managers/1/EnergySavingService/SetSceneActionInfo"
        }
    }
}
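
Because EnergySavingStatus is null when scenario-based energy saving is not supported, a client has to distinguish "not supported" from the three status strings. A minimal sketch of that check follows; the function name is hypothetical.

```python
import json
from typing import Optional, Tuple


def read_energy_saving_status(body: str) -> Tuple[bool, Optional[str]]:
    """Return (supported, status) from an EnergySavingService response body.

    A null (or missing) EnergySavingStatus is reported as not supported.
    """
    status = json.loads(body).get("EnergySavingStatus")
    if status is None:
        return (False, None)
    return (True, status)
```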

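6. New IPMI command for querying OS energy-efficiency information (called by an OS process)

IPMI command: netfn 30h, cmd 92h
Change type: new parameter value
Use case: called by an OS process to query energy-efficiency information
Operation type: GET
Privilege: ReadOnly
Details:
Request parameters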

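| Byte | Name | Description |
| --- | --- | --- |
| 1:3 | Manufacturer ID | 0x0007db, LSB first |
| 4 | Sub Command | Subcommand, 68h |
| 5 | Requester Identifier | 01h: Node Power Management; 02h: System Management Software |
| 6 | Parameter Selector | 01h: OS energy-efficiency configuration; 02h: OS energy-saving status |
| 7:8 | Offset | Offset of the data to read, relative to the start position, LSB first |
| 9 | Length | Length of the data to read |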

Response description

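| Byte | Name | Description |
| --- | --- | --- |
| 1 | Completion Code | Completion code. 00h: Command Completed Normally; D5h: Cannot execute command |
| 2:4 | Manufacturer ID | 0x0007db, LSB first |
| 5 | Flag | Completion flag. 00h: not finished; 01h: finished |
| 6 | Data Format | Data type. 01h: json; 02h: xml; 03h: tlv |
| 7 | Length | Length of the returned data |
| 8:N | Data | Returned data |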

Example
Request
ipmitool raw 0x30 0x92 0xdb 0x07 0x00 0x68 0x01 0x01 0x00 0x00 0x17
Response
00 db 07 00 01 01 17 7b 22 43 50 55 46 72 65 71 75 65 6e 63 79 4c 65 76 65 6c 22 3a 34 7d
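
The request and response layouts above can be exercised with a small encoder and decoder. This is a sketch with hypothetical function names; it covers only the bytes that follow netfn and cmd, and it assumes the response starts with the completion code, as in the example.

```python
MANUFACTURER_ID = bytes([0xDB, 0x07, 0x00])  # 0x0007db, LSB first
SUB_CMD_GET = 0x68


def build_get_request(requester: int, selector: int, offset: int, length: int) -> bytes:
    """Encode request bytes 1..9 (everything after netfn 30h, cmd 92h)."""
    return (MANUFACTURER_ID
            + bytes([SUB_CMD_GET, requester, selector])
            + offset.to_bytes(2, "little")
            + bytes([length]))


def parse_get_response(raw: bytes) -> dict:
    """Decode a response laid out as in the response table."""
    if raw[0] != 0x00:
        raise RuntimeError(f"completion code {raw[0]:#04x}")
    if raw[1:4] != MANUFACTURER_ID:
        raise ValueError("unexpected manufacturer ID")
    length = raw[6]
    return {
        "finished": raw[4] == 0x01,   # Flag byte
        "data_format": raw[5],        # 01h json, 02h xml, 03h tlv
        "data": raw[7:7 + length],
    }
```

Applied to the example response, the data field holds 23 bytes (17h) of JSON, which decode to `{"CPUFrequencyLevel":4}`.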

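7. New IPMI command for setting OS energy-efficiency information (called by an OS process)

IPMI command: netfn 30h, cmd 92h
Change type: new parameter value
Use case: called by an OS process to set energy-efficiency information
Operation type: SET
Privilege: PowerMgmt
Details:
Request parameters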

| Byte | Name | Description |
| --- | --- | --- |
| 1:3 | Manufacturer ID | 0x0007db, LSB first |
| 4 | Sub Command | Subcommand, 69h |
| 5 | Requester Identifier | 01h: Node Power Management; 02h: System Management Software |
| 6 | Parameter Selector | 01h: OS energy-efficiency configuration; 02h: OS energy-saving status |
| 7 | Data Format | Data type. 01h: json; 02h: xml; 03h: tlv |
| 8 | Operation | 00h: Write Prepare; 01h: Write Data; 03h: Write Finish |
| (9) | Data Checksum | For Write Data, the additive checksum of the data written in this request. For Write Finish, the additive checksum of the whole file. Optional; required only for Write Data and Write Finish. |
| (10:11) | Offset | For Write Prepare: file size, LSB first. For Write Data: offset to write, LSB first, relative to the start of the file. Optional; required only for Write Prepare and Write Data. |
| (12:n) | Data | Data to write. Optional; required only for Write Data. |

Response description

| Byte | Name | Description |
| --- | --- | --- |
| 1 | Completion Code | Completion code. 00h: Command Completed Normally; 80h: Checksum Failed; D5h: Cannot execute command |
| 2:4 | Manufacturer ID | 0x0007db, LSB first |

Example
Request
ipmitool raw 0x30 0x92 0xdb 0x07 0x00 0x69 0x01 0x02 0x01 0x01 0x01 0x01 0x00 0x01
Response
00 db 07 00
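
A Write Data request, the one operation in which every optional field is present, can be encoded as sketched below. The function names are hypothetical. The checksum is assumed to be the byte sum truncated to 8 bits, since the field is a single byte. Write Prepare and Write Finish are not covered, because the request table does not spell out how the byte positions shift when optional fields are omitted.

```python
MANUFACTURER_ID = bytes([0xDB, 0x07, 0x00])  # 0x0007db, LSB first
SUB_CMD_SET = 0x69
OP_WRITE_DATA = 0x01


def checksum8(data: bytes) -> int:
    """Additive checksum, assumed to be truncated to one byte."""
    return sum(data) & 0xFF


def build_write_data(requester: int, selector: int, data_format: int,
                     offset: int, data: bytes) -> bytes:
    """Encode a Write Data request (everything after netfn 30h, cmd 92h)."""
    header = bytes([SUB_CMD_SET, requester, selector, data_format,
                    OP_WRITE_DATA, checksum8(data)])
    return MANUFACTURER_ID + header + offset.to_bytes(2, "little") + data
```

With requester 01h, selector 02h, format 01h, offset 1, and the single data byte 01h, this reproduces the bytes of the example request.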

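Review conclusion

Decision 1: Approved. New resource collaboration interface properties are added to indicate whether scenario-based energy saving is supported and the current energy-saving mode, as follows: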

path: /bmc/kepler/Chassis/:ChassisId/EnergySavingScene
interface: bmc.kepler.Chassis.EnergySavingScene
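Change type: new property

| Property | Signature | Access & privilege | Persistence | Change notification | Property source |
| --- | --- | --- | --- | --- | --- |
| IsPowerModeSupported | b | Read-only. Read: ReadOnly | Not persisted | false | CSR |
| PowerMode | s | Read-write. Write: PowerMgmt; Read: ReadOnly | Persisted across power loss | false | User setting |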

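Decision 2: Approved. A new resource collaboration interface method is added to get the system's current scenario-based energy-saving enablement status, as follows: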

path: /bmc/kepler/Chassis/:ChassisId/EnergySavingScene
interface: bmc.kepler.Chassis.EnergySavingScene
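Change type: new method

| Method | Request signature | Response signature | Access privilege | Description |
| --- | --- | --- | --- | --- |
| GetEnergySavingStatus | NA | s | ReadOnly | Gets the scenario-based energy-saving enablement status |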

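Decisions 3 & 4: Approved. A new property is added to the Redfish interface to query and set the scenario-based energy-saving mode, as follows: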

uri:/redfish/v1/Systems/{SystemId}
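Operation type: GET, PATCH
Change type: new property

| Property | Type | Values | Privilege | Changes frequently and needs change events suppressed | Constraints |
| --- | --- | --- | --- | --- | --- |
| PowerMode | string, null | BalancedPerformance (default), OSControlled, EfficiencyFavorPerformance, EfficiencyFavorPower, MaximumPerformance, PowerSaving, Static, OEM | GET: ReadOnly; PATCH: PowerMgmt | | null when scenario-based energy saving is not supported |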

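Decision 5: Approved. A new OEM property is added to the Redfish interface to query the scenario-based energy-saving enablement status, as follows: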

uri:/redfish/v1/Managers/{ManagerId}/EnergySavingService
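Operation type: GET
Change type: new property

| Property | Type | Values | Privilege | Changes frequently and needs change events suppressed | Constraints |
| --- | --- | --- | --- | --- | --- |
| EnergySavingStatus | string, null | Activated (in effect), Inactivated (not in effect), Unknown (unknown) | ReadOnly | | |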

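Decision 6: Approved. A new IPMI command for querying OS energy-efficiency information is added (in-band channel only; see the review point description for the detailed definition).

| netfn | cmd | sub command | Operation type | Privilege |
| --- | --- | --- | --- | --- |
| 30h | 92h | 68h | GET | ReadOnly |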

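Decision 7: Approved. A new IPMI command for setting OS energy-efficiency information is added (in-band channel only; see the review point description for the detailed definition).

| netfn | cmd | sub command | Operation type | Privilege |
| --- | --- | --- | --- | --- |
| 30h | 92h | 69h | SET | PowerMgmt |

Open issues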