Mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)

Add doc for accelerator function (#1252)

* Add Math Evaluation with Judge Model Evaluator
* Fix Llama-3 meta template
* Fix MATH with JudgeLM Evaluation
* Update accelerator
* Update MathBench
* Add Doc for accelerator

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>

This commit is contained in:
parent 1fa62c4a42
commit e5ee1647fb

@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch as well as through deployed inference APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
- **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpus](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
- **\[2024.04.29\]** We report the performance of several well-known LLMs on common benchmarks; see the [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥

@@ -150,6 +151,12 @@ After ensuring that OpenCompass is installed correctly according to the above steps

python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```

Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:

```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl -a lmdeploy
```

OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).

```bash

@@ -69,6 +69,8 @@

## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2024.06.20\]** OpenCompass now supports one-click switching of inference acceleration backends, making the evaluation process more efficient. In addition to the default HuggingFace backend, the popular [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm) backends are now supported, either through a one-line command-line switch or by deploying an accelerated API service. See the [documentation](docs/zh_cn/advanced_guides/accelerator_intro.md) for detailed usage. Welcome to try it out! 🔥🔥🔥
- **\[2024.05.08\]** We added evaluation configs for the following four MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out!
- **\[2024.04.30\]** We supported evaluating a model's compression rate (Bits per Character) on a given [corpus](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Try the [llm-compression](configs/eval_llm_compression.py) evaluation config! 🔥🔥🔥
- **\[2024.04.26\]** We report the performance of typical LLMs on commonly used benchmarks; see the [documentation](https://opencompass.readthedocs.io/zh-cn/latest/user_guides/corebench.html) for more information! 🔥🔥🔥

@@ -151,6 +153,12 @@ unzip OpenCompassData-core-20240207.zip

python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```

In addition, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can use the command below. Before doing so, please make sure you have installed the package for the chosen backend and that your model supports accelerated inference with it; see the inference acceleration backend [documentation](docs/zh_cn/advanced_guides/accelerator_intro.md) for more details. Below is an example using LMDeploy:

```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl -a lmdeploy
```

OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations with the [tools](./docs/zh_cn/tools.md#ListConfigs).

```bash

docs/en/advanced_guides/accelerator_intro.md (new file, 140 lines)

@@ -0,0 +1,140 @@
# Accelerate Evaluation Inference with vLLM or LMDeploy

## Background

During the OpenCompass evaluation process, the Huggingface transformers library is used for inference by default. While this is a very general solution, there are scenarios where more efficient inference methods are needed to speed up the process, such as leveraging vLLM or LMDeploy.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit designed for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (e.g., GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy. Also, make sure you have installed vLLM or LMDeploy as described in their official documentation. The installation commands below are provided for reference:

### LMDeploy Installation Method

Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):

```bash
pip install lmdeploy
```

### vLLM Installation Method

Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

## Accelerated Evaluation Using vLLM or LMDeploy

### Method 1: Using Command-Line Parameters to Change the Inference Backend

OpenCompass offers one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is example code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```

Here, `hf_llama3_8b_instruct` specifies the original Huggingface model configuration, as shown below:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
    )
]
```

To evaluate the GSM8k dataset using the default Huggingface version of the llama3-8b-instruct model, use:

```bash
python run.py config/eval_gsm8k.py
```

To accelerate the evaluation using vLLM or LMDeploy, you can use the following command instead:

```bash
python run.py config/eval_gsm8k.py -a vllm
```

or

```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation by deploying a vLLM or LMDeploy inference acceleration service API. Follow these steps:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the inference acceleration service API for vLLM or LMDeploy. Below is an example for LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters for starting the api_server can be listed with `lmdeploy serve api_server -h`, for example `--tp` for tensor parallelism, `--session-len` for the maximum context window length, and `--cache-max-entry-count` for adjusting the k/v cache memory usage ratio.
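
For reference, a launch command combining these options might look like the sketch below; the specific values (tensor-parallel size, context length, cache ratio) are illustrative assumptions and should be tuned for your hardware:

```bash
# Sketch only: serve the model on 2 GPUs with an 8192-token context window and
# ~50% of free GPU memory reserved for the k/v cache (all values are assumptions).
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --model-name Meta-Llama-3-8B-Instruct \
    --server-port 23333 \
    --tp 2 \
    --session-len 8192 \
    --cache-max-entry-count 0.5
```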

3. Once the service is successfully deployed, modify the evaluation script by changing the model configuration path to the service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used when requesting the service
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
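
Assuming the model configuration above is combined with a dataset list (as in the `eval_gsm8k.py` example) and saved under a hypothetical filename such as `config/eval_llama3_lmdeploy_api.py`, the evaluation is then launched the same way as before:

```bash
# Hypothetical config filename; substitute the path where you saved the API-based model config.
python run.py config/eval_llama3_lmdeploy_api.py
```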

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when using vLLM or LMDeploy on a single A800 GPU to evaluate the Llama-3-8B-Instruct model on the GSM8k dataset:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface       | 74.22    | 24:26                            | 1.0                               |
| LMDeploy          | 73.69    | 11:15                            | 2.2                               |
| vLLM              | 72.63    | 07:52                            | 3.1                               |

@@ -71,6 +71,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.

   advanced_guides/circular_eval.md
   advanced_guides/contamination_eval.md
   advanced_guides/needleinahaystack_eval.md
   advanced_guides/accelerator_intro.md

.. _Tools:

.. toctree::

docs/zh_cn/advanced_guides/accelerator_intro.md (new file, 140 lines)

@@ -0,0 +1,140 @@
# Accelerate Evaluation Inference with vLLM or LMDeploy

## Background

During OpenCompass evaluation, the Huggingface transformers library is used for inference by default. This is a very general solution, but in some cases more efficient inference methods are needed to speed up the process, for example with the help of vLLM or LMDeploy.

- [LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit for compressing, deploying, and serving large language models (LLMs), developed by the [MMRazor](https://github.com/open-mmlab/mmrazor) and [MMDeploy](https://github.com/open-mmlab/mmdeploy) teams.
- [vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use LLM inference and serving library, featuring advanced serving throughput, efficient PagedAttention memory management, continuous batching of requests, fast model execution via CUDA/HIP graphs, quantization techniques (such as GPTQ, AWQ, SqueezeLLM, FP8 KV Cache), and optimized CUDA kernels.

## Preparation for Acceleration

First, check whether the model you want to evaluate supports inference acceleration with vLLM or LMDeploy. Also, make sure you have installed vLLM or LMDeploy; refer to their official documentation for details. The installation commands below are provided for reference:

### LMDeploy Installation

Install LMDeploy using pip (Python 3.8+) or from [source](https://github.com/InternLM/lmdeploy/blob/main/docs/en/build.md):

```bash
pip install lmdeploy
```

### vLLM Installation

Install vLLM using pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

## Using vLLM or LMDeploy During Evaluation

### Method 1: Switching the Inference Backend via Command-Line Parameters

OpenCompass provides one-click evaluation acceleration: during evaluation, it can automatically convert Huggingface transformers models into vLLM or LMDeploy models. Below is example code for evaluating the GSM8k dataset with the default Huggingface version of the llama3-8b-instruct model:

```python
# eval_gsm8k.py
from mmengine.config import read_base

with read_base():
    # Select a dataset list
    from .datasets.gsm8k.gsm8k_0shot_gen_a58960 import gsm8k_datasets as datasets
    # Select a model of interest
    from ..models.hf_llama.hf_llama3_8b_instruct import models
```

Here, `hf_llama3_8b_instruct` is the original Huggingface model configuration, shown below:

```python
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='llama-3-8b-instruct-hf',
        path='meta-llama/Meta-Llama-3-8B-Instruct',
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
        stop_words=['<|end_of_text|>', '<|eot_id|>'],
    )
]
```

To evaluate the GSM8k dataset with the default Huggingface version of the Llama3-8b-instruct model, run:

```bash
python run.py config/eval_gsm8k.py
```

If you want to use vLLM or LMDeploy to accelerate the evaluation, you can use the commands below:

```bash
python run.py config/eval_gsm8k.py -a vllm
```

or

```bash
python run.py config/eval_gsm8k.py -a lmdeploy
```

### Method 2: Accelerating Evaluation via a Deployed Inference Acceleration Service API

OpenCompass also supports accelerating evaluation through a deployed vLLM or LMDeploy inference acceleration service API. The reference steps are as follows:

1. Install the openai package:

```bash
pip install openai
```

2. Deploy the vLLM or LMDeploy inference acceleration service API; refer to their official documentation for deployment details. Below is an example using LMDeploy:

```bash
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct --model-name Meta-Llama-3-8B-Instruct --server-port 23333
```

The parameters available when starting the api_server can be listed with `lmdeploy serve api_server -h`, for example `--tp` for tensor parallelism, `--session-len` for the maximum context window length, and `--cache-max-entry-count` for adjusting the k/v cache memory usage ratio.
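
As a reference, a launch command combining these options might look like the sketch below; the tensor-parallel size, context length, and cache ratio are illustrative assumptions and should be adjusted to your hardware:

```bash
# Sketch only: 2-GPU tensor parallelism, 8192-token context window, and ~50% of
# free GPU memory reserved for the k/v cache (all values are assumptions).
lmdeploy serve api_server meta-llama/Meta-Llama-3-8B-Instruct \
    --model-name Meta-Llama-3-8B-Instruct \
    --server-port 23333 \
    --tp 2 \
    --session-len 8192 \
    --cache-max-entry-count 0.5
```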

3. Once the service is deployed successfully, modify the evaluation script by changing the path in the model configuration to the deployed service address, as shown below:

```python
from opencompass.models import OpenAI

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='Meta-Llama-3-8B-Instruct-LMDeploy-API',
        type=OpenAI,
        openai_api_base='http://0.0.0.0:23333/v1',  # Service address
        path='Meta-Llama-3-8B-Instruct',  # Model name used when requesting the service
        rpm_verbose=True,  # Whether to print the request rate
        meta_template=api_meta_template,  # Service request template
        query_per_second=1,  # Service request rate
        max_out_len=1024,  # Maximum output length
        max_seq_len=4096,  # Maximum input length
        temperature=0.01,  # Generation temperature
        batch_size=8,  # Batch size
        retry=3,  # Number of retries
    )
]
```
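
Assuming the model configuration above is combined with a dataset list and saved under a hypothetical filename such as `config/eval_llama3_lmdeploy_api.py`, the evaluation can then be launched as before:

```bash
# Hypothetical config filename; substitute your own path.
python run.py config/eval_llama3_lmdeploy_api.py
```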

## Acceleration Effect and Performance Comparison

Below is a comparison of the acceleration effect and performance when using vLLM or LMDeploy on a single A800 GPU to evaluate the Llama-3-8B-Instruct model on the GSM8k dataset:

| Inference Backend | Accuracy | Inference Time (minutes:seconds) | Speedup (relative to Huggingface) |
| ----------------- | -------- | -------------------------------- | --------------------------------- |
| Huggingface       | 74.22    | 24:26                            | 1.0                               |
| LMDeploy          | 73.69    | 11:15                            | 2.2                               |
| vLLM              | 72.63    | 07:52                            | 3.1                               |

@@ -72,6 +72,7 @@ OpenCompass 上手路线

   advanced_guides/contamination_eval.md
   advanced_guides/compassbench_intro.md
   advanced_guides/needleinahaystack_eval.md
   advanced_guides/accelerator_intro.md

.. _工具:

.. toctree::