[Docs] update code evaluator docs (#354)

* [Docs] update code evaluator docs

* minor fix

* minor fix
Hubert 2023-09-06 17:52:22 +08:00 committed by GitHub
parent 880b34e759
commit 2c71b0f6f3
2 changed files with 212 additions and 31 deletions

View File

@@ -1,20 +1,19 @@
# Code Evaluation Tutorial
To evaluate the code capabilities of LLMs, we need to set up an independent evaluation environment so that erroneous generated code is never executed in the development environment, where it could cause irreversible damage. The code evaluation service currently used in OpenCompass is based on the [code-evaluator](https://github.com/open-compass/code-evaluator.git) project, which already supports evaluating the multi-programming-language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). The following tutorials describe how to run code evaluation under different setups.
## Launching the Code Evaluation Service
1. Ensure you have installed Docker; refer to the [Docker installation document](https://docs.docker.com/engine/install/).
2. Pull the source code of the code evaluation service and build the Docker image:
```shell
git clone https://github.com/open-compass/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```
3. Create a container with the following commands:
```shell
# Log output format
@@ -27,22 +26,22 @@ sudo docker run -it -p 5000:5000 code-eval:latest python server.py
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```
4. To ensure you can access the service, use the following commands to check the connection between the inference environment and the evaluation service. (Skip this step if inference and code evaluation run on the same host; a Python version of the check is sketched after this list.)
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
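The `telnet` check can also be done from Python on the inference node, which is convenient when `telnet` is not installed. A minimal sketch, using the same placeholder address and port as in the commands above:
```python
import socket

# Placeholders, as in the commands above; replace with your service address and port.
SERVICE_IP = 'your_service_ip_address'
SERVICE_PORT = 5000

# Equivalent of the telnet check: try to open a TCP connection to the service.
try:
    with socket.create_connection((SERVICE_IP, SERVICE_PORT), timeout=5):
        print('code evaluation service is reachable')
except OSError as err:
    print(f'cannot reach the code evaluation service: {err}')
```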
## Local Code Evaluation
When the model inference and code evaluation services run on the same host, or within the same local area network, code inference and evaluation can be performed directly.
### Configuration File
We provide a reference [configuration file](https://github.com/InternLM/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating `humanevalx` on `codegeex2`.
The dataset and related post-processing configuration files can be found at this [link](https://github.com/InternLM/opencompass/tree/main/configs/datasets/humanevalx); pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
@@ -66,7 +65,7 @@ humanevalx_eval_cfg_dict = {
        type=HumanevalXEvaluator,
        language=lang,
        ip_address="localhost",  # replace to your code_eval_server ip_address, port
        port=5000),  # refer to https://github.com/open-compass/code-evaluator to launch a server
    pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # do not support rust now
}
@@ -83,3 +82,96 @@ humanevalx_datasets = [
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
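If the code evaluation service runs on another machine in the same network, only the `ip_address` and `port` passed to the evaluator need to change. Below is a minimal sketch of the same dict-comprehension pattern; the address values are placeholders, and it assumes `HumanevalXEvaluator` is importable from `opencompass.datasets` as in the linked dataset configs.
```python
from opencompass.datasets import HumanevalXEvaluator  # assumed import path, as in the linked configs

# Placeholder address of your code_eval_server; replace with the real host and port.
CODE_EVAL_IP = '192.168.0.42'
CODE_EVAL_PORT = 5000

# One evaluation config per language, all pointing at the same evaluation service.
humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address=CODE_EVAL_IP,
            port=CODE_EVAL_PORT),
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}
```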
### Task Launch
Refer to the [Quick Start](../get_started.html)
## Remote Code Evaluation
When the model inference and code evaluation services run on different machines that cannot reach each other directly, model inference must be run first and the results collected before code evaluation. The configuration file and inference process from the previous tutorial can be reused.
### Collect Inference Results
OpenCompass provides the script `tools/collect_code_preds.py` to post-process and collect inference results. Pass it the configuration file used to launch the task and specify the working directory of that task with `-r`, which behaves the same as the `-r` option of `run.py`; see the [documentation](https://opencompass.readthedocs.io/en/latest/get_started.html#launch-evaluation) for more details.
```shell
python tools/collect_code_preds.py [config] [-r latest]
```
The collected results will be organized as follows under the working directory specified by `-r`:
```
workdir/humanevalx
├── codegeex2-6b
│   ├── humanevalx_cpp.json
│   ├── humanevalx_go.json
│   ├── humanevalx_java.json
│   ├── humanevalx_js.json
│   └── humanevalx_python.json
├── CodeLlama-13b
│   ├── ...
├── CodeLlama-13b-Instruct
│   ├── ...
├── CodeLlama-13b-Python
│   ├── ...
├── ...
```
### Code Evaluation
Make sure your code evaluation service is running, then use `curl` to submit a request:
```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
For example:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' localhost:5000/evaluate
```
Then we get:
```
"{\"pass@1\": 37.19512195121951%}"
```
Additionally, we offer an extra option named `with-prompt` (defaults to `True`). Some models (such as `WizardCoder`) generate complete code without requiring the prompt and prediction to be concatenated; in that case you can refer to the following command for evaluation:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```
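The same request can also be sent from a Python script instead of `curl`. Below is a minimal sketch, assuming the third-party `requests` package is installed; the file path and service address are the placeholder values from the examples above.
```python
import requests

# Placeholder values from the examples above; replace with your own.
result_path = './examples/humanevalx/python.json'
service_url = 'http://localhost:5000/evaluate'

with open(result_path, 'rb') as f:
    resp = requests.post(
        service_url,
        files={'file': f},                      # the collected prediction file
        data={'dataset': 'humanevalx/python'},  # dataset/language, as in the curl example
        headers={'with-prompt': 'False'},       # optional; omit to keep the default (True)
    )
print(resp.text)  # e.g. "{\"pass@1\": ...}"
```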
## Advanced Tutorial
Besides evaluating the supported `humanevalx` dataset, users may also need to:
### Support New Dataset
Please refer to the [tutorial on supporting new datasets](./new_dataset.md).
### Modify Post-Processing
1. For local evaluation, modify the post-processing method following the post-processing section of the tutorial on supporting new datasets.
2. For remote evaluation, modify the post-processing part in `tools/collect_code_preds.py`.
3. Parts of the post-processing can also be modified inside the code evaluation service; see the next section for details.
### Debugging Code Evaluation Service
When supporting a new dataset or modifying the post-processing, you may need to modify the original code evaluation service. Make changes as needed following these steps:
1. Remove the installation of `code-evaluator` from the `Dockerfile`, and instead mount `code-evaluator` when starting the container:
```shell
sudo docker run -it -p 5000:5000 -v /local/path/of/code-evaluator:/workspace/code-evaluator code-eval:latest bash
```
2. Install and start the code evaluation service locally. At this point, any necessary modifications can be made to the local copy of the `code-evaluator`.
```shell
cd code-evaluator && pip install -r requirements.txt
python server.py
```

View File

@@ -1,20 +1,19 @@
# Code Evaluation Tutorial
To evaluate the code capabilities of LLMs, we need to set up an independent evaluation environment so that erroneous generated code is never executed in the development environment, where it could cause irreversible damage. The code evaluation service currently used in OpenCompass is based on the [code-evaluator](https://github.com/open-compass/code-evaluator) project, which already supports evaluating the multi-programming-language dataset [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). The following tutorials describe how to run code evaluation under different setups.
## Launching the Code Evaluation Service
1. Ensure you have installed Docker; refer to the [Docker installation document](https://docs.docker.com/engine/install/).
2. Pull the source code of the code evaluation service and build the Docker image:
```shell
git clone https://github.com/open-compass/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```
3. Create a container with the following commands:
```shell
# Log output format
@@ -27,23 +26,21 @@ sudo docker run -it -p 5000:5000 code-eval:latest python server.py
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```
4. To ensure you can access the service, use the following commands to check the connection between the inference environment and the evaluation service. (Skip this step if inference and code evaluation run on the same host.)
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
## Local Code Evaluation
When the model inference and code evaluation services run on the same host, or within the same local area network, code inference and evaluation can be performed directly.
### Configuration File
We provide a reference [configuration file](https://github.com/InternLM/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating `humanevalx` on `codegeex2`.
The dataset and related post-processing configuration files can be found at this [link](https://github.com/InternLM/opencompass/tree/main/configs/datasets/humanevalx); pay attention to the `evaluator` field in `humanevalx_eval_cfg_dict`.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
@@ -67,7 +64,7 @@ humanevalx_eval_cfg_dict = {
        type=HumanevalXEvaluator,
        language=lang,
        ip_address="localhost",  # replace to your code_eval_server ip_address, port
        port=5000),  # refer to https://github.com/open-compass/code-evaluator to launch a server
    pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # do not support rust now
}
@@ -84,3 +81,95 @@ humanevalx_datasets = [
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
### Task Launch
Refer to the [Quick Start tutorial](../get_started.html).
## Remote Code Evaluation
When the model inference and code evaluation services run on different machines that cannot reach each other directly, model inference must be run first and the code inference results collected. The configuration file and inference process from the tutorial above can be reused.
### Collect Inference Results
OpenCompass provides the script `collect_code_preds.py` under `tools` to post-process and collect inference results: provide the configuration file used to launch the task and specify the working directory of the task to reuse, which is the same as the `-r` option of `run.py`; see the [documentation](https://opencompass.readthedocs.io/zh_CN/latest/get_started.html#id7) for details.
```shell
python tools/collect_code_preds.py [config] [-r latest]
```
The collected results will be organized as follows under the working directory specified by `-r`:
```
workdir/humanevalx
├── codegeex2-6b
│   ├── humanevalx_cpp.json
│   ├── humanevalx_go.json
│   ├── humanevalx_java.json
│   ├── humanevalx_js.json
│   └── humanevalx_python.json
├── CodeLlama-13b
│   ├── ...
├── CodeLlama-13b-Instruct
│   ├── ...
├── CodeLlama-13b-Python
│   ├── ...
├── ...
```
### Code Evaluation
Make sure the code evaluation service is running, then use `curl` to submit a request:
```shell
curl -X POST -F 'file=@{result_absolute_path}' -F 'dataset={dataset/language}' {your_service_ip_address}:{your_service_port}/evaluate
```
For example:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' localhost:5000/evaluate
```
Then we get:
```
"{\"pass@1\": 37.19512195121951%}"
```
We also provide an extra option named `with-prompt` (defaults to `True`). Some models (such as `WizardCoder`) generate complete code without requiring the prompt and prediction to be concatenated; in that case you can refer to the following command for evaluation:
```shell
curl -X POST -F 'file=@./examples/humanevalx/python.json' -F 'dataset=humanevalx/python' -H 'with-prompt: False' localhost:5000/evaluate
```
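When the working directory shown above contains many models and languages, each prediction file can be submitted in a loop instead of by hand. Below is a minimal batch-submission sketch, assuming the third-party `requests` package is installed; the working directory and service address are placeholders.
```python
from pathlib import Path

import requests

# Placeholder values; adjust to your own working directory and service address.
workdir = Path('workdir/humanevalx')
service_url = 'http://localhost:5000/evaluate'

# Walk every model folder and submit each humanevalx_<lang>.json to the service.
for pred_file in sorted(workdir.glob('*/humanevalx_*.json')):
    lang = pred_file.stem.split('_', 1)[1]  # e.g. "python" from "humanevalx_python.json"
    with open(pred_file, 'rb') as f:
        resp = requests.post(
            service_url,
            files={'file': f},
            data={'dataset': f'humanevalx/{lang}'},
        )
    print(pred_file.parent.name, lang, resp.text)  # model name, language, pass@1 result
```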
## Advanced Tutorial
Besides evaluating the supported `humanevalx` dataset, users may also have the following needs:
### Support New Dataset
Refer to the [tutorial on supporting new datasets](./new_dataset.md).
### Modify Post-Processing
1. For local evaluation, modify the post-processing method following the post-processing section of the tutorial on supporting new datasets.
2. For remote evaluation, modify the post-processing part in `tools/collect_code_preds.py`.
3. Parts of the post-processing can also be modified inside the code evaluation service; see the next section for details.
### Debugging the Code Evaluation Service
When supporting a new dataset or modifying the post-processing, you may need to modify the original code evaluation service. Make the following changes as needed:
1. Remove the installation of `code-evaluator` from the `Dockerfile`, and instead mount `code-evaluator` when starting the container:
```shell
sudo docker run -it -p 5000:5000 -v /local/path/of/code-evaluator:/workspace/code-evaluator code-eval:latest bash
```
2. Install the dependencies and start the code evaluation service. You can then modify the code in your local `code-evaluator` as needed for debugging:
```shell
cd code-evaluator && pip install -r requirements.txt
python server.py
```