
# Code Evaluation Service

OpenCompass supports evaluating code-generation datasets that cover multiple programming languages, such as [humaneval-x](https://huggingface.co/datasets/THUDM/humaneval-x). Before you begin, make sure the code evaluation service is running. See the [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) project for details on the service.

## Launching the Code Evaluation Service

Make sure you have installed Docker, then build an image and run a container service.

Build the Docker image:

```shell
git clone https://github.com/Ezra-Yu/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```

After building the image, create a container with one of the following commands:

```shell
# Run in the foreground and print logs to the terminal
sudo docker run -it -p 5000:5000 code-eval:latest python server.py

# Run the program in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py

# Use a different port
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```

Verify that the service is reachable with the following commands (skip this step if the service runs on the local host):

```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
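If `telnet` is not installed, the same reachability check can be sketched in Python with the standard library. This is only a minimal sketch: `is_service_reachable` is a hypothetical helper name, and the host and port shown are placeholders for your own service address. It confirms that a TCP connection can be opened, not that the evaluation service itself responds correctly.

```python
import socket


def is_service_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Placeholders: replace with your service address and port.
    print(is_service_reachable("localhost", 5000))
```

A `False` result usually means the port is not published, the container is not running, or a firewall sits between the computing node and the service.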

```note
If the computing nodes cannot connect to the evaluation service, you can still run `python run.py xxx...` directly. The generated code is saved in the `outputs` folder. After migrating those results to an environment where the evaluator is available, use [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) directly to get the results (in that case, the `eval_cfg` configuration described below can be ignored).
```

## Configuration File

We provide a [configuration file](https://github.com/InternLM/opencompass/blob/main/configs/eval_codegeex2.py) for evaluating humaneval-x on codegeex2.

The dataset and related post-processing configuration files can be found at this [link](https://github.com/InternLM/opencompass/tree/main/configs/datasets/humanevalx). Note the `evaluator` field in `humanevalx_eval_cfg_dict`.

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address="localhost",  # replace with the IP address of your code evaluation server
            port=5000),  # replace with its port; see https://github.com/Ezra-Yu/code-evaluator to launch a server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
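When the service runs on another machine or on a non-default port, only `ip_address` and `port` in each evaluator entry need to change. The sketch below factors that into a helper. Note that `build_eval_cfgs` is a hypothetical name and not part of OpenCompass, the address `10.0.0.5` is a placeholder, and the evaluator class is passed in as an argument so that the snippet runs without OpenCompass installed. In a real config you would pass `HumanevalXEvaluator` instead of the stand-in class.

```python
LANGUAGES = ['python', 'cpp', 'go', 'java', 'js']


def build_eval_cfgs(evaluator_type, ip_address='localhost', port=5000,
                    languages=LANGUAGES):
    """Build one eval_cfg per language, all pointing at the same service."""
    return {
        lang: dict(
            evaluator=dict(
                type=evaluator_type,
                language=lang,
                ip_address=ip_address,
                port=port),
            pred_role='BOT')
        for lang in languages
    }


class DummyEvaluator:
    """Stand-in for HumanevalXEvaluator so the sketch is self-contained."""


# Placeholder address and port for a remotely hosted evaluation service.
eval_cfgs = build_eval_cfgs(DummyEvaluator, ip_address='10.0.0.5', port=5001)
print(eval_cfgs['go']['evaluator']['port'])  # 5001
```

Keeping the address in one place avoids the per-language entries drifting apart when the service is moved.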