
# Code Evaluation Service
We support evaluating datasets of multiple programming languages, similar to humaneval-x. Before starting, make sure that you have launched the code evaluation service; refer to the [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) project for how to set it up.
## Launching the Code Evaluation Service
Make sure you have installed Docker, then build an image and run a container service.
Build the Docker image:
```shell
git clone https://github.com/Ezra-Yu/code-evaluator.git
cd code-evaluator/docker
sudo docker build -t code-eval:latest .
```
After obtaining the image, create a container with the following commands:
```shell
# Log output format
sudo docker run -it -p 5000:5000 code-eval:latest python server.py

# Run the program in the background
# sudo docker run -itd -p 5000:5000 code-eval:latest python server.py

# Using different ports
# sudo docker run -itd -p 5001:5001 code-eval:latest python server.py --port 5001
```
Use the following commands to verify that you can access the service (skip this step if the service is running on your local host):
```shell
ping your_service_ip_address
telnet your_service_ip_address your_service_port
```
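If you prefer to check reachability from Python instead of `ping`/`telnet`, the minimal sketch below performs the same TCP connection test. The `can_reach` helper is hypothetical (not part of OpenCompass or code-evaluator); replace the host and port with your service's address.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Expect True once the container is up and the port is reachable.
print(can_reach('localhost', 5000))
```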
If your computing nodes cannot connect to the evaluation service, you can still run `python run.py xxx...` directly; the generated code will be saved in the `outputs` folder. After migrating the outputs to a machine that can reach the service, run [code-evaluator](https://github.com/Ezra-Yu/code-evaluator) on them directly to get the results (in that case, the `eval_cfg` configuration described below can be ignored).
## Configuration File
We have provided a configuration file for evaluating humaneval-x with codegeex2. The dataset and related post-processing configuration files can be found at this link. Note the `evaluator` field in `humanevalx_eval_cfg_dict`.
```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HumanevalXDataset, HumanevalXEvaluator

humanevalx_reader_cfg = dict(
    input_columns=['prompt'], output_column='task_id', train_split='test')

humanevalx_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template='{prompt}'),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=1024))

humanevalx_eval_cfg_dict = {
    lang: dict(
        evaluator=dict(
            type=HumanevalXEvaluator,
            language=lang,
            ip_address='localhost',  # replace with your code_eval_server ip_address
            port=5000),  # refer to https://github.com/Ezra-Yu/code-evaluator to launch a server
        pred_role='BOT')
    for lang in ['python', 'cpp', 'go', 'java', 'js']  # rust is not supported yet
}

humanevalx_datasets = [
    dict(
        type=HumanevalXDataset,
        abbr=f'humanevalx-{lang}',
        language=lang,
        path='./data/humanevalx',
        reader_cfg=humanevalx_reader_cfg,
        infer_cfg=humanevalx_infer_cfg,
        eval_cfg=humanevalx_eval_cfg_dict[lang])
    for lang in ['python', 'cpp', 'go', 'java', 'js']
]
```
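To run the evaluation end to end, the dataset configs above are typically combined with a model config and launched through `run.py`. The sketch below follows the usual OpenCompass `read_base` pattern; the import paths and file names are hypothetical and should be replaced with the actual config locations in your repository.

```python
from mmengine.config import read_base

with read_base():
    # Hypothetical paths -- point these at the real dataset/model configs in your setup.
    from .datasets.humanevalx.humanevalx_gen import humanevalx_datasets
    from .models.hf_codegeex2_6b import models

# OpenCompass reads these top-level variables when launching the task.
datasets = humanevalx_datasets
```

Saved as, for example, `configs/eval_codegeex2.py` (a hypothetical name), the task can then be started with `python run.py configs/eval_codegeex2.py`.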