mirror of
https://github.com/open-compass/opencompass.git
synced 2025-05-30 16:03:24 +08:00
Merge branch 'main' of https://github.com/kangreen0210/opencompass
This commit is contained in:
commit
89bbf13f5a
@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>

- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models. Please check [Evaluating Reasoning Models](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py) leaderboard, which allows users to easily reproduce the official evaluation results by configuring it.
@ -57,6 +57,7 @@
|
||||
|
||||
## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
|
||||
|
||||
- **\[2025.02.28\]** 我们为 `DeepSeek-R1` 系列模型添加了教程,请查看 [评估推理模型](docs/en/user_guides/deepseek_r1.md) 了解更多详情!🔥🔥🔥
|
||||
- **\[2025.02.15\]** 我们新增了两个实用的评测工具:用于LLM作为评判器的`GenericLLMEvaluator`和用于数学推理评估的`MATHEvaluator`。查看[LLM评判器](docs/zh_cn/advanced_guides/llm_judge.md)和[数学能力评测](docs/zh_cn/advanced_guides/general_math.md)文档了解更多详情!🔥🔥🔥
|
||||
- **\[2025.01.16\]** 我们现已支持 [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) 模型,该模型在推理、知识类任务上取得同量级最优性能,欢迎尝试。
|
||||
- **\[2024.12.17\]** 我们提供了12月CompassAcademic学术榜单评估脚本 [CompassAcademic](configs/eval_academic_leaderboard_202412.py),你可以通过简单地配置复现官方评测结果。
|
||||
|
@ -399,6 +399,11 @@
    category: Math
    paper: https://proceedings.mlr.press/v202/gao23f/gao23f.pdf
    configpath: opencompass/configs/datasets/gsm_hard
- hle:
    name: HLE(Humanity's Last Exam)
    category: Reasoning
    paper: https://lastexam.ai/paper
    configpath: opencompass/configs/datasets/HLE
- hellaswag:
    name: HellaSwag
    category: Reasoning
|
65
docs/en/advanced_guides/persistence.md
Normal file
@ -0,0 +1,65 @@
# Evaluation Results Persistence

## Introduction

Normally, the evaluation results of OpenCompass are saved to your work directory. In some cases, however, users may need to share results with each other or quickly browse existing public evaluation results. We therefore provide an interface that transfers evaluation results to an external public data station and, on top of that, supports uploading, overwriting, and reading them.

## Quick Start

### Uploading

By adding an argument to the evaluation command, or a configuration entry to the Eval script, the evaluation results can be stored in the path you specify. Here are the examples:

(Approach 1) Add the `-sp` option to the evaluation command and specify your public path.

```bash
opencompass ... -sp '/your_path'
```

(Approach 2) Add the configuration in the Eval script.

```python
station_path = '/your_path'
```

### Overwriting

Before uploading, the storage method above first checks whether the same task result already exists in the data station, based on the `abbr` attribute in the model and dataset configurations. If the results already exist, the upload is cancelled. If you need to update these results, add the `--station-overwrite` option to the command. Here is an example:

```bash
opencompass ... -sp '/your_path' --station-overwrite
```

### Reading

You can read existing results directly from the data station to avoid duplicate evaluation tasks. The results that are read participate directly in the `summarize` step. With this configuration, only the tasks whose results are not yet stored in the data station will be launched. Here is an example:

```bash
opencompass ... -sp '/your_path' --read-from-station
```

### Command Combination

1. Only upload the results under your latest working directory to the data station, without running the tasks whose results are missing:

```bash
opencompass ... -sp '/your_path' -r latest -m viz
```

## Storage Format of the Data Station

In the data station, the evaluation results are stored as one `json` file per `model-dataset` pair. The directory layout is `/your_path/dataset_name/model_name.json`. Each `json` file stores a dictionary with the corresponding results, including `predictions`, `results`, and `cfg`. Here is an example:

```python
Result = {
    'predictions': List[Dict],
    'results': Dict,
    'cfg': Dict = {
        'models': Dict,
        'datasets': Dict,
        (Only for subjective datasets) 'judge_models': Dict
    }
}
```

Among these three keys, `predictions` records the model's prediction for each item in the dataset, `results` records the model's total score on the dataset, and `cfg` records the detailed configurations of the model and the dataset in this evaluation task.
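As a minimal illustration of this layout, the sketch below loads one stored result file and inspects its keys. The helper name `load_station_result` and the dataset/model names are placeholders for this example, not part of the OpenCompass API:

```python
import json
from pathlib import Path


def load_station_result(station_path: str, dataset_name: str, model_name: str) -> dict:
    """Load one stored `model-dataset` result file from the data station."""
    result_file = Path(station_path) / dataset_name / f'{model_name}.json'
    with result_file.open(encoding='utf-8') as f:
        return json.load(f)


# Placeholder names: replace them with a real dataset abbr and model abbr.
result = load_station_result('/your_path', 'demo_gsm8k_chat_gen', 'internlm3-8b-instruct-turbomind')
print(result['results'])           # overall scores of the model on this dataset
print(len(result['predictions']))  # number of per-sample predictions
print(result['cfg'].keys())        # 'models', 'datasets' (and 'judge_models' for subjective sets)
```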
|
@ -39,6 +39,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
|
||||
user_guides/evaluation.md
|
||||
user_guides/experimentation.md
|
||||
user_guides/metrics.md
|
||||
user_guides/deepseek_r1.md
|
||||
|
||||
.. _Prompt:
|
||||
.. toctree::
|
||||
@ -66,6 +67,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
|
||||
advanced_guides/code_eval.md
|
||||
advanced_guides/code_eval_service.md
|
||||
advanced_guides/subjective_evaluation.md
|
||||
advanced_guides/persistence.md
|
||||
|
||||
.. _Tools:
|
||||
.. toctree::
|
||||
|
192
docs/en/user_guides/deepseek_r1.md
Normal file
@ -0,0 +1,192 @@
|
||||
# Tutorial for Evaluating Reasoning Models
|
||||
|
||||
OpenCompass provides an evaluation tutorial for DeepSeek R1 series reasoning models (mathematical datasets).
|
||||
|
||||
- At the model level, we recommend using the sampling approach to reduce repetitions caused by greedy decoding
|
||||
- For datasets with limited samples, we employ multiple evaluation runs and take the average
|
||||
- For answer validation, we utilize LLM-based verification to reduce misjudgments from rule-based evaluation
|
||||
|
||||
## Installation and Preparation
|
||||
|
||||
Please follow OpenCompass's installation guide.
|
||||
|
||||
## Evaluation Configuration Setup
|
||||
|
||||
We provide example configurations in `examples/eval_deepseek_r1.py`. Below is the configuration explanation:
|
||||
|
||||
### Configuration Interpretation
|
||||
|
||||
#### 1. Dataset and Validator Configuration
|
||||
|
||||
```python
|
||||
# Configuration supporting multiple runs (example)
|
||||
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
|
||||
|
||||
datasets = sum(
|
||||
(v for k, v in locals().items() if k.endswith('_datasets')),
|
||||
[],
|
||||
)
|
||||
|
||||
# LLM validator configuration. Users need to deploy API services via LMDeploy/vLLM/SGLang or use OpenAI-compatible endpoints
|
||||
verifier_cfg = dict(
|
||||
abbr='qwen2-5-32B-Instruct',
|
||||
type=OpenAISDK,
|
||||
path='Qwen/Qwen2.5-32B-Instruct', # Replace with actual path
|
||||
key='YOUR_API_KEY', # Use real API key
|
||||
openai_api_base=['http://your-api-endpoint'], # Replace with API endpoint
|
||||
query_per_second=16,
|
||||
batch_size=1024,
|
||||
temperature=0.001,
|
||||
max_out_len=16384
|
||||
)
|
||||
|
||||
# Apply validator to all datasets
|
||||
for item in datasets:
|
||||
if 'judge_cfg' in item['eval_cfg']['evaluator']:
|
||||
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
|
||||
```
|
||||
|
||||
#### 2. Model Configuration
|
||||
|
||||
We provide an example configuration that uses LMDeploy as the inference backend; users can modify `path` (i.e., the HuggingFace path) to evaluate other models.
|
||||
|
||||
```python
|
||||
# LMDeploy model configuration example
|
||||
models = [
|
||||
dict(
|
||||
type=TurboMindModelwithChatTemplate,
|
||||
abbr='deepseek-r1-distill-qwen-7b-turbomind',
|
||||
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
|
||||
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
|
||||
gen_config=dict(
|
||||
do_sample=True,
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
max_new_tokens=32768
|
||||
),
|
||||
max_seq_len=32768,
|
||||
batch_size=64,
|
||||
run_cfg=dict(num_gpus=1),
|
||||
pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
),
|
||||
# Extendable 14B/32B configurations...
|
||||
]
|
||||
```
|
||||
|
||||
#### 3. Evaluation Process Configuration
|
||||
|
||||
```python
|
||||
# Inference configuration
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
|
||||
```
|
||||
|
||||
#### 4. Summary Configuration
|
||||
|
||||
```python
|
||||
# Multiple runs results average configuration
|
||||
summary_groups = [
|
||||
{
|
||||
'name': 'AIME2024-Aveage8',
|
||||
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
|
||||
},
|
||||
# Other dataset average configurations...
|
||||
]
|
||||
|
||||
summarizer = dict(
|
||||
dataset_abbrs=[
|
||||
['AIME2024-Aveage8', 'naive_average'],
|
||||
# Other dataset metrics...
|
||||
],
|
||||
summary_groups=summary_groups
|
||||
)
|
||||
|
||||
# Work directory configuration
|
||||
work_dir = "outputs/deepseek_r1_reasoning"
|
||||
```
|
||||
|
||||
## Evaluation Execution
|
||||
|
||||
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using a total of 1 GPU
|
||||
|
||||
```bash
|
||||
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
|
||||
```
|
||||
|
||||
Evaluation logs will be output in the command line.
|
||||
|
||||
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using a total of 8 GPUs
|
||||
|
||||
You need to modify the `infer` configuration in the configuration file and set `num_worker` to 8
|
||||
|
||||
```python
|
||||
# Inference configuration
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
|
||||
```
|
||||
|
||||
At the same time, remove the `--debug` parameter from the evaluation command
|
||||
|
||||
```bash
|
||||
opencompass examples/eval_deepseek_r1.py --dump-eval-details
|
||||
```
|
||||
|
||||
In this mode, OpenCompass will use multithreading to launch `$num_worker` tasks. Specific logs will not be displayed in the command line; instead, detailed evaluation logs will be written under `$work_dir`.
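For reference, the working directory of a run typically has a layout like the sketch below (the timestamp is illustrative); the per-task logs mentioned above are written under `logs/infer` and `logs/eval`:

```text
outputs/deepseek_r1_reasoning/
└── 20250228_120000/        # one timestamped run (illustrative)
    ├── configs/            # dumped evaluation config
    ├── logs/
    │   ├── infer/          # per-task inference logs
    │   └── eval/           # per-task evaluation logs
    ├── predictions/        # raw model predictions
    ├── results/            # per-dataset scores
    └── summary/            # final summarized tables
```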
|
||||
|
||||
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using a total of 8 GPUs
|
||||
|
||||
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if an inference backend is used, its parallelism parameters must be adjusted as well, e.g. `tp` in LMDeploy set to 2), and `num_worker` in the `infer` configuration needs to be set to 4:
|
||||
|
||||
```python
|
||||
models += [
|
||||
dict(
|
||||
type=TurboMindModelwithChatTemplate,
|
||||
abbr='deepseek-r1-distill-qwen-14b-turbomind',
|
||||
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
|
||||
engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
|
||||
gen_config=dict(
|
||||
do_sample=True,
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
max_new_tokens=32768),
|
||||
max_seq_len=32768,
|
||||
max_out_len=32768,
|
||||
batch_size=128,
|
||||
run_cfg=dict(num_gpus=2),
|
||||
pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
),
|
||||
]
|
||||
```
|
||||
|
||||
```python
|
||||
# Inference configuration
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
|
||||
```
|
||||
|
||||
### Evaluation Results
|
||||
|
||||
The evaluation results are displayed as follows:
|
||||
|
||||
```bash
|
||||
dataset             version    metric          mode    deepseek-r1-distill-qwen-7b-turbomind
------------------  ---------  --------------  ------  ---------------------------------------
MATH                -          -               -
AIME2024-Aveage8    -          naive_average   gen     56.25
|
||||
|
||||
```
|
||||
|
||||
## Performance Baseline
|
||||
|
||||
Since the model uses Sampling for decoding, and the AIME dataset size is small, there may still be a performance fluctuation of 1-3 points even when averaging over 8 evaluations.
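As a rough, back-of-the-envelope check on the size of this fluctuation (assuming AIME 2024's 30 questions and roughly independent runs), the standard error of the score averaged over 8 runs is approximately

$$
\sigma_{\bar{x}} \approx 100 \times \sqrt{\frac{p(1-p)}{n \cdot k}} = 100 \times \sqrt{\frac{0.56 \times 0.44}{30 \times 8}} \approx 3.2 \text{ points},
$$

where $p \approx 0.56$ is the per-question accuracy, $n = 30$ is the number of questions, and $k = 8$ is the number of runs, which is consistent with the 1-3 point range stated above.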
|
||||
|
||||
| Model | Dataset | Metric | Value |
|
||||
| ---------------------------- | -------- | -------- | ----- |
|
||||
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
|
||||
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
|
||||
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
|
65
docs/zh_cn/advanced_guides/persistence.md
Normal file
@ -0,0 +1,65 @@
|
||||
# 评测结果持久化
|
||||
|
||||
## 介绍
|
||||
|
||||
通常情况下,OpenCompass的评测结果将会保存到工作目录下。 但在某些情况下,可能会产生用户间的数据共享,以及快速查看已有的公共评测结果等需求。 因此,我们提供了一个能够将评测结果快速转存到外部公共数据站的接口,并且在此基础上提供了对数据站的上传、更新、读取等功能。
|
||||
|
||||
## 快速开始
|
||||
|
||||
### 向数据站存储数据
|
||||
|
||||
通过在CLI评测指令中添加`args`或在Eval脚本中添加配置,即可将本次评测结果存储到您所指定的路径,示例如下:
|
||||
|
||||
(方式1)在指令中添加`args`选项并指定你的公共路径地址。
|
||||
|
||||
```bash
|
||||
opencompass ... -sp '/your_path'
|
||||
```
|
||||
|
||||
(方式2)在Eval脚本中添加配置。
|
||||
|
||||
```python
|
||||
station_path = '/your_path'
|
||||
```
|
||||
|
||||
### 向数据站更新数据
|
||||
|
||||
上述存储方法在上传数据前会首先根据模型和数据集配置中的`abbr`属性来判断数据站中是否已有相同任务结果。若已有结果,则取消本次存储。如果您需要更新这部分结果,请在指令中添加`station-overwrite`选项,示例如下:
|
||||
|
||||
```bash
|
||||
opencompass ... -sp '/your_path' --station-overwrite
|
||||
```
|
||||
|
||||
### 读取数据站中已有的结果
|
||||
|
||||
您可以直接从数据站中读取已有的结果,以避免重复进行评测任务。读取到的结果会直接参与到`summarize`步骤。采用该配置时,仅有数据站中未存储结果的任务会被启动。示例如下:
|
||||
|
||||
```bash
|
||||
opencompass ... -sp '/your_path' --read-from-station
|
||||
```
|
||||
|
||||
### 指令组合
|
||||
|
||||
1. 仅向数据站上传最新工作目录下结果,不补充运行缺失结果的任务:
|
||||
|
||||
```bash
|
||||
opencompass ... -sp '/your_path' -r latest -m viz
|
||||
```
|
||||
|
||||
## 数据站存储格式
|
||||
|
||||
在数据站中,评测结果按照每个`model-dataset`对的结果存储为`json`文件。具体的目录组织形式为`/your_path/dataset_name/model_name.json`。每个`json`文件都存储了对应结果的字典,包括`predictions`、`results`以及`cfg`三个子项,具体示例如下:
|
||||
|
||||
```python
|
||||
Result = {
|
||||
'predictions': List[Dict],
|
||||
'results': Dict,
|
||||
'cfg': Dict = {
|
||||
'models': Dict,
|
||||
'datasets': Dict,
|
||||
(Only subjective datasets)'judge_models': Dict
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
其中,`predictions`记录了模型对数据集中每一条数据的prediction的结果,`results`记录了模型在该数据集上的评分,`cfg`记录了该评测任务中模型和数据集的详细配置。
|
@ -40,6 +40,7 @@ OpenCompass 上手路线
|
||||
user_guides/evaluation.md
|
||||
user_guides/experimentation.md
|
||||
user_guides/metrics.md
|
||||
user_guides/deepseek_r1.md
|
||||
|
||||
.. _提示词:
|
||||
.. toctree::
|
||||
@ -66,6 +67,7 @@ OpenCompass 上手路线
|
||||
advanced_guides/code_eval.md
|
||||
advanced_guides/code_eval_service.md
|
||||
advanced_guides/subjective_evaluation.md
|
||||
advanced_guides/persistence.md
|
||||
|
||||
.. _工具:
|
||||
.. toctree::
|
||||
|
192
docs/zh_cn/user_guides/deepseek_r1.md
Normal file
@ -0,0 +1,192 @@
|
||||
# 强推理模型评测教程
|
||||
|
||||
OpenCompass提供针对DeepSeek R1系列推理模型的评测教程(数学数据集)。
|
||||
|
||||
- 在模型层面,我们建议使用Sampling方式,以减少因为Greedy评测带来的大量重复
|
||||
- 在数据集层面,我们对数据量较小的评测基准,使用多次评测并取平均的方式。
|
||||
- 在答案验证层面,为了减少基于规则评测带来的误判,我们统一使用基于LLM验证的方式进行评测。
|
||||
|
||||
## 安装和准备
|
||||
|
||||
请按OpenCompass安装教程进行安装。
|
||||
|
||||
## 构建评测配置
|
||||
|
||||
我们在 `example/eval_deepseek_r1.py` 中提供了示例配置,以下对评测配置进行解读
|
||||
|
||||
### 评测配置解读
|
||||
|
||||
#### 1. 数据集与验证器配置
|
||||
|
||||
```python
|
||||
# 支持多运行次数的数据集配置(示例)
|
||||
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
|
||||
|
||||
datasets = sum(
|
||||
(v for k, v in locals().items() if k.endswith('_datasets')),
|
||||
[],
|
||||
)
|
||||
|
||||
# 设置LLM验证器, 用户需事先通过LMDeploy/vLLM/SGLang等工具启动API 评测服务器,或者直接使用兼容OpenAI标准接口的模型服务
|
||||
verifier_cfg = dict(
|
||||
abbr='qwen2-5-32B-Instruct',
|
||||
type=OpenAISDK,
|
||||
path='Qwen/Qwen2.5-32B-Instruct', # 需替换实际路径
|
||||
key='YOUR_API_KEY', # 需替换真实API Key
|
||||
openai_api_base=['http://your-api-endpoint'], # 需替换API地址
|
||||
query_per_second=16,
|
||||
batch_size=1024,
|
||||
temperature=0.001,
|
||||
max_out_len=16384
|
||||
)
|
||||
|
||||
# 应用验证器到所有数据集
|
||||
for item in datasets:
|
||||
if 'judge_cfg' in item['eval_cfg']['evaluator']:
|
||||
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
|
||||
```
|
||||
|
||||
#### 2. 模型配置
|
||||
|
||||
我们提供了基于LMDeploy作为推理后端的评测示例,用户可以通过修改path(即HF路径)
|
||||
|
||||
```python
|
||||
# LMDeploy模型配置示例
|
||||
models = [
|
||||
dict(
|
||||
type=TurboMindModelwithChatTemplate,
|
||||
abbr='deepseek-r1-distill-qwen-7b-turbomind',
|
||||
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
|
||||
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
|
||||
gen_config=dict(
|
||||
do_sample=True,
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
max_new_tokens=32768
|
||||
),
|
||||
max_seq_len=32768,
|
||||
batch_size=64,
|
||||
run_cfg=dict(num_gpus=1),
|
||||
pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
),
|
||||
# 可扩展14B/32B配置...
|
||||
]
|
||||
```
|
||||
|
||||
#### 3. 评估流程配置
|
||||
|
||||
```python
|
||||
# 推理配置
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# 评估配置
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
|
||||
```
|
||||
|
||||
#### 4. 结果汇总配置
|
||||
|
||||
```python
|
||||
# 多运行结果平均配置
|
||||
summary_groups = [
|
||||
{
|
||||
'name': 'AIME2024-Aveage8',
|
||||
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
|
||||
},
|
||||
# 其他数据集平均配置...
|
||||
]
|
||||
|
||||
summarizer = dict(
|
||||
dataset_abbrs=[
|
||||
['AIME2024-Aveage8', 'naive_average'],
|
||||
# 其他数据集指标...
|
||||
],
|
||||
summary_groups=summary_groups
|
||||
)
|
||||
|
||||
# 工作目录设置
|
||||
work_dir = "outputs/deepseek_r1_reasoning"
|
||||
```
|
||||
|
||||
## 执行评测
|
||||
|
||||
### 场景1:模型1卡加载,数据1个worker评测,共使用1个GPU
|
||||
|
||||
```bash
|
||||
opencompass example/eval_deepseek_r1.py --debug --dump-eval-details
|
||||
```
|
||||
|
||||
评测日志会在命令行输出。
|
||||
|
||||
### 场景2:模型1卡加载,数据8个worker评测,共使用8个GPU
|
||||
|
||||
需要修改配置文件中的infer配置,将num_worker设置为8
|
||||
|
||||
```python
|
||||
# 推理配置
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
|
||||
```
|
||||
|
||||
同时评测命令去掉`--debug`参数
|
||||
|
||||
```bash
|
||||
opencompass example/eval_deepseek_r1.py --dump-eval-details
|
||||
```
|
||||
|
||||
此模式下,OpenCompass将使用多线程启动`$num_worker`个任务,命令行不展示具体日志,具体的评测日志将会在`$work_dir`下中展示。
|
||||
|
||||
### 场景3:模型2卡加载,数据4个worker评测,共使用8个GPU
|
||||
|
||||
需要注意模型配置中,`run_cfg`中的`num_gpus`需要设置为2(如使用推理后端,则推理后端的参数也需要同步修改,比如LMDeploy中的tp需要设置为2),同时修改`infer`配置中的`num_worker`为4
|
||||
|
||||
```python
|
||||
models += [
|
||||
dict(
|
||||
type=TurboMindModelwithChatTemplate,
|
||||
abbr='deepseek-r1-distill-qwen-14b-turbomind',
|
||||
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
|
||||
engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
|
||||
gen_config=dict(
|
||||
do_sample=True,
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
max_new_tokens=32768),
|
||||
max_seq_len=32768,
|
||||
max_out_len=32768,
|
||||
batch_size=128,
|
||||
run_cfg=dict(num_gpus=2),
|
||||
pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
),
|
||||
]
|
||||
```
|
||||
|
||||
```python
|
||||
# 推理配置
|
||||
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
|
||||
```
|
||||
|
||||
### 评测结果
|
||||
|
||||
评测结果展示如下:
|
||||
|
||||
```bash
|
||||
dataset             version    metric          mode    deepseek-r1-distill-qwen-7b-turbomind
------------------  ---------  --------------  ------  ---------------------------------------
MATH                -          -               -
AIME2024-Aveage8    -          naive_average   gen     56.25
|
||||
|
||||
```
|
||||
|
||||
## 性能基线参考
|
||||
|
||||
由于模型使用Sampling进行解码,同时AIME数据量较小,使用8次评测取平均情况下,仍会出现1-3分的性能抖动
|
||||
|
||||
| 模型 | 数据集 | 指标 | 数值 |
|
||||
| ---------------------------- | -------- | -------- | ---- |
|
||||
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
|
||||
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
|
||||
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
|
212
examples/eval_deepseek_r1.py
Normal file
@ -0,0 +1,212 @@
|
||||
# Support AIME-2024 with Repeat8
|
||||
# Support MATH-500
|
||||
# Support OlympiadBench
|
||||
# Support OmniMath
|
||||
# Support LiveMathBench-202412-Hard
|
||||
|
||||
import os.path as osp
|
||||
from itertools import product
|
||||
from opencompass.models import OpenAISDK
|
||||
from mmengine.config import read_base
|
||||
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
|
||||
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
|
||||
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
|
||||
from opencompass.runners import LocalRunner
|
||||
from opencompass.models import (
|
||||
TurboMindModelwithChatTemplate,
|
||||
)
|
||||
|
||||
#######################################################################
|
||||
# PART 1 Datasets List #
|
||||
#######################################################################
|
||||
with read_base():
|
||||
# You can comment out the datasets you don't want to evaluate
|
||||
|
||||
# Datasets
|
||||
# from opencompass.configs.datasets.math.math_prm800k_500_llmverify_gen_6ff468 import math_datasets # 1 Run
|
||||
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets # 8 Run
|
||||
# from opencompass.configs.datasets.OlympiadBench.OlympiadBench_0shot_llmverify_gen_be8b13 import olympiadbench_datasets
|
||||
# from opencompass.configs.datasets.omni_math.omni_math_llmverify_gen_ccf9c0 import omnimath_datasets # 1 Run
|
||||
# from opencompass.configs.datasets.livemathbench.livemathbench_hard_custom_llmverify_gen_85d0ef import livemathbench_datasets
|
||||
|
||||
|
||||
# Summarizer
|
||||
from opencompass.configs.summarizers.groups.OlympiadBench import OlympiadBenchMath_summary_groups
|
||||
|
||||
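# Gather all dataset lists imported above via read_base() (variables ending in
# '_datasets') into a single flat list for evaluation.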
datasets = sum(
|
||||
(v for k, v in locals().items() if k.endswith('_datasets')),
|
||||
[],
|
||||
)
|
||||
|
||||
# Set LLM Verifier used for each dataset
|
||||
|
||||
verifier_cfg = dict(
|
||||
abbr='qwen2-5-32B-Instruct',
|
||||
type=OpenAISDK,
|
||||
path='Qwen/Qwen2.5-32B-Instruct', # You need to set your own judge model path
|
||||
key='sk-1234', # You need to set your own API key
|
||||
openai_api_base=[
|
||||
'http://172.30.56.1:4000/v1', # You need to set your own API base
|
||||
],
|
||||
meta_template=dict(
|
||||
round=[
|
||||
dict(role='HUMAN', api_role='HUMAN'),
|
||||
dict(role='BOT', api_role='BOT', generate=True),
|
||||
],
|
||||
),
|
||||
query_per_second=16,
|
||||
batch_size=1024,
|
||||
temperature=0.001,
|
||||
tokenizer_path='gpt-4o-2024-05-13',
|
||||
verbose=True,
|
||||
max_out_len=16384,
|
||||
# max_seq_len=32768,
|
||||
max_seq_len=49152,
|
||||
)
|
||||
|
||||
for item in datasets:
|
||||
# item['infer_cfg']['inferencer']['max_out_len'] = 32768  # Uncomment this line if you want to avoid length cutoff
|
||||
if 'judge_cfg' in item['eval_cfg']['evaluator']:
|
||||
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
|
||||
|
||||
|
||||
#######################################################################
|
||||
# PART 2 Model List #
|
||||
#######################################################################
|
||||
|
||||
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
|
||||
|
||||
models += [
|
||||
# You can comment out the models you don't want to evaluate
|
||||
# All models use sampling mode
|
||||
dict(
|
||||
type=TurboMindModelwithChatTemplate,
|
||||
abbr='deepseek-r1-distill-qwen-7b-turbomind',
|
||||
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
|
||||
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
|
||||
gen_config=dict(
|
||||
do_sample=True,
|
||||
temperature=0.6,
|
||||
top_p=0.95,
|
||||
max_new_tokens=32768),
|
||||
max_seq_len=32768,
|
||||
max_out_len=32768,
|
||||
batch_size=64,
|
||||
run_cfg=dict(num_gpus=1),
|
||||
pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
),
|
||||
# dict(
|
||||
# type=TurboMindModelwithChatTemplate,
|
||||
# abbr='deepseek-r1-distill-qwen-14b-turbomind',
|
||||
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
|
||||
# engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
|
||||
# gen_config=dict(
|
||||
# do_sample=True,
|
||||
# temperature=0.6,
|
||||
# top_p=0.95,
|
||||
# max_new_tokens=32768),
|
||||
# max_seq_len=32768,
|
||||
# max_out_len=32768,
|
||||
# batch_size=128,
|
||||
# run_cfg=dict(num_gpus=2),
|
||||
# pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
# ),
|
||||
# dict(
|
||||
# type=TurboMindModelwithChatTemplate,
|
||||
# abbr='deepseek-r1-distill-qwen-32b-turbomind',
|
||||
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
|
||||
# engine_config=dict(session_len=32768, max_batch_size=128, tp=4),
|
||||
# gen_config=dict(
|
||||
# do_sample=True,
|
||||
# temperature=0.6,
|
||||
# top_p=0.95,
|
||||
# max_new_tokens=16384),
|
||||
# max_seq_len=32768,
|
||||
# max_out_len=16384,
|
||||
# batch_size=128,
|
||||
# run_cfg=dict(num_gpus=4),
|
||||
# pred_postprocessor=dict(type=extract_non_reasoning_content)
|
||||
# ),
|
||||
]
|
||||
|
||||
#######################################################################
|
||||
# PART 3 Inference/Evaluation #
|
||||
#######################################################################
|
||||
|
||||
# Inference configuration
|
||||
infer = dict(
|
||||
partitioner=dict(
|
||||
type=NumWorkerPartitioner,
|
||||
num_worker=1
|
||||
# Similar to data parallelism: how many workers to run for inference;
# each worker evaluates a part of the dataset. Total GPUs = num_worker * num_gpus_per_worker.
# For example, if you have 8 GPUs and a 7B model uses 1 GPU per instance, set num_worker=8
# to fully utilize the GPUs.
# If you have 8 GPUs and a 14B model uses 2 GPUs per instance, set num_worker=4.
|
||||
),
|
||||
runner=dict(
|
||||
type=LocalRunner,
|
||||
task=dict(type=OpenICLInferTask)
|
||||
),
|
||||
)
|
||||
|
||||
# Evaluation configuration
|
||||
eval = dict(
|
||||
partitioner=dict(
|
||||
type=NaivePartitioner, n=8
|
||||
),
|
||||
runner=dict(
|
||||
type=LocalRunner,
|
||||
task=dict(
|
||||
type=OpenICLEvalTask)
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
#######################################################################
|
||||
# PART 4 Summarizer #
|
||||
#######################################################################
|
||||
|
||||
|
||||
summary_groups = sum(
|
||||
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
|
||||
)
|
||||
|
||||
summary_groups.extend([
|
||||
{
|
||||
'name': 'AIME2024-Aveage8',
|
||||
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
|
||||
},
|
||||
{
|
||||
'name': 'LiveMathBench-v202412-Hard-Aveage8',
|
||||
'subsets':[[
|
||||
f'livemathbench_hard_custom_{split}_run{run_idx}', 'accuracy']
|
||||
for split, run_idx in product(['hard_cn', 'hard_en'], range(8))
|
||||
]
|
||||
}
|
||||
])
|
||||
|
||||
# Summarizer
|
||||
summarizer = dict(
|
||||
dataset_abbrs=[
|
||||
'MATH',
|
||||
# ['LiveMathBench-k1-n1', 'pass@1'],
|
||||
# ['LiveMathBench-v202412-greedy', 'G-Pass@1_0.0'],
|
||||
# ['aime2024', 'accuracy'],
|
||||
['math_prm800k_500-llmjudge', 'accuracy'],
|
||||
['AIME2024-Aveage8', 'naive_average'],
|
||||
['LiveMathBench-v202412-Hard-Aveage8', 'naive_average'],
|
||||
['OlympiadBenchMath', 'accuracy'],
|
||||
['OmniMath', 'accuracy'],
|
||||
],
|
||||
summary_groups=summary_groups,
|
||||
)
|
||||
|
||||
|
||||
#######################################################################
|
||||
# PART 5 Utils #
|
||||
#######################################################################
|
||||
|
||||
work_dir = 'outputs/deepseek_r1_reasoning'
|
||||
|
||||
|
@ -1 +1 @@
|
||||
__version__ = '0.4.0'
|
||||
__version__ = '0.4.1'
|
||||
|
@ -12,7 +12,8 @@ from mmengine.config import Config, DictAction
|
||||
from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
|
||||
from opencompass.runners import SlurmRunner
|
||||
from opencompass.summarizers import DefaultSummarizer
|
||||
from opencompass.utils import LarkReporter, get_logger
|
||||
from opencompass.utils import (LarkReporter, get_logger, read_from_station,
|
||||
save_to_station)
|
||||
from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
|
||||
get_config_from_arg)
|
||||
|
||||
@ -127,6 +128,27 @@ def parse_args():
|
||||
'correctness of each sample, bpb, etc.',
|
||||
action='store_true',
|
||||
)
|
||||
|
||||
parser.add_argument('-sp',
|
||||
'--station-path',
|
||||
help='Path to your results station.',
|
||||
type=str,
|
||||
default=None,
|
||||
)
|
||||
|
||||
parser.add_argument('--station-overwrite',
|
||||
help='Whether to overwrite the results at station.',
|
||||
action='store_true',
|
||||
)
|
||||
|
||||
parser.add_argument(
|
||||
'--read-from-station',
|
||||
help='Whether to read existing evaluation results '
'from the data station.',
|
||||
action='store_true',
|
||||
)
|
||||
|
||||
|
||||
# set srun args
|
||||
slurm_parser = parser.add_argument_group('slurm_args')
|
||||
parse_slurm_args(slurm_parser)
|
||||
@ -260,6 +282,12 @@ def main():
|
||||
# types cannot be serialized
|
||||
cfg = Config.fromfile(output_config_path, format_python_code=False)
|
||||
|
||||
# get existing results from the station
|
||||
if args.read_from_station:
|
||||
existing_results_list = read_from_station(cfg, args)
|
||||
rs_exist_results = [comb['combination'] for comb in existing_results_list]
|
||||
cfg['rs_exist_results'] = rs_exist_results
|
||||
|
||||
# report to lark bot if specify --lark
|
||||
if not args.lark:
|
||||
cfg['lark_bot_url'] = None
|
||||
@ -267,6 +295,7 @@ def main():
|
||||
content = f'{getpass.getuser()}\'s task has been launched!'
|
||||
LarkReporter(cfg['lark_bot_url']).post(content)
|
||||
|
||||
# infer
|
||||
if args.mode in ['all', 'infer']:
|
||||
# When user have specified --slurm or --dlc, or have not set
|
||||
# "infer" in config, we will provide a default configuration
|
||||
@ -348,6 +377,10 @@ def main():
|
||||
else:
|
||||
runner(tasks)
|
||||
|
||||
# save to station
|
||||
if args.station_path is not None or cfg.get('station_path') is not None:
|
||||
save_to_station(cfg, args)
|
||||
|
||||
# visualize
|
||||
if args.mode in ['all', 'eval', 'viz']:
|
||||
summarizer_cfg = cfg.get('summarizer', {})
|
||||
|
5
opencompass/configs/datasets/HLE/hle_gen.py
Normal file
@ -0,0 +1,5 @@
|
||||
from mmengine.config import read_base
|
||||
|
||||
with read_base():
|
||||
# Default use LLM as a judge
|
||||
from .hle_llmverify_gen_6ff468 import hle_datasets # noqa: F401, F403
|
91
opencompass/configs/datasets/HLE/hle_llmverify_gen_6ff468.py
Normal file
@ -0,0 +1,91 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.evaluator import GenericLLMEvaluator
|
||||
from opencompass.datasets import generic_llmjudge_postprocess
|
||||
from opencompass.datasets import HLEDataset
|
||||
|
||||
# ----------------------------- Detailed Config -----------------------------
|
||||
|
||||
math_reader_cfg = dict(input_columns=['problem'], output_column='answer')
|
||||
|
||||
math_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
|
||||
]
|
||||
),
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer),
|
||||
)
|
||||
|
||||
GRADER_TEMPLATE = """
|
||||
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
|
||||
|
||||
Here are some evaluation criteria:
|
||||
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
|
||||
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
|
||||
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
|
||||
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
|
||||
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
|
||||
|
||||
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
|
||||
A: CORRECT
|
||||
B: INCORRECT
|
||||
Just return the letters "A" or "B", with no text around it.
|
||||
|
||||
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
|
||||
|
||||
|
||||
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
|
||||
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
|
||||
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
|
||||
|
||||
Judging the correctness of candidates' answers:
|
||||
""".strip()
|
||||
|
||||
# Evaluation configuration
|
||||
math_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=GenericLLMEvaluator,
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[
|
||||
dict(
|
||||
role='SYSTEM',
|
||||
fallback_role='HUMAN',
|
||||
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
|
||||
],
|
||||
round=[
|
||||
dict(
|
||||
role='HUMAN',
|
||||
prompt = GRADER_TEMPLATE
|
||||
),
|
||||
]),
|
||||
),
|
||||
dataset_cfg=dict(
|
||||
type=HLEDataset,
|
||||
path='cais/hle',
|
||||
reader_cfg=math_reader_cfg,
|
||||
),
|
||||
judge_cfg=dict(),
|
||||
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
|
||||
),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
|
||||
hle_datasets = [
|
||||
dict(
|
||||
type=HLEDataset,
|
||||
abbr='hle_llmjudge',
|
||||
path='cais/hle',
|
||||
reader_cfg=math_reader_cfg,
|
||||
infer_cfg=math_infer_cfg,
|
||||
eval_cfg=math_eval_cfg,
|
||||
)
|
||||
]
|
@ -0,0 +1,105 @@
|
||||
from mmengine.config import read_base
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import OlympiadBenchDataset, OlympiadBenchEvaluator, olympiadbench_postprocess_v2
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.evaluator import GenericLLMEvaluator
|
||||
from opencompass.datasets import generic_llmjudge_postprocess
|
||||
|
||||
with read_base():
|
||||
from .OlympiadBench_categories import math_categories as categories
|
||||
|
||||
# Create prompter instance for problems
|
||||
olympiadbench_prompter_cfg = dict(
|
||||
type='OlympiadBenchPrompter'
|
||||
)
|
||||
|
||||
olympiadbench_reader_cfg = dict(
|
||||
input_columns=[
|
||||
'problem', 'language', 'subject', 'question_type',
|
||||
'answer_type', 'is_multiple_answer', 'unit', 'questions'
|
||||
],
|
||||
output_column='solution'
|
||||
)
|
||||
|
||||
GRADER_TEMPLATE = """
|
||||
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
|
||||
|
||||
Here are some evaluation criteria:
|
||||
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
|
||||
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
|
||||
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
|
||||
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
|
||||
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
|
||||
|
||||
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
|
||||
A: CORRECT
|
||||
B: INCORRECT
|
||||
Just return the letters "A" or "B", with no text around it.
|
||||
|
||||
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
|
||||
|
||||
|
||||
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
|
||||
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
|
||||
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
|
||||
|
||||
Judging the correctness of candidates' answers:
|
||||
""".strip()
|
||||
|
||||
|
||||
olympiadbenchMath_datasets = []
|
||||
for _name in categories:
|
||||
olympiadbench_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type='OlympiadBenchTemplate'
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer),
|
||||
)
|
||||
|
||||
# Evaluation configuration
|
||||
olympiadbench_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=GenericLLMEvaluator,
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[
|
||||
dict(
|
||||
role='SYSTEM',
|
||||
fallback_role='HUMAN',
|
||||
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
|
||||
],
|
||||
round=[
|
||||
dict(
|
||||
role='HUMAN',
|
||||
prompt = GRADER_TEMPLATE
|
||||
),
|
||||
]),
|
||||
),
|
||||
dataset_cfg=dict(
|
||||
type=OlympiadBenchDataset,
|
||||
path='opencompass/OlympiadBench',
|
||||
name=_name,
|
||||
reader_cfg=olympiadbench_reader_cfg,
|
||||
),
|
||||
judge_cfg=dict(),
|
||||
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
|
||||
),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
olympiadbenchMath_datasets.append(
|
||||
dict(
|
||||
type=OlympiadBenchDataset,
|
||||
abbr=f'OlympiadBench_{_name}',
|
||||
path='opencompass/OlympiadBench',
|
||||
name=_name,
|
||||
reader_cfg=olympiadbench_reader_cfg,
|
||||
infer_cfg=olympiadbench_infer_cfg,
|
||||
eval_cfg=olympiadbench_eval_cfg,
|
||||
)
|
||||
)
|
||||
|
||||
del _name
|
@ -5,3 +5,14 @@ categories = [
|
||||
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
|
||||
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
|
||||
]
|
||||
|
||||
math_categories = [
|
||||
'OE_TO_maths_en_COMP', # OpenEnded - TextOnly - maths - COMP
|
||||
'OE_TO_maths_zh_COMP', # OpenEnded - TextOnly - maths - COMP
|
||||
'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
|
||||
]
|
||||
|
||||
physics_categories = [
|
||||
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
|
||||
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
|
||||
]
|
||||
|
@ -1,53 +1,43 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import (
|
||||
BigCodeBenchDataset,
|
||||
BigCodeBenchEvaluator
|
||||
)
|
||||
|
||||
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
|
||||
|
||||
bigcodebench_full_reader_cfg = dict(
|
||||
input_columns=['complete_prompt'],
|
||||
output_column='test',
|
||||
input_columns=['complete_prompt'],
|
||||
output_column='test',
|
||||
)
|
||||
|
||||
|
||||
bigcodebench_full_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system',
|
||||
fallback_role='HUMAN',
|
||||
prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{complete_prompt}'),
|
||||
]
|
||||
)
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer, max_out_len=1024)
|
||||
)
|
||||
bigcodebench_full_infer_cfg = dict(prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{complete_prompt}'),
|
||||
])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer,
|
||||
max_out_len=1024))
|
||||
|
||||
bigcodebench_full_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=BigCodeBenchEvaluator,
|
||||
release_version='v0.1.2',
|
||||
eval_type='complete',
|
||||
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
remote_execute_api=
|
||||
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
|
||||
dataset_version='full',
|
||||
),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
bigcodebench_full_complete_datasets = [
|
||||
dict(
|
||||
abbr='bigcodebench_full_complete',
|
||||
type=BigCodeBenchDataset,
|
||||
path='opencompass/bigcodebench',
|
||||
reader_cfg=bigcodebench_full_reader_cfg,
|
||||
infer_cfg=bigcodebench_full_infer_cfg,
|
||||
eval_cfg=bigcodebench_full_eval_cfg,
|
||||
release_version='v0.1.2'
|
||||
)
|
||||
]
|
||||
dict(abbr='bigcodebench_full_complete',
|
||||
type=BigCodeBenchDataset,
|
||||
path='opencompass/bigcodebench',
|
||||
reader_cfg=bigcodebench_full_reader_cfg,
|
||||
infer_cfg=bigcodebench_full_infer_cfg,
|
||||
eval_cfg=bigcodebench_full_eval_cfg,
|
||||
release_version='v0.1.2')
|
||||
]
|
||||
|
@ -1,53 +1,43 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import (
|
||||
BigCodeBenchDataset,
|
||||
BigCodeBenchEvaluator
|
||||
)
|
||||
|
||||
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
|
||||
|
||||
bigcodebench_full_reader_cfg = dict(
|
||||
input_columns=['instruct_prompt'],
|
||||
output_column='test',
|
||||
input_columns=['instruct_prompt'],
|
||||
output_column='test',
|
||||
)
|
||||
|
||||
|
||||
bigcodebench_full_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system',
|
||||
fallback_role='HUMAN',
|
||||
prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{instruct_prompt}'),
|
||||
]
|
||||
)
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer, max_out_len=8192)
|
||||
)
|
||||
bigcodebench_full_infer_cfg = dict(prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{instruct_prompt}'),
|
||||
])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer,
|
||||
max_out_len=8192))
|
||||
|
||||
bigcodebench_full_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=BigCodeBenchEvaluator,
|
||||
release_version='v0.1.2',
|
||||
eval_type='instruct',
|
||||
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
remote_execute_api=
|
||||
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
|
||||
dataset_version='full',
|
||||
),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
bigcodebench_full_instruct_datasets = [
|
||||
dict(
|
||||
abbr='bigcodebench_full_instruct',
|
||||
type=BigCodeBenchDataset,
|
||||
path='opencompass/bigcodebench',
|
||||
reader_cfg=bigcodebench_full_reader_cfg,
|
||||
infer_cfg=bigcodebench_full_infer_cfg,
|
||||
eval_cfg=bigcodebench_full_eval_cfg,
|
||||
release_version='v0.1.2'
|
||||
)
|
||||
]
|
||||
dict(abbr='bigcodebench_full_instruct',
|
||||
type=BigCodeBenchDataset,
|
||||
path='opencompass/bigcodebench',
|
||||
reader_cfg=bigcodebench_full_reader_cfg,
|
||||
infer_cfg=bigcodebench_full_infer_cfg,
|
||||
eval_cfg=bigcodebench_full_eval_cfg,
|
||||
release_version='v0.1.2')
|
||||
]
|
||||
|
@ -1,40 +1,32 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import (
|
||||
BigCodeBenchDataset,
|
||||
BigCodeBenchEvaluator
|
||||
)
|
||||
|
||||
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
|
||||
|
||||
bigcodebench_hard_reader_cfg = dict(
|
||||
input_columns=['complete_prompt'],
|
||||
output_column='test',
|
||||
input_columns=['complete_prompt'],
|
||||
output_column='test',
|
||||
)
|
||||
|
||||
|
||||
bigcodebench_hard_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system',
|
||||
fallback_role='HUMAN',
|
||||
prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{complete_prompt}'),
|
||||
]
|
||||
)
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer, max_out_len=1024)
|
||||
)
|
||||
bigcodebench_hard_infer_cfg = dict(prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{complete_prompt}'),
|
||||
])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer,
|
||||
max_out_len=1024))
|
||||
|
||||
bigcodebench_hard_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=BigCodeBenchEvaluator,
|
||||
release_version='v0.1.2',
|
||||
eval_type='complete',
|
||||
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
remote_execute_api=
|
||||
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
|
||||
dataset_version='hard',
|
||||
),
|
||||
pred_role='BOT',
|
||||
@ -51,4 +43,4 @@ bigcodebench_hard_complete_datasets = [
|
||||
release_version='v0.1.2',
|
||||
dataset_version='hard',
|
||||
)
|
||||
]
|
||||
]
|
||||
|
@ -1,40 +1,32 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import (
|
||||
BigCodeBenchDataset,
|
||||
BigCodeBenchEvaluator
|
||||
)
|
||||
|
||||
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
|
||||
|
||||
bigcodebench_hard_reader_cfg = dict(
|
||||
input_columns=['instruct_prompt'],
|
||||
output_column='test',
|
||||
input_columns=['instruct_prompt'],
|
||||
output_column='test',
|
||||
)
|
||||
|
||||
|
||||
bigcodebench_hard_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system',
|
||||
fallback_role='HUMAN',
|
||||
prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{instruct_prompt}'),
|
||||
]
|
||||
)
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer, max_out_len=8192)
|
||||
)
|
||||
bigcodebench_hard_infer_cfg = dict(prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='{instruct_prompt}'),
|
||||
])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer,
|
||||
max_out_len=8192))
|
||||
|
||||
bigcodebench_hard_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=BigCodeBenchEvaluator,
|
||||
release_version='v0.1.2',
|
||||
eval_type='instruct',
|
||||
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
|
||||
remote_execute_api=
|
||||
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
|
||||
dataset_version='hard',
|
||||
),
|
||||
pred_role='BOT',
|
||||
@ -51,4 +43,4 @@ bigcodebench_hard_instruct_datasets = [
|
||||
release_version='v0.1.2',
|
||||
dataset_version='hard',
|
||||
)
|
||||
]
|
||||
]
|
||||
|
@ -0,0 +1,132 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.datasets import (LCBCodeGenerationDataset,
|
||||
LCBCodeExecutionDataset,
|
||||
LCBTestOutputPredictionDataset,
|
||||
LCBCodeGenerationEvaluator,
|
||||
LCBCodeExecutionEvaluator,
|
||||
LCBTestOutputEvaluator)
|
||||
|
||||
lcb_code_generation_reader_cfg = dict(
|
||||
input_columns=[
|
||||
'question_content',
|
||||
'format_prompt',
|
||||
],
|
||||
# output_column='evaluation_sample',
|
||||
output_column='question_id',
|
||||
)
|
||||
|
||||
SYSTEM_MESSAGE_GENERIC = 'You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program.' # noqa: E501
|
||||
|
||||
prompt_template = '### Question:\n{question_content}\n\n{format_prompt}' + \
|
||||
'### Answer: (use the provided format with backticks)\n\n'
|
||||
|
||||
# Code Generation Tasks
|
||||
lcb_code_generation_infer_cfg = dict(prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(round=[dict(role='HUMAN', prompt=prompt_template)])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer))
|
||||
|
||||
lcb_code_generation_eval_cfg = dict(
|
||||
evaluator=dict(type=LCBCodeGenerationEvaluator,
|
||||
num_process_evaluate=4,
|
||||
timeout=6,
|
||||
release_version='release_v5',
|
||||
start_date='2024-08-01',
|
||||
end_date='2025-02-01'),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
LCBCodeGeneration_dataset = dict(
|
||||
type=LCBCodeGenerationDataset,
|
||||
abbr='lcb_code_generation',
|
||||
path='opencompass/code_generation_lite',
|
||||
reader_cfg=lcb_code_generation_reader_cfg,
|
||||
infer_cfg=lcb_code_generation_infer_cfg,
|
||||
eval_cfg=lcb_code_generation_eval_cfg,
|
||||
release_version='release_v5',
|
||||
)
|
||||
|
||||
# Code Execution Dataset
|
||||
lcb_code_execution_reader_cfg = dict(
|
||||
input_columns=[
|
||||
'prompt',
|
||||
],
|
||||
output_column='evaluation_sample',
|
||||
)
|
||||
|
||||
lcb_code_execution_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[
|
||||
dict(
|
||||
role='SYSTEM',
|
||||
fallback_role='HUMAN',
|
||||
prompt=
|
||||
'You are an expert at Python programming, code execution, test case generation, and fuzzing.' # noqa: E501
|
||||
),
|
||||
],
|
||||
round=[dict(role='HUMAN', prompt='{prompt}')])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer))
|
||||
|
||||
lcb_code_execution_eval_cfg = dict(
|
||||
evaluator=dict(type=LCBCodeExecutionEvaluator, ),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
LCBCodeExecution_dataset = dict(
|
||||
type=LCBCodeExecutionDataset,
|
||||
abbr='lcb_code_execution',
|
||||
path='opencompass/execution-v2',
|
||||
reader_cfg=lcb_code_execution_reader_cfg,
|
||||
infer_cfg=lcb_code_execution_infer_cfg,
|
||||
eval_cfg=lcb_code_execution_eval_cfg,
|
||||
)
|
||||
|
||||
# Test Output Prediction Dataset
|
||||
lcb_test_output_reader_cfg = dict(
|
||||
input_columns=[
|
||||
'prompt',
|
||||
],
|
||||
output_column='evaluation_sample',
|
||||
)
|
||||
|
||||
system_prompt = 'You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program.' # noqa: E501
|
||||
|
||||
lcb_test_output_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
# begin=[
|
||||
# dict(
|
||||
# role='SYSTEM',
|
||||
# prompt=system_prompt
|
||||
# ),
|
||||
# ],
|
||||
round=[dict(role='HUMAN', prompt='{prompt}')])),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer))
|
||||
|
||||
lcb_test_output_eval_cfg = dict(
|
||||
evaluator=dict(type=LCBTestOutputEvaluator, ),
|
||||
pred_role='BOT',
|
||||
)
|
||||
|
||||
LCBTestOutput_dataset = dict(
|
||||
type=LCBTestOutputPredictionDataset,
|
||||
abbr='lcb_test_output',
|
||||
path='opencompass/test_generation',
|
||||
reader_cfg=lcb_test_output_reader_cfg,
|
||||
infer_cfg=lcb_test_output_infer_cfg,
|
||||
eval_cfg=lcb_test_output_eval_cfg,
|
||||
)
|
||||
|
||||
LCB_datasets = [
|
||||
LCBCodeGeneration_dataset,
|
||||
LCBCodeExecution_dataset,
|
||||
LCBTestOutput_dataset,
|
||||
]
|
@ -0,0 +1,96 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.evaluator import GenericLLMEvaluator
|
||||
from opencompass.datasets import CustomDataset
|
||||
from opencompass.datasets import generic_llmjudge_postprocess
|
||||
from itertools import product
|
||||
|
||||
livemathbench_reader_cfg = dict(input_columns=['question'], output_column='answer')
|
||||
|
||||
|
||||
# Inference configuration
|
||||
livemathbench_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
round=[
|
||||
dict(
|
||||
role='HUMAN',
|
||||
prompt='{question}\n',
|
||||
),
|
||||
]
|
||||
),
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer),
|
||||
)
|
||||
|
||||
|
||||
# Template for the LLM judge
|
||||
GRADER_TEMPLATE = """
|
||||
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
|
||||
|
||||
Here are some evaluation criteria:
|
||||
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
|
||||
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
|
||||
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
|
||||
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
|
||||
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
|
||||
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
|
||||
A: CORRECT
|
||||
B: INCORRECT
|
||||
Just return the letters "A" or "B", with no text around it.
|
||||
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
|
||||
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
|
||||
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
|
||||
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
|
||||
|
||||
Judging the correctness of candidates' answers:
|
||||
""".strip()
|
||||
|
||||
|
||||
|
||||
splits = ['hard_cn', 'hard_en']
|
||||
# Dataset configuration
|
||||
livemathbench_datasets = [
|
||||
dict(
|
||||
type=CustomDataset,
|
||||
abbr=f'livemathbench_hard_custom_{split}_run{run_idx}',
|
||||
path='data/LiveMathBench',
|
||||
local_mode=True,
|
||||
file_name=f'202412/{split}.jsonl',
|
||||
reader_cfg=livemathbench_reader_cfg,
|
||||
infer_cfg=livemathbench_infer_cfg,
|
||||
eval_cfg=dict(
|
||||
# # Evaluation configuration using LLM as judge
|
||||
evaluator=dict(
|
||||
type=GenericLLMEvaluator,
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[
|
||||
dict(
|
||||
role='SYSTEM',
|
||||
fallback_role='HUMAN',
|
||||
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
|
||||
)
|
||||
],
|
||||
round=[
|
||||
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
|
||||
],
|
||||
),
|
||||
),
|
||||
dataset_cfg=dict(
|
||||
type=CustomDataset,
|
||||
path='data/LiveMathBench',
|
||||
local_mode=True,
|
||||
file_name=f'202412/{split}.jsonl',
|
||||
reader_cfg=livemathbench_reader_cfg,
|
||||
),
|
||||
judge_cfg={},
|
||||
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
|
||||
),
|
||||
),
|
||||
) for split, run_idx in product(splits, range(8))
|
||||
]
|
@@ -9,7 +9,7 @@ livemathbench_dataset = dict(
    type=LiveMathBenchDataset,
    path='',
    k=16,
    replication=3,
    n=48,
    dataset_splits=['hard'],
    dataset_languages=['cn', 'en'],
    cot=True,
@@ -37,13 +37,7 @@ livemathbench_dataset = dict(
        evaluator=dict(
            type=LiveMathBenchEvaluator,
            model_name='',
            url=[],
            use_extract_model=False,
            extract_url=[],
            extract_model_name='',
            k=[4, 8, 16],
            replication=3,
            thresholds=[0.0, 0.25, 0.5, 0.75, 1.0]
            url=[]
        )
    )
)

@@ -9,7 +9,7 @@ livemathbench_dataset = dict(
    type=LiveMathBenchDataset,
    path='',
    k=1,
    replication=1,
    n=1,
    dataset_splits=['hard'],
    dataset_languages=['cn', 'en'],
    cot=True,
@@ -37,13 +37,7 @@ livemathbench_dataset = dict(
        evaluator=dict(
            type=LiveMathBenchEvaluator,
            model_name='',
            url=[],
            use_extract_model=False,
            extract_url=[],
            extract_model_name='',
            k=[1],
            replication=1,
            thresholds=[0.0]
            url=[]
        )
    )
)
|
@@ -88,7 +88,7 @@ math_eval_cfg = dict(
math_datasets = [
    dict(
        type=MATHDataset,
        abbr=f'math_prm800k_500-llmjudge-run{idx}',
        abbr=f'math_prm800k_500-llmverify-run{idx}',
        path='opencompass/math',
        file_name = 'test_prm800k_500.json',
        reader_cfg=math_reader_cfg,
|
@ -0,0 +1,89 @@
|
||||
from opencompass.openicl.icl_prompt_template import PromptTemplate
|
||||
from opencompass.openicl.icl_retriever import ZeroRetriever
|
||||
from opencompass.openicl.icl_inferencer import GenInferencer
|
||||
from opencompass.evaluator import GenericLLMEvaluator
|
||||
from opencompass.datasets import generic_llmjudge_postprocess
|
||||
from opencompass.datasets.omni_math import OmniMathDataset
|
||||
|
||||
|
||||
omnimath_reader_cfg = dict(
|
||||
input_columns=['problem'],
|
||||
output_column='answer'
|
||||
)
|
||||
|
||||
omnimath_infer_cfg = dict(
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
round=[
|
||||
dict(role='HUMAN', prompt='please answer the following mathematical question, put your final answer in \\boxed{}.\n\n{problem}'),
|
||||
]
|
||||
)
|
||||
),
|
||||
retriever=dict(type=ZeroRetriever),
|
||||
inferencer=dict(type=GenInferencer)
|
||||
)
|
||||
|
||||
|
||||
|
||||
GRADER_TEMPLATE = """
|
||||
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
|
||||
|
||||
Here are some evaluation criteria:
|
||||
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
|
||||
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
|
||||
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
|
||||
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
|
||||
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
|
||||
|
||||
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
|
||||
A: CORRECT
|
||||
B: INCORRECT
|
||||
Just return the letters "A" or "B", with no text around it.
|
||||
|
||||
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
|
||||
|
||||
|
||||
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
|
||||
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
|
||||
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
|
||||
|
||||
Judging the correctness of candidates' answers:
|
||||
""".strip()
|
||||
|
||||
omnimath_eval_cfg = dict(
|
||||
evaluator=dict(
|
||||
type=GenericLLMEvaluator,
|
||||
prompt_template=dict(
|
||||
type=PromptTemplate,
|
||||
template=dict(
|
||||
begin=[
|
||||
dict(
|
||||
role='SYSTEM',
|
||||
fallback_role='HUMAN',
|
||||
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
|
||||
],
|
||||
round=[
|
||||
dict(
|
||||
role='HUMAN',
|
||||
prompt = GRADER_TEMPLATE
|
||||
),
|
||||
]),
|
||||
),
|
||||
dataset_cfg=dict(
|
||||
type=OmniMathDataset,
|
||||
reader_cfg=omnimath_reader_cfg,
|
||||
),
|
||||
judge_cfg=dict(),
|
||||
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
|
||||
),
|
||||
)
|
||||
omnimath_datasets = [
|
||||
dict(
|
||||
type=OmniMathDataset,
|
||||
abbr='OmniMath',
|
||||
reader_cfg=omnimath_reader_cfg,
|
||||
infer_cfg=omnimath_infer_cfg,
|
||||
eval_cfg=omnimath_eval_cfg
|
||||
)
|
||||
]
|
@@ -9,3 +9,14 @@ categories = [
OlympiadBench_summary_groups = [
    {'name': 'OlympiadBench', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in categories]},
]

math_categories = [
    'OE_TO_maths_en_COMP', # OpenEnded - TextOnly - maths - COMP
    'OE_TO_maths_zh_COMP', # OpenEnded - TextOnly - maths - COMP
    'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
]


OlympiadBenchMath_summary_groups = [
    {'name': 'OlympiadBenchMath', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in math_categories]},
]
|
@@ -65,7 +65,7 @@ class TheoremQAEvaluatorV3(BaseEvaluator):
                {
                    # "question": question,
                    # "solution": output,
                    "correct": groundtruth,
                    # "correct": groundtruth,
                    "pred": answer,
                    "is_correct": is_correct,
                }
|
@@ -57,6 +57,7 @@ from .gpqa import * # noqa: F401, F403
from .gsm8k import * # noqa: F401, F403
from .gsm_hard import * # noqa: F401, F403
from .hellaswag import * # noqa: F401, F403
from .hle import * # noqa: F401, F403
from .huggingface import * # noqa: F401, F403
from .humaneval import * # noqa: F401, F403
from .humaneval_multi import * # noqa: F401, F403
|
@@ -197,11 +197,21 @@ class BigCodeBenchEvaluator(BaseEvaluator):
                break
            except (httpx.ReadTimeout, CancelledError):
                logger.info('Read timeout error. Retrying in 4s...')
                time.sleep(4)
                time.sleep(10)

        if 'pass@1' in pass_at_k.keys():
            pass_at_k['pass@1'] *= 100
        dump_results = {'details': results}
        dump_results = {'details': self._results_processor(results)}
        dump_results.update(pass_at_k)

        return dump_results

    def _results_processor(self, results):
        details = []
        for key, value in results['eval'].items():
            if value[0]['status'] == 'pass':
                value[0]['correct'] = True
            else:
                value[0]['correct'] = False
            details.append(value[0])
        return details
|
@@ -1,5 +1,7 @@
import re

from opencompass.utils import get_logger


def get_final_results(judged_answers,
                      references,
@@ -68,7 +70,13 @@ def generic_llmjudge_postprocess(
        processed_judge = _generic_llmjudge_postprocess(v['prediction'])
        if processed_judge is not None:
            judged_answers.append(processed_judge)
            references.append(v['gold'])
            try:
                references.append(v['gold'])

            except KeyError:
                get_logger().warning(
                    f'No gold answer for {k}, use empty string as reference!')
                references.append('')
    results = get_final_results(judged_answers, references, origial_responses)
    results['details'] = output
    return results
|
17
opencompass/datasets/hle.py
Normal file
@@ -0,0 +1,17 @@
from datasets import load_dataset

from opencompass.registry import LOAD_DATASET

from .base import BaseDataset


@LOAD_DATASET.register_module()
class HLEDataset(BaseDataset):

    @staticmethod
    def load(path: str):
        dataset = load_dataset(path)
        dataset['test'] = dataset['test'].filter(lambda x: x['image'] == '')
        dataset['test'] = dataset['test'].rename_column('question', 'problem')
        dataset['train'] = dataset['test']
        return dataset
|
@ -146,9 +146,12 @@ def evaluate_generations(
|
||||
with ProcessPoolExecutor(
|
||||
max_workers=1 if debug else num_process_evaluate) as executor:
|
||||
futures = {
|
||||
executor.submit(evaluate_generations_by_problem,
|
||||
problem_generations, sample, debug, timeout):
|
||||
index
|
||||
executor.submit(
|
||||
evaluate_generations_by_problem, # noqa: E501
|
||||
problem_generations,
|
||||
sample,
|
||||
debug,
|
||||
timeout): index
|
||||
for (problem_generations, sample, debug,
|
||||
timeout), index in inputs
|
||||
}
|
||||
@ -233,15 +236,27 @@ class LCBCodeGenerationEvaluator(BaseEvaluator):
|
||||
num_process_evaluate,
|
||||
timeout=6,
|
||||
release_version='release_v1',
|
||||
extractor_version='v1'):
|
||||
extractor_version='v1',
|
||||
start_date=None,
|
||||
end_date=None):
|
||||
super().__init__()
|
||||
self.num_process_evaluate = num_process_evaluate
|
||||
self.timeout = timeout
|
||||
self.dataset = LCBCodeGenerationDataset.load(
|
||||
release_version=release_version)['test']
|
||||
release_version=release_version,
|
||||
start_date=start_date,
|
||||
end_date=end_date)['test']
|
||||
self.extractor_version = extractor_version
|
||||
|
||||
def score(self, predictions, references):
|
||||
if len(predictions) != len(references):
|
||||
return {
|
||||
'error':
|
||||
'predictions and references have different '
|
||||
f'length. len(predictions): {len(predictions)}, '
|
||||
f'len(references): {len(references)}'
|
||||
}
|
||||
|
||||
if self.extractor_version == 'v1':
|
||||
predictions = [[extract_code_generation(item)]
|
||||
for item in predictions]
|
||||
@ -254,19 +269,28 @@ class LCBCodeGenerationEvaluator(BaseEvaluator):
|
||||
evaluation_samples[self.dataset[idx][
|
||||
'question_id']] = self.dataset[idx]['evaluation_sample']
|
||||
|
||||
references = [evaluation_samples[item] for item in references]
|
||||
filtered_predictions = []
|
||||
filtered_references = []
|
||||
for idx, item in enumerate(references):
|
||||
if item in self.dataset['question_id']:
|
||||
filtered_predictions.append(predictions[idx])
|
||||
filtered_references.append(item)
|
||||
|
||||
references = [{'input_output': item} for item in references]
|
||||
filtered_references = [
|
||||
evaluation_samples[item] for item in filtered_references
|
||||
] # noqa: E501
|
||||
|
||||
BaseEvaluator.is_num_equal(predictions, references)
|
||||
filtered_references = [{
|
||||
'input_output': item
|
||||
} for item in filtered_references] # noqa: E501
|
||||
|
||||
extracted_predictions = {}
|
||||
for idx, content in enumerate(predictions):
|
||||
for idx, content in enumerate(filtered_predictions):
|
||||
extracted_predictions[idx] = content
|
||||
|
||||
metrics, eval_results, final_metadata = codegen_metrics(
|
||||
references,
|
||||
predictions,
|
||||
filtered_references,
|
||||
filtered_predictions,
|
||||
k_list=[1],
|
||||
num_process_evaluate=self.num_process_evaluate,
|
||||
timeout=self.timeout,
|
||||
|
@@ -6,6 +6,7 @@ import json
import pickle
import zlib
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

from datasets import DatasetDict, load_dataset, load_from_disk
@@ -53,7 +54,9 @@ class LCBCodeGenerationDataset(BaseDataset):
    @staticmethod
    def load(path: str = 'opencompass/code_generation_lite',
             local_mode: bool = False,
             release_version: str = 'release_v1'):
             release_version: str = 'release_v1',
             start_date: str = None,
             end_date: str = None):

        def transform(item):
            # Define the dataitem mapping logic
@@ -61,7 +64,7 @@ class LCBCodeGenerationDataset(BaseDataset):
            # starter_code
            if item['starter_code']:
                format_prompt = f'### Format: {CodeGenerationPromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n' # noqa: E501
                format_prompt += f"```python\n{item['starter_code']}\n```\n\n"
                format_prompt += f"```python\n{item['starter_code']}\n```\n\n" # noqa: Q000, E501
            else:
                format_prompt = f'### Format: {CodeGenerationPromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n' # noqa: E501
                format_prompt += '```python\n# YOUR CODE HERE\n```\n\n'
@@ -107,6 +110,16 @@ class LCBCodeGenerationDataset(BaseDataset):

        dataset = dataset.map(transform)

        if start_date is not None:
            p_start_date = datetime.strptime(start_date, '%Y-%m-%d')
            dataset = dataset.filter(
                lambda e: p_start_date <= datetime.fromisoformat(e[
                    'contest_date'])) # noqa: E501
        if end_date is not None:
            p_end_date = datetime.strptime(end_date, '%Y-%m-%d')
            dataset = dataset.filter(lambda e: datetime.fromisoformat(e[
                'contest_date']) <= p_end_date) # noqa: E501

        return DatasetDict({'test': dataset, 'train': dataset})
|
||||
|
||||
|
@ -41,9 +41,8 @@ class LiveMathBenchDataset(BaseDataset):
|
||||
dataset = []
|
||||
dataset_info = {}
|
||||
|
||||
if path != '':
|
||||
path = get_data_path(path)
|
||||
path = os.path.join(path, version)
|
||||
# Use dataset mapping to generate path
|
||||
data_dir = get_data_path(path)
|
||||
|
||||
for split, language in product(dataset_splits, dataset_languages):
|
||||
dataset_info[f'{split}_{language}'] = {
|
||||
@ -59,8 +58,17 @@ class LiveMathBenchDataset(BaseDataset):
|
||||
'问答': 'problem-solving'
|
||||
}
|
||||
|
||||
if path != '':
|
||||
file_path = os.path.join(path, f'{split}_{language}.jsonl')
|
||||
examples = []
|
||||
if data_dir.startswith('opencompass/'):
|
||||
# Using HF Dataset
|
||||
hf_dataset = load_dataset(
|
||||
data_dir, f'v{version}_{split}_{language}')['test']
|
||||
for example in hf_dataset:
|
||||
examples.append(example)
|
||||
else:
|
||||
file_path = os.path.join(data_dir, version,
|
||||
f'{split}_{language}.jsonl')
|
||||
|
||||
if not os.path.exists(file_path):
|
||||
raise FileNotFoundError(
|
||||
f'File {file_path} does not exist, please check the '
|
||||
@ -69,13 +77,6 @@ class LiveMathBenchDataset(BaseDataset):
|
||||
with jsonlines.open(file_path, 'r') as file:
|
||||
for example in file:
|
||||
examples.append(example)
|
||||
else:
|
||||
hf_dataset = load_dataset(
|
||||
'opencompass/LiveMathBench',
|
||||
f'v{version}_{split}_{language}')['test']
|
||||
examples = []
|
||||
for example in hf_dataset:
|
||||
examples.append(example)
|
||||
|
||||
for example_idx, example in enumerate(examples):
|
||||
dataset_info[f'{split}_{language}'][
|
||||
|
@@ -130,6 +130,7 @@ class TurboMindModelwithChatTemplate(BaseModel):
        if self.fastchat_template:
            messages = _format_with_fast_chat_template(messages, self.fastchat_template)
        else:
            # NOTE: DeepSeek-R1 series model's chat template will add <think> after the
            messages = [self.tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False) for m in messages]
        # LMDeploy tokenize prompts by AutoTokenizer with its default parameter "add_special_token=True"
        # OC add bos_token in the prompt, which requires tokenizing prompts using "add_speicial_token=False"
|
@@ -2,8 +2,12 @@

import json

from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS


class TEvalEvaluator:

@ICL_EVALUATORS.register_module()
class TEvalEvaluator(BaseEvaluator):
    """This module contains the following evaluators for evaluating the
    capabilities of the various dimensions of the LLM.
|
||||
|
@@ -102,6 +102,7 @@ class BasePartitioner:
        return tasks

    def parse_model_dataset_args(self, cfg: ConfigDict):

        models = cfg['models']
        datasets = cfg['datasets']

@@ -109,7 +110,24 @@ class BasePartitioner:
        if 'model_dataset_combinations' in sig.parameters:
            combs = cfg.get('model_dataset_combinations', None)
            if combs is None:
                combs = [{'models': models, 'datasets': datasets}]
                if 'rs_exist_results' in cfg.keys():
                    rs_exist_results = cfg['rs_exist_results']
                    combs = []
                    for model in models:
                        comb = {'models': [model], 'datasets': datasets}
                        combs.append(comb)
                    for i in range(len(combs)):
                        combs[i]['datasets'] = [
                            dataset for dataset in combs[i]['datasets'] if [
                                model_abbr_from_cfg(combs[i]['models'][0]),
                                dataset_abbr_from_cfg(dataset)
                            ] not in rs_exist_results
                        ]
                    combs = [
                        comb for comb in combs if len(comb['datasets']) != 0
                    ]
                else:
                    combs = [{'models': models, 'datasets': datasets}]
        else:
            # sanity check
            model_abbrs = [model_abbr_from_cfg(model) for model in models]
|
@@ -14,4 +14,5 @@ from .model_postprocessors import * # noqa
from .network import * # noqa
from .postprocessors import * # noqa
from .prompt import * # noqa
from .result_station import * # noqa
from .text_postprocessors import * # noqa
|
@ -376,7 +376,7 @@ DATASETS_MAPPING = {
|
||||
"opencompass/LiveReasonBench": {
|
||||
"ms_id": "",
|
||||
"hf_id": "",
|
||||
"local": "./data/LiveReasonBench/",
|
||||
"local": "./data/LiveReasonBench/",
|
||||
},
|
||||
"opencompass/bigcodebench": {
|
||||
"ms_id": "",
|
||||
@ -412,251 +412,313 @@ DATASETS_MAPPING = {
|
||||
|
||||
DATASETS_URL = {
|
||||
"/OlympiadBench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/OlympiadBench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/OlympiadBench.zip",
|
||||
"md5": "97e8b1ae7f6170d94817288a8930ef00",
|
||||
},
|
||||
"/longbenchv2":{
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/longbenchv2.zip",
|
||||
"/longbenchv2": {
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/longbenchv2.zip",
|
||||
"md5": "09b7e06e6f98c5cca8ad597b3d7b42f0",
|
||||
},
|
||||
"/livestembench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/livestembench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/livestembench.zip",
|
||||
"md5": "0ff59d031c3dcff56a2e00e8c1489f5d",
|
||||
},
|
||||
"/musr": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
|
||||
"md5": "7447d2a5bec4586035196102135e2af9",
|
||||
},
|
||||
"/mmlu/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
|
||||
"md5": "761310671509a239e41c4b717f7fab9c",
|
||||
},
|
||||
"/mmmlu_lite": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmmlu_lite.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmmlu_lite.zip",
|
||||
"md5": "a776af1220e1826fd0608eda1bc4425e",
|
||||
},
|
||||
"/simpleqa": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/simpleqa.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/simpleqa.zip",
|
||||
"md5": "1d83fc2e15798d39cb265c9a3cb5195a",
|
||||
},
|
||||
"/chinese_simpleqa": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/chinese_simpleqa.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/chinese_simpleqa.zip",
|
||||
"md5": "4bdf854b291fc0ee29da57dc47ac47b5",
|
||||
},
|
||||
"/gpqa/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip",
|
||||
"md5": "2e9657959030a765916f1f2aca29140d",
|
||||
},
|
||||
"/CHARM/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/CHARM.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/CHARM.zip",
|
||||
"md5": "fdf51e955d1b8e0bb35bc1997eaf37cb",
|
||||
},
|
||||
"/ifeval/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ifeval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ifeval.zip",
|
||||
"md5": "64d98b6f36b42e7390c9cef76cace75f",
|
||||
},
|
||||
"/mbpp/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mbpp.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mbpp.zip",
|
||||
"md5": "777739c90f04bce44096a5bc96c8f9e5",
|
||||
},
|
||||
"/cmmlu/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmmlu.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmmlu.zip",
|
||||
"md5": "a59f4003d6918509a719ce3bc2a5d5bc",
|
||||
},
|
||||
"/math/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip",
|
||||
"md5": "cb5b4c8378085929e20345174e731fdf",
|
||||
},
|
||||
"/hellaswag/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/hellaswag.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/hellaswag.zip",
|
||||
"md5": "2b700a02ffb58571c7df8d8d0619256f",
|
||||
},
|
||||
"/BBH/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/BBH.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/BBH.zip",
|
||||
"md5": "60c49f9bef5148aa7e1941328e96a554",
|
||||
},
|
||||
"/compass_arena/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/compass_arena.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/compass_arena.zip",
|
||||
"md5": "cd59b54a179d16f2a858b359b60588f6",
|
||||
},
|
||||
"/TheoremQA/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/TheoremQA.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/TheoremQA.zip",
|
||||
"md5": "f2793b07bc26510d507aa710d9bd8622",
|
||||
},
|
||||
"/mathbench_v1/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mathbench_v1.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mathbench_v1.zip",
|
||||
"md5": "50257a910ca43d1f61a610a79fdb16b5",
|
||||
},
|
||||
"/gsm8k/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip",
|
||||
"md5": "901e5dc93a2889789a469da9850cdca8",
|
||||
},
|
||||
"/LCBench2023/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LCBench2023.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LCBench2023.zip",
|
||||
"md5": "e1a38c94a42ad1809e9e0650476a9306",
|
||||
},
|
||||
"/humaneval/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humaneval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humaneval.zip",
|
||||
"md5": "88b1b89dc47b7121c81da6bcd85a69c3",
|
||||
},
|
||||
"/humanevalx": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humanevalx.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humanevalx.zip",
|
||||
"md5": "22930355c03fb73fb5bae14b50f1deb9",
|
||||
},
|
||||
"/ds1000_data": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ds1000_data.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ds1000_data.zip",
|
||||
"md5": "1a4990aec04a2fd73ccfad12e2d43b43",
|
||||
},
|
||||
"/drop_simple_eval/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/drop_simple_eval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/drop_simple_eval.zip",
|
||||
"md5": "c912afe5b4a63509851cf16e6b91830e",
|
||||
},
|
||||
"subjective/alignment_bench/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alignment_bench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alignment_bench.zip",
|
||||
"md5": "d8ae9a0398526479dbbcdb80fafabceb",
|
||||
},
|
||||
"subjective/alpaca_eval": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alpaca_eval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alpaca_eval.zip",
|
||||
"md5": "d7399d63cb46c82f089447160ef49b6a",
|
||||
},
|
||||
"subjective/arena_hard": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arena_hard.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arena_hard.zip",
|
||||
"md5": "02cd09a482cb0f0cd9d2c2afe7a1697f",
|
||||
},
|
||||
"subjective/mtbench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench.zip",
|
||||
"md5": "d1afc0787aeac7f1f24872742e161069",
|
||||
},
|
||||
"subjective/fofo": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/fofo.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/fofo.zip",
|
||||
"md5": "8a302712e425e27e4292a9369df5b9d3",
|
||||
},
|
||||
"subjective/followbench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/followbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/followbench.zip",
|
||||
"md5": "da7a831817c969da15d1e78d4a245d8a",
|
||||
},
|
||||
"subjective/mtbench101": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench101.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench101.zip",
|
||||
"md5": "5d80257bc9929ebe5cfbf6d11184b04c",
|
||||
},
|
||||
"subjective/WildBench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/wildbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/wildbench.zip",
|
||||
"md5": "b06252857f1f8f44a17b1bfca4888ff4",
|
||||
},
|
||||
"/ruler/": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ruler.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ruler.zip",
|
||||
"md5": "c60bdfff3d02358067104cc1dea7c0f7",
|
||||
},
|
||||
"/scicode": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/scicode.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/scicode.zip",
|
||||
"md5": "9c6c64b8c70edc418f713419ea39989c",
|
||||
},
|
||||
"/commonsenseqa": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/commonsenseqa.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/commonsenseqa.zip",
|
||||
"md5": "c4a82fc07c81ae1462605f5d7fd2bb2e",
|
||||
},
|
||||
"FewCLUE": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/FewCLUE.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/FewCLUE.zip",
|
||||
"md5": "7976e2bb0e9d885ffd3c55f7c5d4021e",
|
||||
},
|
||||
"/race": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/race.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/race.zip",
|
||||
"md5": "b758251764a264746cf45749c02363f9",
|
||||
},
|
||||
"/ARC": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ARC.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ARC.zip",
|
||||
"md5": "d720629b69f1a51cfe78bf65b00b44f6",
|
||||
},
|
||||
"/SuperGLUE": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SuperGLUE.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SuperGLUE.zip",
|
||||
"md5": "b60904915b0b61d1a04ea52280169936",
|
||||
},
|
||||
"SQuAD2.0": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SQuAD2.0.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SQuAD2.0.zip",
|
||||
"md5": "1321cbf9349e1102a57d31d1b2bfdd7e",
|
||||
},
|
||||
"mmlu_pro": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
|
||||
"md5": "e3200c7380f4cea5f13c768f2815fabb",
|
||||
},
|
||||
"/Longbench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
|
||||
"md5": "ab0cb9e520ae5cfb899bf38b564249bb",
|
||||
},
|
||||
"/needlebench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/needlebench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/needlebench.zip",
|
||||
"md5": "dad5c903ebfea16eaf186b8997aeedad",
|
||||
},
|
||||
"/teval": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/teval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/teval.zip",
|
||||
"md5": "7628ab5891a26bf96ca17becfd044867",
|
||||
},
|
||||
"/code_generation_lite": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/code_generation_lite.zip",
|
||||
"md5": "60103a18ca63b05ea06e98d24170f23d",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/code_generation_lite.zip",
|
||||
"md5": "ebcf8db56f5c817ca8202a542be30cb4",
|
||||
},
|
||||
"/execution-v2": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/execution-v2.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/execution-v2.zip",
|
||||
"md5": "019ef1a0686ee6ca34f51c8af104fcd9",
|
||||
},
|
||||
"/test_generation": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
|
||||
"md5": "918a6ea2b1eee6f2b1314db3c21cb4c7",
|
||||
},
|
||||
"/aime": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
|
||||
"md5": "fbe2d0577fc210962a549f8cea1a00c8",
|
||||
},
|
||||
"/cmo": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
|
||||
"md5": "fad52c81290506a8ca74f46b5400d8fc",
|
||||
},
|
||||
},
|
||||
"/nq-open": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/nq-open.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/nq-open.zip",
|
||||
"md5": "a340521e5c9ec591227dcb367f718b25",
|
||||
},
|
||||
"/winogrande": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/winogrande.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/winogrande.zip",
|
||||
"md5": "9e949a75eacc26ed4fd2b9aa870b495b",
|
||||
},
|
||||
"/triviaqa": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/triviaqa.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/triviaqa.zip",
|
||||
"md5": "e6a118d744236814926b2ec7ec66c034",
|
||||
},
|
||||
"/GAOKAO-BENCH": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/GAOKAO-BENCH.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/GAOKAO-BENCH.zip",
|
||||
"md5": "ba3c71b8b9db96d2a0664b977c4f9784",
|
||||
},
|
||||
"/WikiBench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/WikiBench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/WikiBench.zip",
|
||||
"md5": "6dac1d1a3133fe1effff185cbf71d928",
|
||||
},
|
||||
"/babilong": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/babilong.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/babilong.zip",
|
||||
"md5": "e400864c31bc58d29eaa3e199751f99b",
|
||||
},
|
||||
"/korbench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
|
||||
"md5": "9107597d137e7362eaf7d218ddef7a6d",
|
||||
},
|
||||
"subjective/judgerbench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",
|
||||
"md5": "60d605883aa8cac9755819140ab42c6b"
|
||||
},
|
||||
"/arc_prize_public_evaluation": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arc_prize_public_evaluation.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arc_prize_public_evaluation.zip",
|
||||
"md5": "367a33977651496efddba7670009807e"
|
||||
},
|
||||
"P-MMEval": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/pmmeval.zip",
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/pmmeval.zip",
|
||||
"md5": "09e401e6229a50647b9e13c429e634d1",
|
||||
},
|
||||
"LiveMathBench": {
|
||||
'url': "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LiveMathBench.zip",
|
||||
'url':
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LiveMathBench.zip",
|
||||
"md5": "d0781f9185c9bb50e81e6e3ca8c59013",
|
||||
},
|
||||
"bigcodebench": {
|
||||
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bigcodebench.zip",
|
||||
"md5": "2c1c7956ca49a1124617e8c037ec57d8"
|
||||
"url":
|
||||
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bigcodebench.zip",
|
||||
"md5": "270f399f4142b74f47ecff116cc3b21d"
|
||||
}
|
||||
}
|
||||
|
417
opencompass/utils/result_station.py
Normal file
@ -0,0 +1,417 @@
|
||||
import json
|
||||
import os
|
||||
import os.path as osp
|
||||
import re
|
||||
|
||||
from opencompass.utils.abbr import (dataset_abbr_from_cfg,
|
||||
deal_with_judge_model_abbr,
|
||||
model_abbr_from_cfg)
|
||||
|
||||
|
||||
def save_to_station(cfg, args):
|
||||
|
||||
if args.station_path is not None:
|
||||
station_path = args.station_path
|
||||
else:
|
||||
station_path = cfg.get('station_path')
|
||||
|
||||
work_dict = cfg['work_dir']
|
||||
|
||||
# objective dataset processing
|
||||
if 'judge_models' not in cfg.keys():
|
||||
model_list = [model_abbr_from_cfg(model) for model in cfg['models']]
|
||||
dataset_list = [
|
||||
dataset_abbr_from_cfg(dataset) for dataset in cfg['datasets']
|
||||
]
|
||||
|
||||
rs_exist_results = []
|
||||
if 'rs_exist_results' in cfg.keys():
|
||||
rs_exist_results = cfg['rs_exist_results']
|
||||
|
||||
for dataset in dataset_list:
|
||||
result_path = osp.join(station_path, dataset)
|
||||
if not osp.exists(result_path):
|
||||
os.makedirs(result_path)
|
||||
|
||||
for model in model_list:
|
||||
if ([model, dataset] in rs_exist_results
|
||||
and not args.station_overwrite):
|
||||
continue
|
||||
result_file_name = model + '.json'
|
||||
if osp.exists(osp.join(
|
||||
result_path,
|
||||
result_file_name)) and not args.station_overwrite:
|
||||
print('result of {} with {} already exists'.format(
|
||||
dataset, model))
|
||||
continue
|
||||
else:
|
||||
# get result dict
|
||||
local_result_path = osp.join(work_dict, 'results', model)
|
||||
local_result_json = osp.join(local_result_path,
|
||||
dataset + '.json')
|
||||
if not osp.exists(local_result_json):
|
||||
if args.mode == 'viz':
|
||||
continue
|
||||
raise ValueError(
|
||||
'invalid file: {}'.format(local_result_json))
|
||||
with open(local_result_json, 'r') as f:
|
||||
this_result = json.load(f)
|
||||
f.close()
|
||||
|
||||
# get prediction list
|
||||
local_prediction_path = osp.join(work_dict, 'predictions',
|
||||
model)
|
||||
local_prediction_regex = \
|
||||
rf'^{re.escape(dataset)}(?:_\d+)?\.json$'
|
||||
local_prediction_json = find_files_by_regex(
|
||||
local_prediction_path, local_prediction_regex)
|
||||
if not check_filenames(
|
||||
dataset,
|
||||
local_prediction_json) and args.mode != 'viz':
|
||||
raise ValueError('invalid filelist: {}'.format(
|
||||
local_prediction_json))
|
||||
|
||||
this_prediction = []
|
||||
for prediction_json in local_prediction_json:
|
||||
with open(
|
||||
osp.join(local_prediction_path,
|
||||
prediction_json), 'r') as f:
|
||||
this_prediction_load_json = json.load(f)
|
||||
f.close()
|
||||
for prekey in this_prediction_load_json.keys():
|
||||
this_prediction.append(
|
||||
this_prediction_load_json[prekey])
|
||||
|
||||
# get config dict
|
||||
model_cfg = [
|
||||
i for i in cfg['models']
|
||||
if model_abbr_from_cfg(i) == model
|
||||
][0]
|
||||
dataset_cfg = [
|
||||
i for i in cfg['datasets']
|
||||
if dataset_abbr_from_cfg(i) == dataset
|
||||
][0]
|
||||
this_cfg = {'models': model_cfg, 'datasets': dataset_cfg}
|
||||
|
||||
# dict combine
|
||||
data_model_results = {
|
||||
'predictions': this_prediction,
|
||||
'results': this_result,
|
||||
'cfg': this_cfg
|
||||
}
|
||||
with open(osp.join(result_path, result_file_name),
|
||||
'w') as f:
|
||||
json.dump(data_model_results,
|
||||
f,
|
||||
ensure_ascii=False,
|
||||
indent=4)
|
||||
f.close()
|
||||
print(
|
||||
'successfully save result of {} with {} to the station'
|
||||
.format(dataset, model))
|
||||
return True
|
||||
|
||||
# subjective processing
|
||||
else:
|
||||
model_list = [model for model in cfg['models']]
|
||||
judge_list = [judge_model for judge_model in cfg['judge_models']]
|
||||
model_pair_list = [[
|
||||
deal_with_judge_model_abbr(model, judge_model)
|
||||
for judge_model in judge_list
|
||||
] for model in model_list]
|
||||
|
||||
dataset_list = [[
|
||||
dataset_abbr_from_cfg(dataset),
|
||||
[dataset_abbr_from_cfg(base) for base in dataset['base_models']]
|
||||
] if 'base_models' in dataset.keys() else
|
||||
[dataset_abbr_from_cfg(dataset), ['']]
|
||||
for dataset in cfg['datasets']]
|
||||
|
||||
rs_exist_results = []
|
||||
if 'rs_exist_results' in cfg.keys():
|
||||
rs_exist_results = cfg['rs_exist_results']
|
||||
|
||||
for pair_of_dataset_and_base in dataset_list:
|
||||
dataset, base_list = pair_of_dataset_and_base[
|
||||
0], pair_of_dataset_and_base[1]
|
||||
|
||||
result_path = osp.join(station_path, dataset)
|
||||
if not osp.exists(result_path):
|
||||
os.makedirs(result_path)
|
||||
|
||||
for base_model in base_list:
|
||||
base_model_name = base_model
|
||||
if base_model_name != '':
|
||||
base_model_name += '_'
|
||||
for model_pair_sub_list in model_pair_list:
|
||||
for model_pair in model_pair_sub_list:
|
||||
model = model_abbr_from_cfg(model_pair[0])
|
||||
model_result = model_abbr_from_cfg(model_pair)
|
||||
if ([model, dataset] in rs_exist_results
|
||||
and not args.station_overwrite):
|
||||
continue
|
||||
result_file_name = (base_model_name + model_result +
|
||||
'.json')
|
||||
if osp.exists(osp.join(result_path, result_file_name)
|
||||
) and not args.station_overwrite:
|
||||
print('{} at {} already exists'.format(
|
||||
result_file_name, result_path))
|
||||
continue
|
||||
else:
|
||||
# get result dict
|
||||
local_result_path = osp.join(
|
||||
work_dict, 'results',
|
||||
base_model_name + model_result)
|
||||
local_result_json = osp.join(
|
||||
local_result_path, dataset + '.json')
|
||||
if not osp.exists(local_result_json):
|
||||
if args.mode == 'viz':
|
||||
continue
|
||||
raise ValueError('invalid file: {}'.format(
|
||||
local_result_json))
|
||||
with open(local_result_json, 'r') as f:
|
||||
this_result = json.load(f)
|
||||
f.close()
|
||||
|
||||
# get prediction list
|
||||
local_prediction_path = osp.join(
|
||||
work_dict, 'predictions', model)
|
||||
local_prediction_regex = \
|
||||
rf'^{re.escape(dataset)}(?:_\d+)?\.json$'
|
||||
local_prediction_json = find_files_by_regex(
|
||||
local_prediction_path, local_prediction_regex)
|
||||
if not check_filenames(dataset,
|
||||
local_prediction_json
|
||||
) and args.mode != 'viz':
|
||||
raise ValueError('invalid filelist: {}'.format(
|
||||
local_prediction_json))
|
||||
|
||||
this_prediction = []
|
||||
for prediction_json in local_prediction_json:
|
||||
with open(
|
||||
osp.join(local_prediction_path,
|
||||
prediction_json), 'r') as f:
|
||||
this_prediction_load_json = json.load(f)
|
||||
f.close()
|
||||
for prekey in this_prediction_load_json.keys():
|
||||
this_prediction.append(
|
||||
this_prediction_load_json[prekey])
|
||||
|
||||
# get config dict
|
||||
model_cfg = [
|
||||
i for i in cfg['models']
|
||||
if model_abbr_from_cfg(i) == model
|
||||
][0]
|
||||
dataset_cfg = [
|
||||
i for i in cfg['datasets']
|
||||
if dataset_abbr_from_cfg(i) == dataset
|
||||
][0]
|
||||
judge_model_cfg = [
|
||||
i for i in cfg['judge_models']
|
||||
if 'judged-by--' + model_abbr_from_cfg(i) ==
|
||||
model_abbr_from_cfg(model_pair[1])
|
||||
]
|
||||
|
||||
this_cfg = {
|
||||
'models': model_cfg,
|
||||
'datasets': dataset_cfg,
|
||||
'judge_models': judge_model_cfg
|
||||
}
|
||||
|
||||
# dict combine
|
||||
data_model_results = {
|
||||
'predictions': this_prediction,
|
||||
'results': this_result,
|
||||
'cfg': this_cfg
|
||||
}
|
||||
|
||||
with open(osp.join(result_path, result_file_name),
|
||||
'w') as f:
|
||||
json.dump(data_model_results,
|
||||
f,
|
||||
ensure_ascii=False,
|
||||
indent=4)
|
||||
f.close()
|
||||
print('successfully saved result: {} at {} to the '
      'station'.format(result_file_name, result_path))
|
||||
return True
|
||||
|
||||
|
||||
def read_from_station(cfg, args):
|
||||
|
||||
assert args.station_path is not None or cfg.get('station_path') is not None
|
||||
if args.station_path is not None:
|
||||
station_path = args.station_path
|
||||
else:
|
||||
station_path = cfg.get('station_path')
|
||||
|
||||
# objective check
|
||||
if 'judge_models' not in cfg.keys():
|
||||
model_list = [model_abbr_from_cfg(model) for model in cfg['models']]
|
||||
dataset_list = [
|
||||
dataset_abbr_from_cfg(dataset) for dataset in cfg['datasets']
|
||||
]
|
||||
|
||||
existing_results_list = []
|
||||
result_local_path = osp.join(cfg['work_dir'], 'results')
|
||||
if not osp.exists(result_local_path):
|
||||
os.makedirs(result_local_path)
|
||||
|
||||
for dataset in dataset_list:
|
||||
for model in model_list:
|
||||
result_file_path = osp.join(station_path, dataset,
|
||||
model + '.json')
|
||||
if not osp.exists(result_file_path):
|
||||
print('do not find result file: {} with {} at station'.
|
||||
format(model, dataset))
|
||||
continue
|
||||
else:
|
||||
print('find result file: {} with {} at station'.format(
|
||||
model, dataset))
|
||||
with open(result_file_path, 'r') as f:
|
||||
download_json = json.load(f)
|
||||
f.close()
|
||||
existing_results_list.append({
|
||||
'combination': [model, dataset],
|
||||
'file':
|
||||
download_json
|
||||
})
|
||||
|
||||
# save results to local
|
||||
for i in existing_results_list:
|
||||
this_result = i['file']['results']
|
||||
this_result_local_path = osp.join(result_local_path,
|
||||
i['combination'][0])
|
||||
if not osp.exists(this_result_local_path):
|
||||
os.makedirs(this_result_local_path)
|
||||
this_result_local_file_path = osp.join(
|
||||
this_result_local_path, i['combination'][1] + '.json')
|
||||
if osp.exists(this_result_local_file_path):
|
||||
continue
|
||||
with open(this_result_local_file_path, 'w') as f:
|
||||
json.dump(this_result, f, ensure_ascii=False, indent=4)
|
||||
f.close()
|
||||
|
||||
return existing_results_list
|
||||
|
||||
# subjective check
|
||||
else:
|
||||
model_list = [model for model in cfg['models']]
|
||||
judge_list = [judge_model for judge_model in cfg['judge_models']]
|
||||
model_pair_list = [[
|
||||
deal_with_judge_model_abbr(model, judge_model)
|
||||
for judge_model in judge_list
|
||||
] for model in model_list]
|
||||
|
||||
dataset_list = [[
|
||||
dataset_abbr_from_cfg(dataset),
|
||||
[dataset_abbr_from_cfg(base) for base in dataset['base_models']]
|
||||
] if 'base_models' in dataset.keys() else
|
||||
[dataset_abbr_from_cfg(dataset), ['']]
|
||||
for dataset in cfg['datasets']]
|
||||
|
||||
existing_results_list = []
|
||||
result_local_path = osp.join(cfg['work_dir'], 'results')
|
||||
if not osp.exists(result_local_path):
|
||||
os.makedirs(result_local_path)
|
||||
|
||||
for pair_of_dataset_and_base in dataset_list:
|
||||
dataset, base_list = pair_of_dataset_and_base[
|
||||
0], pair_of_dataset_and_base[1]
|
||||
|
||||
for model_pair_sub_list in model_pair_list:
|
||||
result_file_path_list_origin = []
|
||||
for model_pair in model_pair_sub_list:
|
||||
model_result = model_abbr_from_cfg(model_pair)
|
||||
for base_model in base_list:
|
||||
base_model_name = base_model
|
||||
if base_model_name != '':
|
||||
base_model_name += '_'
|
||||
|
||||
result_file_path_list_origin.append(
|
||||
osp.join(station_path, dataset,
|
||||
base_model_name + model_result + '.json'))
|
||||
|
||||
result_file_path_list = [
|
||||
result_file_path
|
||||
for result_file_path in result_file_path_list_origin
|
||||
if osp.exists(result_file_path)
|
||||
]
|
||||
model = model_abbr_from_cfg(model_pair_sub_list[0][0])
|
||||
|
||||
# save all parts of results to local
|
||||
for result_file_path in result_file_path_list:
|
||||
with open(result_file_path, 'r') as f:
|
||||
this_result = json.load(f)['results']
|
||||
f.close()
|
||||
this_result_local_path = osp.join(
|
||||
result_local_path,
|
||||
osp.splitext(osp.basename(result_file_path))[0])
|
||||
if not osp.exists(this_result_local_path):
|
||||
os.makedirs(this_result_local_path)
|
||||
this_result_local_file_path = osp.join(
|
||||
this_result_local_path, dataset + '.json')
|
||||
if osp.exists(this_result_local_file_path):
|
||||
continue
|
||||
with open(this_result_local_file_path, 'w') as f:
|
||||
json.dump(this_result, f, ensure_ascii=False, indent=4)
|
||||
f.close()
|
||||
|
||||
# check whether complete
|
||||
if len(result_file_path_list) == len(
|
||||
result_file_path_list_origin):
|
||||
print('find complete results of {} with {} at station'.
|
||||
format(model, dataset))
|
||||
existing_results_list.append({
|
||||
'combination': [model, dataset],
|
||||
'file':
|
||||
result_file_path_list
|
||||
})
|
||||
else:
|
||||
print('results of {} with {} at station is not complete'.
|
||||
format(model, dataset))
|
||||
|
||||
return existing_results_list
|
||||
|
||||
|
||||
def find_files_by_regex(directory, pattern):
|
||||
|
||||
regex = re.compile(pattern)
|
||||
|
||||
matched_files = []
|
||||
for filename in os.listdir(directory):
|
||||
if regex.match(filename):
|
||||
matched_files.append(filename)
|
||||
|
||||
return matched_files
|
||||
|
||||
|
||||
def check_filenames(x, filenames):
|
||||
|
||||
if not filenames:
|
||||
return False
|
||||
|
||||
single_pattern = re.compile(rf'^{re.escape(x)}\.json$')
|
||||
numbered_pattern = re.compile(rf'^{re.escape(x)}_(\d+)\.json$')
|
||||
|
||||
is_single = all(single_pattern.match(name) for name in filenames)
|
||||
is_numbered = all(numbered_pattern.match(name) for name in filenames)
|
||||
|
||||
if not (is_single or is_numbered):
|
||||
return False
|
||||
|
||||
if is_single:
|
||||
return len(filenames) == 1
|
||||
|
||||
if is_numbered:
|
||||
numbers = []
|
||||
for name in filenames:
|
||||
match = numbered_pattern.match(name)
|
||||
if match:
|
||||
numbers.append(int(match.group(1)))
|
||||
|
||||
if sorted(numbers) != list(range(len(numbers))):
|
||||
return False
|
||||
|
||||
return True
|
@@ -37,7 +37,7 @@ rouge_score
sacrebleu
scikit_learn==1.5.0
seaborn
sentence_transformers==2.2.2
sentence_transformers
tabulate
tiktoken
timeout_decorator
|