kangreen0210 2025-03-07 16:30:36 +00:00
commit 89bbf13f5a
42 changed files with 2075 additions and 236 deletions

View File

@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.

View File

@ -57,6 +57,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series of models; please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves the best performance among models of comparable size on reasoning and knowledge-intensive tasks. Feel free to try it out.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](configs/eval_academic_leaderboard_202412.py) leaderboard, which allows users to easily reproduce the official evaluation results with a simple configuration.

View File

@ -399,6 +399,11 @@
category: Math
paper: https://proceedings.mlr.press/v202/gao23f/gao23f.pdf
configpath: opencompass/configs/datasets/gsm_hard
- hle:
name: HLE (Humanity's Last Exam)
category: Reasoning
paper: https://lastexam.ai/paper
configpath: opencompass/configs/datasets/HLE
- hellaswag:
name: HellaSwag
category: Reasoning

View File

@ -0,0 +1,65 @@
# Evaluation Results Persistence
## Introduction
By default, OpenCompass saves evaluation results to your work directory. In some cases, however, you may want to share results with other users or quickly browse existing public evaluation results. For this purpose, we provide an interface that transfers evaluation results to an external public data station, along with functions for uploading, overwriting, and reading them.
## Quick Start
### Uploading
You can store evaluation results in a path of your choice, either by adding an argument to the evaluation command or by adding a configuration entry to the eval script. Here are two examples:
(Approach 1) Add the `-sp`/`--station-path` option to the command and specify your public path.
```bash
opencompass ... -sp '/your_path'
```
(Approach 2) Add the configuration to the eval script.
```python
station_path = '/your_path'
```
### Overwriting
Before uploading, the storage method above checks whether results for the same task already exist in the data station, based on the `abbr` attributes of the model and dataset configurations. If results already exist, the upload is skipped. If you need to update these results, add the `--station-overwrite` option to the command, as in this example:
```bash
opencompass ... -sp '/your_path' --station-overwrite
```
### Reading
You can read existing results directly from the data station to avoid re-running evaluation tasks. The results that are read participate directly in the `summarize` step. With this configuration, only tasks whose results are not stored in the data station will be launched. Here is an example:
```bash
opencompass ... -sp '/your_path' --read-from-station
```
### Command Combination
1. Upload only the results under your latest working directory to the data station, without running the tasks whose results are missing:
```bash
opencompass ... -sp '/your_path' -r latest -m viz
```
## Storage Format of the Data Station
In the data station, the evaluation results of each `model-dataset` pair are stored as a `json` file. The directory layout is `/your_path/dataset_name/model_name.json`. Each `json` file stores a dictionary with the corresponding results, including `predictions`, `results`, and `cfg`, for example:
```python
Result = {
'predictions': List[Dict],
'results': Dict,
'cfg': Dict = {
'models': Dict,
'datasets': Dict,
        'judge_models': Dict  # subjective datasets only
}
}
```
Among these three keys, `predictions` records the model's prediction for each item in the dataset, `results` records the model's overall score on the dataset, and `cfg` records the detailed configurations of the model and the dataset for this evaluation task.
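A minimal sketch of how stored results could be loaded and inspected, assuming the layout described above (the station path and the dataset/model names below are hypothetical placeholders):
```python
import json
from pathlib import Path

# Hypothetical station path and dataset/model abbrs; replace with your own.
station_path = Path('/your_path')
result_file = station_path / 'aime2024' / 'deepseek-r1-distill-qwen-7b-turbomind.json'

with result_file.open() as f:
    result = json.load(f)

print(result['results'])           # overall scores of the model on this dataset
print(len(result['predictions']))  # number of per-sample predictions
print(result['cfg']['models'])     # model config used in this evaluation task
```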

View File

@ -39,6 +39,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
user_guides/evaluation.md
user_guides/experimentation.md
user_guides/metrics.md
user_guides/deepseek_r1.md
.. _Prompt:
.. toctree::
@ -66,6 +67,7 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/subjective_evaluation.md
advanced_guides/persistence.md
.. _Tools:
.. toctree::

View File

@ -0,0 +1,192 @@
# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for the DeepSeek-R1 series of reasoning models on mathematical datasets.
- At the model level, we recommend using the sampling approach to reduce repetitions caused by greedy decoding
- For datasets with limited samples, we employ multiple evaluation runs and take the average
- For answer validation, we utilize LLM-based verification to reduce misjudgments from rule-based evaluation
## Installation and Preparation
Please follow OpenCompass's installation guide.
## Evaluation Configuration Setup
We provide an example configuration in `examples/eval_deepseek_r1.py`. The configuration is explained below.
### Configuration Interpretation
#### 1. Dataset and Validator Configuration
```python
# Configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
# LLM validator configuration. Users need to deploy API services via LMDeploy/vLLM/SGLang or use OpenAI-compatible endpoints
verifier_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-32B-Instruct', # Replace with actual path
key='YOUR_API_KEY', # Use real API key
openai_api_base=['http://your-api-endpoint'], # Replace with API endpoint
query_per_second=16,
batch_size=1024,
temperature=0.001,
max_out_len=16384
)
# Apply validator to all datasets
for item in datasets:
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
#### 2. Model Configuration
We provide an evaluation example that uses LMDeploy as the inference backend; users can modify `path` (i.e., the HuggingFace model path) as needed.
```python
# LMDeploy model configuration example
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-7b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768
),
max_seq_len=32768,
batch_size=64,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
# Extendable 14B/32B configurations...
]
```
#### 3. Evaluation Process Configuration
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)
```
#### 4. Summary Configuration
```python
# Multiple runs results average configuration
summary_groups = [
{
'name': 'AIME2024-Aveage8',
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
},
# Other dataset average configurations...
]
summarizer = dict(
dataset_abbrs=[
['AIME2024-Aveage8', 'naive_average'],
# Other dataset metrics...
],
summary_groups=summary_groups
)
# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
```
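Conceptually, the `naive_average` reported for the `AIME2024-Aveage8` group is simply the unweighted mean of the eight per-run accuracies. A minimal sketch of what this summary group computes (the per-run scores below are made up for illustration):
```python
# Hypothetical accuracies for aime2024-run0 ... aime2024-run7
run_scores = [53.3, 56.7, 60.0, 56.7, 53.3, 56.7, 60.0, 53.3]

# The summary group collapses the eight runs into a single averaged metric
aime2024_average8 = sum(run_scores) / len(run_scores)
print(f'AIME2024-Aveage8 naive_average: {aime2024_average8:.2f}')  # 56.25
```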
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using a total of 1 GPU
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs will be output in the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using a total of 8 GPUs
You need to modify the `infer` configuration in the config file and set `num_worker` to 8:
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` parameter from the evaluation command:
```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass uses multithreading to start `$num_worker` tasks. Detailed logs are not printed to the command line; instead, they are written under `$work_dir`.
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using a total of 8 GPUs
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if you use an inference backend, its parallelism parameters need to be changed accordingly, e.g. `tp=2` in LMDeploy), and `num_worker` in the `infer` configuration needs to be set to 4:
```python
models += [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-14b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=128,
run_cfg=dict(num_gpus=2),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
]
```
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
### Evaluation Results
The evaluation results are displayed as follows:
```bash
dataset           version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
----------------  ---------  -------------  ------  ---------------------------------------
MATH              -          -              -
AIME2024-Aveage8  -          naive_average  gen     56.25
```
## Performance Baseline
Since the model decodes with sampling and the AIME dataset is small, a fluctuation of 1-3 points may still occur even when averaging over 8 evaluation runs (see the back-of-envelope sketch after the table).
| Model | Dataset | Metric | Value |
| ---------------------------- | -------- | -------- | ----- |
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |
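As a rough, back-of-envelope illustration of why this happens (an assumption-laden sketch, not part of the official baseline): if every (problem, run) pair on AIME2024's 30 problems were an independent Bernoulli trial with success probability near the observed accuracy, the standard error of the 8-run average alone would already be a few points, and correlation between runs on the same problems only makes the real fluctuation larger.
```python
import math

# Back-of-envelope estimate under a strong independence assumption:
# treat every (problem, run) pair as an independent Bernoulli trial.
n_problems = 30   # AIME 2024 has 30 problems
n_runs = 8        # the tutorial averages over 8 sampled runs
p = 0.563         # observed accuracy of DeepSeek-R1-Distill-Qwen-7B

std_err = 100 * math.sqrt(p * (1 - p) / (n_problems * n_runs))
print(f'approximate standard error of the averaged score: {std_err:.1f} points')
# ~3.2 points, so swings of a few points between evaluations are expected
```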

View File

@ -0,0 +1,65 @@
# Evaluation Results Persistence
## Introduction
By default, OpenCompass saves evaluation results to your work directory. In some cases, however, you may want to share results with other users or quickly browse existing public evaluation results. For this purpose, we provide an interface that transfers evaluation results to an external public data station, along with functions for uploading, overwriting, and reading them.
## Quick Start
### Uploading
You can store evaluation results in a path of your choice, either by adding an argument to the CLI evaluation command or by adding a configuration entry to the eval script. Here are two examples:
(Approach 1) Add the `-sp`/`--station-path` option to the command and specify your public path.
```bash
opencompass ... -sp '/your_path'
```
(Approach 2) Add the configuration to the eval script.
```python
station_path = '/your_path'
```
### Overwriting
Before uploading, the storage method above checks whether results for the same task already exist in the data station, based on the `abbr` attributes of the model and dataset configurations. If results already exist, the upload is skipped. If you need to update these results, add the `--station-overwrite` option to the command, as in this example:
```bash
opencompass ... -sp '/your_path' --station-overwrite
```
### Reading
You can read existing results directly from the data station to avoid re-running evaluation tasks. The results that are read participate directly in the `summarize` step. With this configuration, only tasks whose results are not stored in the data station will be launched. Here is an example:
```bash
opencompass ... -sp '/your_path' --read-from-station
```
### Command Combination
1. Upload only the results under your latest working directory to the data station, without running the tasks whose results are missing:
```bash
opencompass ... -sp '/your_path' -r latest -m viz
```
## Storage Format of the Data Station
In the data station, the evaluation results of each `model-dataset` pair are stored as a `json` file. The directory layout is `/your_path/dataset_name/model_name.json`. Each `json` file stores a dictionary with the corresponding results, including `predictions`, `results`, and `cfg`, for example:
```python
Result = {
'predictions': List[Dict],
'results': Dict,
'cfg': Dict = {
'models': Dict,
'datasets': Dict,
        'judge_models': Dict  # subjective datasets only
}
}
```
Among these three keys, `predictions` records the model's prediction for each item in the dataset, `results` records the model's overall score on the dataset, and `cfg` records the detailed configurations of the model and the dataset for this evaluation task.

View File

@ -40,6 +40,7 @@ OpenCompass 上手路线
user_guides/evaluation.md
user_guides/experimentation.md
user_guides/metrics.md
user_guides/deepseek_r1.md
.. _提示词:
.. toctree::
@ -66,6 +67,7 @@ OpenCompass 上手路线
advanced_guides/code_eval.md
advanced_guides/code_eval_service.md
advanced_guides/subjective_evaluation.md
advanced_guides/persistence.md
.. _工具:
.. toctree::

View File

@ -0,0 +1,192 @@
# Tutorial for Evaluating Reasoning Models
OpenCompass provides an evaluation tutorial for the DeepSeek-R1 series of reasoning models on mathematical datasets.
- At the model level, we recommend using sampling-based decoding to reduce the heavy repetition caused by greedy decoding
- At the dataset level, for benchmarks with few samples we run the evaluation multiple times and take the average
- For answer validation, we uniformly use LLM-based verification to reduce misjudgments from rule-based evaluation
## Installation and Preparation
Please follow OpenCompass's installation guide.
## Evaluation Configuration Setup
We provide an example configuration in `examples/eval_deepseek_r1.py`. The configuration is explained below.
### Configuration Interpretation
#### 1. Dataset and Verifier Configuration
```python
# Dataset configuration supporting multiple runs (example)
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
# LLM verifier configuration. Users need to launch an API server in advance via LMDeploy/vLLM/SGLang, or use a model service with an OpenAI-compatible endpoint
verifier_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct', # Replace with the actual path
    key='YOUR_API_KEY', # Replace with a real API key
    openai_api_base=['http://your-api-endpoint'], # Replace with the API endpoint
query_per_second=16,
batch_size=1024,
temperature=0.001,
max_out_len=16384
)
# Apply the verifier to all datasets
for item in datasets:
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
```
#### 2. Model Configuration
We provide an evaluation example that uses LMDeploy as the inference backend; users can modify `path` (i.e., the HuggingFace model path) as needed.
```python
# LMDeploy model configuration example
models = [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-7b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768
),
max_seq_len=32768,
batch_size=64,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
# Extendable 14B/32B configurations...
]
```
#### 3. Evaluation Process Configuration
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=1),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

# Evaluation configuration
eval = dict(
partitioner=dict(type=NaivePartitioner, n=8),
runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)))
```
#### 4. Summary Configuration
```python
# Configuration for averaging results over multiple runs
summary_groups = [
{
'name': 'AIME2024-Aveage8',
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
},
# Average configurations for other datasets...
]
summarizer = dict(
dataset_abbrs=[
['AIME2024-Aveage8', 'naive_average'],
# Other dataset metrics...
],
summary_groups=summary_groups
)
# Work directory configuration
work_dir = "outputs/deepseek_r1_reasoning"
```
## Evaluation Execution
### Scenario 1: Model loaded on 1 GPU, data evaluated by 1 worker, using 1 GPU in total
```bash
opencompass examples/eval_deepseek_r1.py --debug --dump-eval-details
```
Evaluation logs will be output in the command line.
### Scenario 2: Model loaded on 1 GPU, data evaluated by 8 workers, using 8 GPUs in total
You need to modify the `infer` configuration in the config file and set `num_worker` to 8:
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
At the same time, remove the `--debug` parameter from the evaluation command:
```bash
opencompass examples/eval_deepseek_r1.py --dump-eval-details
```
In this mode, OpenCompass uses multithreading to start `$num_worker` tasks. Detailed logs are not printed to the command line; instead, they are written under `$work_dir`.
### Scenario 3: Model loaded on 2 GPUs, data evaluated by 4 workers, using 8 GPUs in total
Note that in the model configuration, `num_gpus` in `run_cfg` needs to be set to 2 (if you use an inference backend, its parallelism parameters need to be changed accordingly, e.g. `tp=2` in LMDeploy), and `num_worker` in the `infer` configuration needs to be set to 4:
```python
models += [
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-14b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=128,
run_cfg=dict(num_gpus=2),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
]
```
```python
# Inference configuration
infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=4),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)
```
### Evaluation Results
The evaluation results are displayed as follows:
```bash
dataset           version    metric         mode    deepseek-r1-distill-qwen-7b-turbomind
----------------  ---------  -------------  ------  ---------------------------------------
MATH              -          -              -
AIME2024-Aveage8  -          naive_average  gen     56.25
```
## Performance Baseline
Since the model decodes with sampling and the AIME dataset is small, a fluctuation of 1-3 points may still occur even when averaging over 8 evaluation runs.
| Model | Dataset | Metric | Value |
| ---------------------------- | -------- | -------- | ---- |
| DeepSeek-R1-Distill-Qwen-7B | AIME2024 | Accuracy | 56.3 |
| DeepSeek-R1-Distill-Qwen-14B | AIME2024 | Accuracy | 74.2 |
| DeepSeek-R1-Distill-Qwen-32B | AIME2024 | Accuracy | 74.2 |

View File

@ -0,0 +1,212 @@
# Support AIME-2024 with Repeat8
# Support MATH-500
# Support OlympiadBench
# Support OmniMath
# Support LiveMathBench-202412-Hard
import os.path as osp
from itertools import product
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask
from opencompass.runners import LocalRunner
from opencompass.models import (
TurboMindModelwithChatTemplate,
)
#######################################################################
# PART 1 Datasets List #
#######################################################################
with read_base():
# You can comment out the datasets you don't want to evaluate
# Datasets
# from opencompass.configs.datasets.math.math_prm800k_500_llmverify_gen_6ff468 import math_datasets # 1 Run
from opencompass.configs.datasets.aime2024.aime2024_llmverify_repeat8_gen_e8fcee import aime2024_datasets # 8 Run
# from opencompass.configs.datasets.OlympiadBench.OlympiadBench_0shot_llmverify_gen_be8b13 import olympiadbench_datasets
# from opencompass.configs.datasets.omni_math.omni_math_llmverify_gen_ccf9c0 import omnimath_datasets # 1 Run
# from opencompass.configs.datasets.livemathbench.livemathbench_hard_custom_llmverify_gen_85d0ef import livemathbench_datasets
# Summarizer
from opencompass.configs.summarizers.groups.OlympiadBench import OlympiadBenchMath_summary_groups
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')),
[],
)
# Set LLM Verifier used for each dataset
verifier_cfg = dict(
abbr='qwen2-5-32B-Instruct',
type=OpenAISDK,
path='Qwen/Qwen2.5-32B-Instruct', # You need to set your own judge model path
key='sk-1234', # You need to set your own API key
openai_api_base=[
'http://172.30.56.1:4000/v1', # You need to set your own API base
],
meta_template=dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
),
query_per_second=16,
batch_size=1024,
temperature=0.001,
tokenizer_path='gpt-4o-2024-05-13',
verbose=True,
max_out_len=16384,
# max_seq_len=32768,
max_seq_len=49152,
)
for item in datasets:
# item['infer_cfg']['inferencer']['max_out_len'] = 32768 # You can uncomment this line if you want to avoid length cutoff
if 'judge_cfg' in item['eval_cfg']['evaluator']:
item['eval_cfg']['evaluator']['judge_cfg'] = verifier_cfg
#######################################################################
# PART 2 Model List #
#######################################################################
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [
# You can comment out the models you don't want to evaluate
# All models use sampling mode
dict(
type=TurboMindModelwithChatTemplate,
abbr='deepseek-r1-distill-qwen-7b-turbomind',
path='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
engine_config=dict(session_len=32768, max_batch_size=128, tp=1),
gen_config=dict(
do_sample=True,
temperature=0.6,
top_p=0.95,
max_new_tokens=32768),
max_seq_len=32768,
max_out_len=32768,
batch_size=64,
run_cfg=dict(num_gpus=1),
pred_postprocessor=dict(type=extract_non_reasoning_content)
),
# dict(
# type=TurboMindModelwithChatTemplate,
# abbr='deepseek-r1-distill-qwen-14b-turbomind',
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
# engine_config=dict(session_len=32768, max_batch_size=128, tp=2),
# gen_config=dict(
# do_sample=True,
# temperature=0.6,
# top_p=0.95,
# max_new_tokens=32768),
# max_seq_len=32768,
# max_out_len=32768,
# batch_size=128,
# run_cfg=dict(num_gpus=2),
# pred_postprocessor=dict(type=extract_non_reasoning_content)
# ),
# dict(
# type=TurboMindModelwithChatTemplate,
# abbr='deepseek-r1-distill-qwen-32b-turbomind',
# path='deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
# engine_config=dict(session_len=32768, max_batch_size=128, tp=4),
# gen_config=dict(
# do_sample=True,
# temperature=0.6,
# top_p=0.95,
# max_new_tokens=16384),
# max_seq_len=32768,
# max_out_len=16384,
# batch_size=128,
# run_cfg=dict(num_gpus=4),
# pred_postprocessor=dict(type=extract_non_reasoning_content)
# ),
]
#######################################################################
# PART 3 Inference/Evaluation #
#######################################################################
# Inference configuration
infer = dict(
partitioner=dict(
type=NumWorkerPartitioner,
num_worker=1
        # Similar to data parallelism: how many workers for evaluation;
        # each worker will evaluate a part of the dataset. Total GPUs = num_worker * num_gpus_per_worker
        # For example, if you have 8 GPUs and a 7B model uses 1 GPU per instance, you can set num_worker=8
        # to fully utilize the GPUs.
        # If you have 8 GPUs and a 14B model uses 2 GPUs per instance, you can set num_worker=4
),
runner=dict(
type=LocalRunner,
task=dict(type=OpenICLInferTask)
),
)
# Evaluation configuration
eval = dict(
partitioner=dict(
type=NaivePartitioner, n=8
),
runner=dict(
type=LocalRunner,
task=dict(
type=OpenICLEvalTask)
),
)
#######################################################################
# PART 4 Summarizer #
#######################################################################
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []
)
summary_groups.extend([
{
'name': 'AIME2024-Aveage8',
'subsets':[[f'aime2024-run{idx}', 'accuracy'] for idx in range(8)]
},
{
'name': 'LiveMathBench-v202412-Hard-Aveage8',
'subsets':[[
f'livemathbench_hard_custom_{split}_run{run_idx}', 'accuracy']
for split, run_idx in product(['hard_cn', 'hard_en'], range(8))
]
}
])
# Summarizer
summarizer = dict(
dataset_abbrs=[
'MATH',
# ['LiveMathBench-k1-n1', 'pass@1'],
# ['LiveMathBench-v202412-greedy', 'G-Pass@1_0.0'],
# ['aime2024', 'accuracy'],
['math_prm800k_500-llmjudge', 'accuracy'],
['AIME2024-Aveage8', 'naive_average'],
['LiveMathBench-v202412-Hard-Aveage8', 'naive_average'],
['OlympiadBenchMath', 'accuracy'],
['OmniMath', 'accuracy'],
],
summary_groups=summary_groups,
)
#######################################################################
# PART 5 Utils #
#######################################################################
work_dir = 'outputs/deepseek_r1_reasoning'

View File

@ -1 +1 @@
__version__ = '0.4.0'
__version__ = '0.4.1'

View File

@ -12,7 +12,8 @@ from mmengine.config import Config, DictAction
from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
from opencompass.runners import SlurmRunner
from opencompass.summarizers import DefaultSummarizer
from opencompass.utils import LarkReporter, get_logger
from opencompass.utils import (LarkReporter, get_logger, read_from_station,
save_to_station)
from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
get_config_from_arg)
@ -127,6 +128,27 @@ def parse_args():
'correctness of each sample, bpb, etc.',
action='store_true',
)
parser.add_argument('-sp',
'--station-path',
help='Path to your results station.',
type=str,
default=None,
)
parser.add_argument('--station-overwrite',
help='Whether to overwrite the results at station.',
action='store_true',
)
parser.add_argument(
'--read-from-station',
help='Whether to read existing results from the '
'data station.',
action='store_true',
)
# set srun args
slurm_parser = parser.add_argument_group('slurm_args')
parse_slurm_args(slurm_parser)
@ -260,6 +282,12 @@ def main():
# types cannot be serialized
cfg = Config.fromfile(output_config_path, format_python_code=False)
# get existed results from station
if args.read_from_station:
existing_results_list = read_from_station(cfg, args)
rs_exist_results = [comb['combination'] for comb in existing_results_list]
cfg['rs_exist_results'] = rs_exist_results
# report to lark bot if specify --lark
if not args.lark:
cfg['lark_bot_url'] = None
@ -267,6 +295,7 @@ def main():
content = f'{getpass.getuser()}\'s task has been launched!'
LarkReporter(cfg['lark_bot_url']).post(content)
# infer
if args.mode in ['all', 'infer']:
# When user have specified --slurm or --dlc, or have not set
# "infer" in config, we will provide a default configuration
@ -348,6 +377,10 @@ def main():
else:
runner(tasks)
# save to station
if args.station_path is not None or cfg.get('station_path') is not None:
save_to_station(cfg, args)
# visualize
if args.mode in ['all', 'eval', 'viz']:
summarizer_cfg = cfg.get('summarizer', {})

View File

@ -0,0 +1,5 @@
from mmengine.config import read_base
with read_base():
# Default use LLM as a judge
from .hle_llmverify_gen_6ff468 import hle_datasets # noqa: F401, F403

View File

@ -0,0 +1,91 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets import HLEDataset
# ----------------------------- Detailed Config -----------------------------
math_reader_cfg = dict(input_columns=['problem'], output_column='answer')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{problem}\nRemember to put your final answer within \\boxed{}.'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
# Evaluation configuration
math_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=HLEDataset,
path='cais/hle',
reader_cfg=math_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
hle_datasets = [
dict(
type=HLEDataset,
abbr='hle_llmjudge',
path='cais/hle',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg,
)
]

View File

@ -0,0 +1,105 @@
from mmengine.config import read_base
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import OlympiadBenchDataset, OlympiadBenchEvaluator, olympiadbench_postprocess_v2
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
with read_base():
from .OlympiadBench_categories import math_categories as categories
# Create prompter instance for problems
olympiadbench_prompter_cfg = dict(
type='OlympiadBenchPrompter'
)
olympiadbench_reader_cfg = dict(
input_columns=[
'problem', 'language', 'subject', 'question_type',
'answer_type', 'is_multiple_answer', 'unit', 'questions'
],
output_column='solution'
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{solution}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
olympiadbenchMath_datasets = []
for _name in categories:
olympiadbench_infer_cfg = dict(
prompt_template=dict(
type='OlympiadBenchTemplate'
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Evaluation configuration
olympiadbench_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OlympiadBenchDataset,
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
pred_role='BOT',
)
olympiadbenchMath_datasets.append(
dict(
type=OlympiadBenchDataset,
abbr=f'OlympiadBench_{_name}',
path='opencompass/OlympiadBench',
name=_name,
reader_cfg=olympiadbench_reader_cfg,
infer_cfg=olympiadbench_infer_cfg,
eval_cfg=olympiadbench_eval_cfg,
)
)
del _name

View File

@ -5,3 +5,14 @@ categories = [
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
]
math_categories = [
'OE_TO_maths_en_COMP', # OpenEnded - TextOnly - maths - COMP
'OE_TO_maths_zh_COMP', # OpenEnded - TextOnly - maths - COMP
'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
]
physics_categories = [
'OE_TO_physics_en_COMP', # OpenEnded - TextOnly - physics - COMP
'OE_TO_physics_zh_CEE' # OpenEnded - TextOnly - physics - CEE
]

View File

@ -1,53 +1,43 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BigCodeBenchDataset,
BigCodeBenchEvaluator
)
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
bigcodebench_full_reader_cfg = dict(
input_columns=['complete_prompt'],
output_column='test',
input_columns=['complete_prompt'],
output_column='test',
)
bigcodebench_full_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system',
fallback_role='HUMAN',
prompt='')],
round=[
dict(role='HUMAN', prompt='{complete_prompt}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024)
)
bigcodebench_full_infer_cfg = dict(prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
round=[
dict(role='HUMAN', prompt='{complete_prompt}'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer,
max_out_len=1024))
bigcodebench_full_eval_cfg = dict(
evaluator=dict(
type=BigCodeBenchEvaluator,
release_version='v0.1.2',
eval_type='complete',
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
remote_execute_api=
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
dataset_version='full',
),
pred_role='BOT',
)
bigcodebench_full_complete_datasets = [
dict(
abbr='bigcodebench_full_complete',
type=BigCodeBenchDataset,
path='opencompass/bigcodebench',
reader_cfg=bigcodebench_full_reader_cfg,
infer_cfg=bigcodebench_full_infer_cfg,
eval_cfg=bigcodebench_full_eval_cfg,
release_version='v0.1.2'
)
]
dict(abbr='bigcodebench_full_complete',
type=BigCodeBenchDataset,
path='opencompass/bigcodebench',
reader_cfg=bigcodebench_full_reader_cfg,
infer_cfg=bigcodebench_full_infer_cfg,
eval_cfg=bigcodebench_full_eval_cfg,
release_version='v0.1.2')
]

View File

@ -1,53 +1,43 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BigCodeBenchDataset,
BigCodeBenchEvaluator
)
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
bigcodebench_full_reader_cfg = dict(
input_columns=['instruct_prompt'],
output_column='test',
input_columns=['instruct_prompt'],
output_column='test',
)
bigcodebench_full_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system',
fallback_role='HUMAN',
prompt='')],
round=[
dict(role='HUMAN', prompt='{instruct_prompt}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192)
)
bigcodebench_full_infer_cfg = dict(prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
round=[
dict(role='HUMAN', prompt='{instruct_prompt}'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer,
max_out_len=8192))
bigcodebench_full_eval_cfg = dict(
evaluator=dict(
type=BigCodeBenchEvaluator,
release_version='v0.1.2',
eval_type='instruct',
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
remote_execute_api=
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
dataset_version='full',
),
pred_role='BOT',
)
bigcodebench_full_instruct_datasets = [
dict(
abbr='bigcodebench_full_instruct',
type=BigCodeBenchDataset,
path='opencompass/bigcodebench',
reader_cfg=bigcodebench_full_reader_cfg,
infer_cfg=bigcodebench_full_infer_cfg,
eval_cfg=bigcodebench_full_eval_cfg,
release_version='v0.1.2'
)
]
dict(abbr='bigcodebench_full_instruct',
type=BigCodeBenchDataset,
path='opencompass/bigcodebench',
reader_cfg=bigcodebench_full_reader_cfg,
infer_cfg=bigcodebench_full_infer_cfg,
eval_cfg=bigcodebench_full_eval_cfg,
release_version='v0.1.2')
]

View File

@ -1,40 +1,32 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BigCodeBenchDataset,
BigCodeBenchEvaluator
)
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
bigcodebench_hard_reader_cfg = dict(
input_columns=['complete_prompt'],
output_column='test',
input_columns=['complete_prompt'],
output_column='test',
)
bigcodebench_hard_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system',
fallback_role='HUMAN',
prompt='')],
round=[
dict(role='HUMAN', prompt='{complete_prompt}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=1024)
)
bigcodebench_hard_infer_cfg = dict(prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
round=[
dict(role='HUMAN', prompt='{complete_prompt}'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer,
max_out_len=1024))
bigcodebench_hard_eval_cfg = dict(
evaluator=dict(
type=BigCodeBenchEvaluator,
release_version='v0.1.2',
eval_type='complete',
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
remote_execute_api=
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
dataset_version='hard',
),
pred_role='BOT',
@ -51,4 +43,4 @@ bigcodebench_hard_complete_datasets = [
release_version='v0.1.2',
dataset_version='hard',
)
]
]

View File

@ -1,40 +1,32 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (
BigCodeBenchDataset,
BigCodeBenchEvaluator
)
from opencompass.datasets import (BigCodeBenchDataset, BigCodeBenchEvaluator)
bigcodebench_hard_reader_cfg = dict(
input_columns=['instruct_prompt'],
output_column='test',
input_columns=['instruct_prompt'],
output_column='test',
)
bigcodebench_hard_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system',
fallback_role='HUMAN',
prompt='')],
round=[
dict(role='HUMAN', prompt='{instruct_prompt}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=8192)
)
bigcodebench_hard_infer_cfg = dict(prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[dict(role='system', fallback_role='HUMAN', prompt='')],
round=[
dict(role='HUMAN', prompt='{instruct_prompt}'),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer,
max_out_len=8192))
bigcodebench_hard_eval_cfg = dict(
evaluator=dict(
type=BigCodeBenchEvaluator,
release_version='v0.1.2',
eval_type='instruct',
remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
# remote_execute_api='https://bigcode-bigcodebench-evaluator.hf.space/',
remote_execute_api=
'https://opencompass-opencompass-bigcodebench-evaluator.hf.space', # noqa: E501
dataset_version='hard',
),
pred_role='BOT',
@ -51,4 +43,4 @@ bigcodebench_hard_instruct_datasets = [
release_version='v0.1.2',
dataset_version='hard',
)
]
]

View File

@ -0,0 +1,132 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import (LCBCodeGenerationDataset,
LCBCodeExecutionDataset,
LCBTestOutputPredictionDataset,
LCBCodeGenerationEvaluator,
LCBCodeExecutionEvaluator,
LCBTestOutputEvaluator)
lcb_code_generation_reader_cfg = dict(
input_columns=[
'question_content',
'format_prompt',
],
# output_column='evaluation_sample',
output_column='question_id',
)
SYSTEM_MESSAGE_GENERIC = 'You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program.' # noqa: E501
prompt_template = '### Question:\n{question_content}\n\n{format_prompt}' + \
'### Answer: (use the provided format with backticks)\n\n'
# Code Generation Tasks
lcb_code_generation_infer_cfg = dict(prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=prompt_template)])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
lcb_code_generation_eval_cfg = dict(
evaluator=dict(type=LCBCodeGenerationEvaluator,
num_process_evaluate=4,
timeout=6,
release_version='release_v5',
start_date='2024-08-01',
end_date='2025-02-01'),
pred_role='BOT',
)
LCBCodeGeneration_dataset = dict(
type=LCBCodeGenerationDataset,
abbr='lcb_code_generation',
path='opencompass/code_generation_lite',
reader_cfg=lcb_code_generation_reader_cfg,
infer_cfg=lcb_code_generation_infer_cfg,
eval_cfg=lcb_code_generation_eval_cfg,
release_version='release_v5',
)
# Code Execution Dataset
lcb_code_execution_reader_cfg = dict(
input_columns=[
'prompt',
],
output_column='evaluation_sample',
)
lcb_code_execution_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt=
'You are an expert at Python programming, code execution, test case generation, and fuzzing.' # noqa: E501
),
],
round=[dict(role='HUMAN', prompt='{prompt}')])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
lcb_code_execution_eval_cfg = dict(
evaluator=dict(type=LCBCodeExecutionEvaluator, ),
pred_role='BOT',
)
LCBCodeExecution_dataset = dict(
type=LCBCodeExecutionDataset,
abbr='lcb_code_execution',
path='opencompass/execution-v2',
reader_cfg=lcb_code_execution_reader_cfg,
infer_cfg=lcb_code_execution_infer_cfg,
eval_cfg=lcb_code_execution_eval_cfg,
)
# Test Output Prediction Dataset
lcb_test_output_reader_cfg = dict(
input_columns=[
'prompt',
],
output_column='evaluation_sample',
)
system_prompt = 'You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program.' # noqa: E501
lcb_test_output_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
# begin=[
# dict(
# role='SYSTEM',
# prompt=system_prompt
# ),
# ],
round=[dict(role='HUMAN', prompt='{prompt}')])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
lcb_test_output_eval_cfg = dict(
evaluator=dict(type=LCBTestOutputEvaluator, ),
pred_role='BOT',
)
LCBTestOutput_dataset = dict(
type=LCBTestOutputPredictionDataset,
abbr='lcb_test_output',
path='opencompass/test_generation',
reader_cfg=lcb_test_output_reader_cfg,
infer_cfg=lcb_test_output_infer_cfg,
eval_cfg=lcb_test_output_eval_cfg,
)
LCB_datasets = [
LCBCodeGeneration_dataset,
LCBCodeExecution_dataset,
LCBTestOutput_dataset,
]

View File

@ -0,0 +1,96 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import CustomDataset
from opencompass.datasets import generic_llmjudge_postprocess
from itertools import product
livemathbench_reader_cfg = dict(input_columns=['question'], output_column='answer')
# Inference configuration
livemathbench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt='{question}\n',
),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
# Template for the LLM judge
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{question}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
splits = ['hard_cn', 'hard_en']
# Dataset configuration
livemathbench_datasets = [
dict(
type=CustomDataset,
abbr=f'livemathbench_hard_custom_{split}_run{run_idx}',
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
infer_cfg=livemathbench_infer_cfg,
eval_cfg=dict(
# Evaluation configuration using LLM as judge
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.",
)
],
round=[
dict(role='HUMAN', prompt=GRADER_TEMPLATE),
],
),
),
dataset_cfg=dict(
type=CustomDataset,
path='data/LiveMathBench',
local_mode=True,
file_name=f'202412/{split}.jsonl',
reader_cfg=livemathbench_reader_cfg,
),
judge_cfg={},
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
),
) for split, run_idx in product(splits, range(8))
]

View File

@ -9,7 +9,7 @@ livemathbench_dataset = dict(
type=LiveMathBenchDataset,
path='',
k=16,
replication=3,
n=48,
dataset_splits=['hard'],
dataset_languages=['cn', 'en'],
cot=True,
@ -37,13 +37,7 @@ livemathbench_dataset = dict(
evaluator=dict(
type=LiveMathBenchEvaluator,
model_name='',
url=[],
use_extract_model=False,
extract_url=[],
extract_model_name='',
k=[4, 8, 16],
replication=3,
thresholds=[0.0, 0.25, 0.5, 0.75, 1.0]
url=[]
)
)
)

View File

@ -9,7 +9,7 @@ livemathbench_dataset = dict(
type=LiveMathBenchDataset,
path='',
k=1,
replication=1,
n=1,
dataset_splits=['hard'],
dataset_languages=['cn', 'en'],
cot=True,
@ -37,13 +37,7 @@ livemathbench_dataset = dict(
evaluator=dict(
type=LiveMathBenchEvaluator,
model_name='',
url=[],
use_extract_model=False,
extract_url=[],
extract_model_name='',
k=[1],
replication=1,
thresholds=[0.0]
url=[]
)
)
)

View File

@ -88,7 +88,7 @@ math_eval_cfg = dict(
math_datasets = [
dict(
type=MATHDataset,
abbr=f'math_prm800k_500-llmjudge-run{idx}',
abbr=f'math_prm800k_500-llmverify-run{idx}',
path='opencompass/math',
file_name = 'test_prm800k_500.json',
reader_cfg=math_reader_cfg,

View File

@ -0,0 +1,89 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.evaluator import GenericLLMEvaluator
from opencompass.datasets import generic_llmjudge_postprocess
from opencompass.datasets.omni_math import OmniMathDataset
omnimath_reader_cfg = dict(
input_columns=['problem'],
output_column='answer'
)
omnimath_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='please answer the following mathematical question, put your final answer in \\boxed{}.\n\n{problem}'),
]
)
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer)
)
GRADER_TEMPLATE = """
Please as a grading expert, judge whether the final answers given by the candidates below are consistent with the standard answers, that is, whether the candidates answered correctly.
Here are some evaluation criteria:
1. Please refer to the given standard answer. You don't need to re-generate the answer to the question because the standard answer has been given. You only need to judge whether the candidate's answer is consistent with the standard answer according to the form of the question. Don't try to answer the original question. You can assume that the standard answer is definitely correct.
2. Because the candidate's answer may be different from the standard answer in the form of expression, before making a judgment, please understand the question and the standard answer first, and then judge whether the candidate's answer is correct, but be careful not to try to answer the original question.
3. Some answers may contain multiple items, such as multiple-choice questions, multiple-select questions, fill-in-the-blank questions, etc. As long as the answer is the same as the standard answer, it is enough. For multiple-select questions and multiple-blank fill-in-the-blank questions, the candidate needs to answer all the corresponding options or blanks correctly to be considered correct.
4. Some answers may be expressed in different ways, such as some answers may be a mathematical expression, some answers may be a textual description, as long as the meaning expressed is the same. And some formulas are expressed in different ways, but they are equivalent and correct.
5. If the prediction is given with \\boxed{}, please ignore the \\boxed{} and only judge whether the candidate's answer is consistent with the standard answer.
Please judge whether the following answers are consistent with the standard answer based on the above criteria. Grade the predicted answer of this new question as one of:
A: CORRECT
B: INCORRECT
Just return the letters "A" or "B", with no text around it.
Here is your task. Simply reply with either CORRECT, INCORRECT. Don't apologize or correct yourself if there was a mistake; we are just trying to grade the answer.
<Original Question Begin>: \n{problem}\n<Original Question End>\n\n
<Gold Target Begin>: \n{answer}\n<Gold Target End>\n\n
<Predicted Answer Begin>: \n{prediction}\n<Predicted End>\n\n
Judging the correctness of candidates' answers:
""".strip()
omnimath_eval_cfg = dict(
evaluator=dict(
type=GenericLLMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role='SYSTEM',
fallback_role='HUMAN',
prompt="You are a helpful assistant who evaluates the correctness and quality of models' outputs.")
],
round=[
dict(
role='HUMAN',
prompt = GRADER_TEMPLATE
),
]),
),
dataset_cfg=dict(
type=OmniMathDataset,
reader_cfg=omnimath_reader_cfg,
),
judge_cfg=dict(),
dict_postprocessor=dict(type=generic_llmjudge_postprocess),
),
)
omnimath_datasets = [
dict(
type=OmniMathDataset,
abbr='OmniMath',
reader_cfg=omnimath_reader_cfg,
infer_cfg=omnimath_infer_cfg,
eval_cfg=omnimath_eval_cfg
)
]

View File

@ -9,3 +9,14 @@ categories = [
OlympiadBench_summary_groups = [
{'name': 'OlympiadBench', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in categories]},
]
math_categories = [
'OE_TO_maths_en_COMP', # OpenEnded - TextOnly - maths - COMP
'OE_TO_maths_zh_COMP', # OpenEnded - TextOnly - maths - COMP
'OE_TO_maths_zh_CEE', # OpenEnded - TextOnly - maths - CEE
]
OlympiadBenchMath_summary_groups = [
{'name': 'OlympiadBenchMath', 'subsets': ['OlympiadBench_' + c.replace(' ', '_') for c in math_categories]},
]

View File

@ -65,7 +65,7 @@ class TheoremQAEvaluatorV3(BaseEvaluator):
{
# "question": question,
# "solution": output,
"correct": groundtruth,
# "correct": groundtruth,
"pred": answer,
"is_correct": is_correct,
}

View File

@ -57,6 +57,7 @@ from .gpqa import * # noqa: F401, F403
from .gsm8k import * # noqa: F401, F403
from .gsm_hard import * # noqa: F401, F403
from .hellaswag import * # noqa: F401, F403
from .hle import * # noqa: F401, F403
from .huggingface import * # noqa: F401, F403
from .humaneval import * # noqa: F401, F403
from .humaneval_multi import * # noqa: F401, F403

View File

@ -197,11 +197,21 @@ class BigCodeBenchEvaluator(BaseEvaluator):
break
except (httpx.ReadTimeout, CancelledError):
logger.info('Read timeout error. Retrying in 4s...')
time.sleep(4)
time.sleep(10)
if 'pass@1' in pass_at_k.keys():
pass_at_k['pass@1'] *= 100
dump_results = {'details': results}
dump_results = {'details': self._results_processor(results)}
dump_results.update(pass_at_k)
return dump_results
def _results_processor(self, results):
details = []
for key, value in results['eval'].items():
if value[0]['status'] == 'pass':
value[0]['correct'] = True
else:
value[0]['correct'] = False
details.append(value[0])
return details
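
The `_results_processor` added above flattens BigCodeBench's per-problem `eval` mapping into a list of detail records carrying a boolean `correct` flag. A standalone sketch of the same transformation on a mocked payload:

```python
# Standalone sketch of the detail flattening, run on a mocked BigCodeBench-style
# payload; field names mirror the evaluator output, values are made up.
mock_results = {
    'eval': {
        'BigCodeBench/0': [{'status': 'pass', 'solution': 'def f(): ...'}],
        'BigCodeBench/1': [{'status': 'fail', 'solution': 'def f(): ...'}],
    }
}

details = []
for _, value in mock_results['eval'].items():
    value[0]['correct'] = value[0]['status'] == 'pass'
    details.append(value[0])

print([d['correct'] for d in details])  # [True, False]
```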

View File

@@ -1,5 +1,7 @@
import re
from opencompass.utils import get_logger
def get_final_results(judged_answers,
references,
@@ -68,7 +70,13 @@ def generic_llmjudge_postprocess(
processed_judge = _generic_llmjudge_postprocess(v['prediction'])
if processed_judge is not None:
judged_answers.append(processed_judge)
references.append(v['gold'])
try:
references.append(v['gold'])
except KeyError:
get_logger().warning(
f'No gold answer for {k}, use empty string as reference!')
references.append('')
results = get_final_results(judged_answers, references, origial_responses)
results['details'] = output
return results
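
The change above makes the postprocessor tolerant of judge outputs that carry no `gold` field. A minimal sketch of the fallback behavior on a mocked output dict:

```python
# Minimal sketch of the gold-field fallback on a mocked judge output dict.
mock_output = {
    '0': {'prediction': 'A', 'gold': '42'},
    '1': {'prediction': 'B'},  # no gold answer recorded for this item
}

judged_answers, references = [], []
for k, v in mock_output.items():
    judged_answers.append(v['prediction'])
    try:
        references.append(v['gold'])
    except KeyError:
        print(f'No gold answer for {k}, use empty string as reference!')
        references.append('')

print(references)  # ['42', '']
```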

View File

@@ -0,0 +1,17 @@
from datasets import load_dataset
from opencompass.registry import LOAD_DATASET
from .base import BaseDataset
@LOAD_DATASET.register_module()
class HLEDataset(BaseDataset):
@staticmethod
def load(path: str):
dataset = load_dataset(path)
dataset['test'] = dataset['test'].filter(lambda x: x['image'] == '')
dataset['test'] = dataset['test'].rename_column('question', 'problem')
dataset['train'] = dataset['test']
return dataset
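
The new loader keeps only text-only HLE questions (empty `image` field) and renames `question` to `problem`. A hedged usage sketch follows; the Hugging Face dataset id is an assumption, so check the config under `opencompass/configs/datasets/HLE` for the path actually wired in.

```python
# Hedged usage sketch; 'cais/hle' is assumed to be the Hugging Face id of
# Humanity's Last Exam and may require accepting the dataset's terms.
from opencompass.datasets import HLEDataset

ds = HLEDataset.load(path='cais/hle')
print(len(ds['test']))            # text-only subset after the image filter
print(ds['test'][0]['problem'])   # renamed from 'question'
```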

View File

@@ -146,9 +146,12 @@ def evaluate_generations(
with ProcessPoolExecutor(
max_workers=1 if debug else num_process_evaluate) as executor:
futures = {
executor.submit(evaluate_generations_by_problem,
problem_generations, sample, debug, timeout):
index
executor.submit(
evaluate_generations_by_problem, # noqa: E501
problem_generations,
sample,
debug,
timeout): index
for (problem_generations, sample, debug,
timeout), index in inputs
}
@@ -233,15 +236,27 @@ class LCBCodeGenerationEvaluator(BaseEvaluator):
num_process_evaluate,
timeout=6,
release_version='release_v1',
extractor_version='v1'):
extractor_version='v1',
start_date=None,
end_date=None):
super().__init__()
self.num_process_evaluate = num_process_evaluate
self.timeout = timeout
self.dataset = LCBCodeGenerationDataset.load(
release_version=release_version)['test']
release_version=release_version,
start_date=start_date,
end_date=end_date)['test']
self.extractor_version = extractor_version
def score(self, predictions, references):
if len(predictions) != len(references):
return {
'error':
'predictions and references have different '
f'length. len(predictions): {len(predictions)}, '
f'len(references): {len(references)}'
}
if self.extractor_version == 'v1':
predictions = [[extract_code_generation(item)]
for item in predictions]
@@ -254,19 +269,28 @@ class LCBCodeGenerationEvaluator(BaseEvaluator):
evaluation_samples[self.dataset[idx][
'question_id']] = self.dataset[idx]['evaluation_sample']
references = [evaluation_samples[item] for item in references]
filtered_predictions = []
filtered_references = []
for idx, item in enumerate(references):
if item in self.dataset['question_id']:
filtered_predictions.append(predictions[idx])
filtered_references.append(item)
references = [{'input_output': item} for item in references]
filtered_references = [
evaluation_samples[item] for item in filtered_references
] # noqa: E501
BaseEvaluator.is_num_equal(predictions, references)
filtered_references = [{
'input_output': item
} for item in filtered_references] # noqa: E501
extracted_predictions = {}
for idx, content in enumerate(predictions):
for idx, content in enumerate(filtered_predictions):
extracted_predictions[idx] = content
metrics, eval_results, final_metadata = codegen_metrics(
references,
predictions,
filtered_references,
filtered_predictions,
k_list=[1],
num_process_evaluate=self.num_process_evaluate,
timeout=self.timeout,

View File

@@ -6,6 +6,7 @@ import json
import pickle
import zlib
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from datasets import DatasetDict, load_dataset, load_from_disk
@@ -53,7 +54,9 @@ class LCBCodeGenerationDataset(BaseDataset):
@staticmethod
def load(path: str = 'opencompass/code_generation_lite',
local_mode: bool = False,
release_version: str = 'release_v1'):
release_version: str = 'release_v1',
start_date: str = None,
end_date: str = None):
def transform(item):
# Define the dataitem mapping logic
@@ -61,7 +64,7 @@ class LCBCodeGenerationDataset(BaseDataset):
# starter_code
if item['starter_code']:
format_prompt = f'### Format: {CodeGenerationPromptConstants.FORMATTING_MESSAGE_WITH_STARTER_CODE}\n' # noqa: E501
format_prompt += f"```python\n{item['starter_code']}\n```\n\n"
format_prompt += f"```python\n{item['starter_code']}\n```\n\n" # noqa: Q000, E501
else:
format_prompt = f'### Format: {CodeGenerationPromptConstants.FORMATTING_WITHOUT_STARTER_CODE}\n' # noqa: E501
format_prompt += '```python\n# YOUR CODE HERE\n```\n\n'
@@ -107,6 +110,16 @@ class LCBCodeGenerationDataset(BaseDataset):
dataset = dataset.map(transform)
if start_date is not None:
p_start_date = datetime.strptime(start_date, '%Y-%m-%d')
dataset = dataset.filter(
lambda e: p_start_date <= datetime.fromisoformat(e[
'contest_date'])) # noqa: E501
if end_date is not None:
p_end_date = datetime.strptime(end_date, '%Y-%m-%d')
dataset = dataset.filter(lambda e: datetime.fromisoformat(e[
'contest_date']) <= p_end_date) # noqa: E501
return DatasetDict({'test': dataset, 'train': dataset})
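
The loader now accepts an optional contest-date window. A hedged sketch of calling it directly; the dates are illustrative and the import path is assumed to follow the package's existing re-exports.

```python
# Hedged sketch of the new contest-date window; dates are illustrative and are
# parsed with the same '%Y-%m-%d' format used by the loader above.
from opencompass.datasets import LCBCodeGenerationDataset

ds = LCBCodeGenerationDataset.load(
    release_version='release_v1',
    start_date='2023-09-01',
    end_date='2024-01-31',
)['test']
print(len(ds), 'problems fall inside the date window')
```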

View File

@@ -41,9 +41,8 @@ class LiveMathBenchDataset(BaseDataset):
dataset = []
dataset_info = {}
if path != '':
path = get_data_path(path)
path = os.path.join(path, version)
# Use dataset mapping to generate path
data_dir = get_data_path(path)
for split, language in product(dataset_splits, dataset_languages):
dataset_info[f'{split}_{language}'] = {
@@ -59,8 +58,17 @@
'问答': 'problem-solving'
}
if path != '':
file_path = os.path.join(path, f'{split}_{language}.jsonl')
examples = []
if data_dir.startswith('opencompass/'):
# Using HF Dataset
hf_dataset = load_dataset(
data_dir, f'v{version}_{split}_{language}')['test']
for example in hf_dataset:
examples.append(example)
else:
file_path = os.path.join(data_dir, version,
f'{split}_{language}.jsonl')
if not os.path.exists(file_path):
raise FileNotFoundError(
f'File {file_path} does not exist, please check the '
@@ -69,13 +77,6 @@
with jsonlines.open(file_path, 'r') as file:
for example in file:
examples.append(example)
else:
hf_dataset = load_dataset(
'opencompass/LiveMathBench',
f'v{version}_{split}_{language}')['test']
examples = []
for example in hf_dataset:
examples.append(example)
for example_idx, example in enumerate(examples):
dataset_info[f'{split}_{language}'][

View File

@@ -130,6 +130,7 @@ class TurboMindModelwithChatTemplate(BaseModel):
if self.fastchat_template:
messages = _format_with_fast_chat_template(messages, self.fastchat_template)
else:
# NOTE: the chat template of DeepSeek-R1 series models appends <think> after the generation prompt
messages = [self.tokenizer.apply_chat_template(m, add_generation_prompt=True, tokenize=False) for m in messages]
# LMDeploy tokenize prompts by AutoTokenizer with its default parameter "add_special_token=True"
# OC add bos_token in the prompt, which requires tokenizing prompts using "add_speicial_token=False"
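
For context on the note above: with `add_generation_prompt=True`, reasoning-style chat templates such as DeepSeek-R1's may already end the rendered prompt with a `<think>` opener. A quick transformers-only illustration; the model id is an assumption and only its tokenizer is used.

```python
# Quick illustration of add_generation_prompt with a reasoning-style template;
# the model id is an assumption and only its tokenizer/chat template is used.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-R1-Distill-Qwen-7B')
prompt = tok.apply_chat_template(
    [{'role': 'user', 'content': 'What is 2 + 2?'}],
    add_generation_prompt=True,
    tokenize=False,
)
print(repr(prompt))  # R1-style templates typically end the prompt with '<think>'
```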

View File

@@ -2,8 +2,12 @@
import json
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS
class TEvalEvaluator:
@ICL_EVALUATORS.register_module()
class TEvalEvaluator(BaseEvaluator):
"""This module contains the following evaluators for evaluating the
capabilities of the various dimensions of the LLM.

View File

@@ -102,6 +102,7 @@ class BasePartitioner:
return tasks
def parse_model_dataset_args(self, cfg: ConfigDict):
models = cfg['models']
datasets = cfg['datasets']
@@ -109,7 +110,24 @@ class BasePartitioner:
if 'model_dataset_combinations' in sig.parameters:
combs = cfg.get('model_dataset_combinations', None)
if combs is None:
combs = [{'models': models, 'datasets': datasets}]
if 'rs_exist_results' in cfg.keys():
rs_exist_results = cfg['rs_exist_results']
combs = []
for model in models:
comb = {'models': [model], 'datasets': datasets}
combs.append(comb)
for i in range(len(combs)):
combs[i]['datasets'] = [
dataset for dataset in combs[i]['datasets'] if [
model_abbr_from_cfg(combs[i]['models'][0]),
dataset_abbr_from_cfg(dataset)
] not in rs_exist_results
]
combs = [
comb for comb in combs if len(comb['datasets']) != 0
]
else:
combs = [{'models': models, 'datasets': datasets}]
else:
# sanity check
model_abbrs = [model_abbr_from_cfg(model) for model in models]
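
The partitioner change above drops model/dataset pairs whose results were already found in the result station (`rs_exist_results`), so only missing combinations are scheduled. A standalone sketch of the same filtering using plain abbreviation strings instead of full config dicts:

```python
# Standalone sketch of the rs_exist_results filtering, using plain abbreviation
# strings; the model and dataset names are illustrative.
models = ['internlm3-8b-instruct', 'qwen2.5-7b-instruct']
datasets = ['gsm8k', 'math', 'hle']
rs_exist_results = [['internlm3-8b-instruct', 'gsm8k'],
                    ['qwen2.5-7b-instruct', 'math']]

combs = []
for model in models:
    remaining = [d for d in datasets if [model, d] not in rs_exist_results]
    if remaining:
        combs.append({'models': [model], 'datasets': remaining})

print(combs)
# Each model keeps only the datasets it has not been scored on yet.
```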

View File

@@ -14,4 +14,5 @@ from .model_postprocessors import * # noqa
from .network import * # noqa
from .postprocessors import * # noqa
from .prompt import * # noqa
from .result_station import * # noqa
from .text_postprocessors import * # noqa

View File

@@ -376,7 +376,7 @@ DATASETS_MAPPING = {
"opencompass/LiveReasonBench": {
"ms_id": "",
"hf_id": "",
"local": "./data/LiveReasonBench/",
"local": "./data/LiveReasonBench/",
},
"opencompass/bigcodebench": {
"ms_id": "",
@@ -412,251 +412,313 @@ DATASETS_MAPPING = {
DATASETS_URL = {
"/OlympiadBench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/OlympiadBench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/OlympiadBench.zip",
"md5": "97e8b1ae7f6170d94817288a8930ef00",
},
"/longbenchv2":{
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/longbenchv2.zip",
"/longbenchv2": {
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/longbenchv2.zip",
"md5": "09b7e06e6f98c5cca8ad597b3d7b42f0",
},
"/livestembench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/livestembench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/livestembench.zip",
"md5": "0ff59d031c3dcff56a2e00e8c1489f5d",
},
"/musr": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
"md5": "7447d2a5bec4586035196102135e2af9",
},
"/mmlu/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
"md5": "761310671509a239e41c4b717f7fab9c",
},
"/mmmlu_lite": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmmlu_lite.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmmlu_lite.zip",
"md5": "a776af1220e1826fd0608eda1bc4425e",
},
"/simpleqa": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/simpleqa.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/simpleqa.zip",
"md5": "1d83fc2e15798d39cb265c9a3cb5195a",
},
"/chinese_simpleqa": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/chinese_simpleqa.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/chinese_simpleqa.zip",
"md5": "4bdf854b291fc0ee29da57dc47ac47b5",
},
"/gpqa/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gpqa.zip",
"md5": "2e9657959030a765916f1f2aca29140d",
},
"/CHARM/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/CHARM.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/CHARM.zip",
"md5": "fdf51e955d1b8e0bb35bc1997eaf37cb",
},
"/ifeval/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ifeval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ifeval.zip",
"md5": "64d98b6f36b42e7390c9cef76cace75f",
},
"/mbpp/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mbpp.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mbpp.zip",
"md5": "777739c90f04bce44096a5bc96c8f9e5",
},
"/cmmlu/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmmlu.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmmlu.zip",
"md5": "a59f4003d6918509a719ce3bc2a5d5bc",
},
"/math/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/math.zip",
"md5": "cb5b4c8378085929e20345174e731fdf",
},
"/hellaswag/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/hellaswag.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/hellaswag.zip",
"md5": "2b700a02ffb58571c7df8d8d0619256f",
},
"/BBH/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/BBH.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/BBH.zip",
"md5": "60c49f9bef5148aa7e1941328e96a554",
},
"/compass_arena/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/compass_arena.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/compass_arena.zip",
"md5": "cd59b54a179d16f2a858b359b60588f6",
},
"/TheoremQA/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/TheoremQA.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/TheoremQA.zip",
"md5": "f2793b07bc26510d507aa710d9bd8622",
},
"/mathbench_v1/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mathbench_v1.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mathbench_v1.zip",
"md5": "50257a910ca43d1f61a610a79fdb16b5",
},
"/gsm8k/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/gsm8k.zip",
"md5": "901e5dc93a2889789a469da9850cdca8",
},
"/LCBench2023/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LCBench2023.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LCBench2023.zip",
"md5": "e1a38c94a42ad1809e9e0650476a9306",
},
"/humaneval/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humaneval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humaneval.zip",
"md5": "88b1b89dc47b7121c81da6bcd85a69c3",
},
"/humanevalx": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humanevalx.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/humanevalx.zip",
"md5": "22930355c03fb73fb5bae14b50f1deb9",
},
"/ds1000_data": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ds1000_data.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ds1000_data.zip",
"md5": "1a4990aec04a2fd73ccfad12e2d43b43",
},
"/drop_simple_eval/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/drop_simple_eval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/drop_simple_eval.zip",
"md5": "c912afe5b4a63509851cf16e6b91830e",
},
"subjective/alignment_bench/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alignment_bench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alignment_bench.zip",
"md5": "d8ae9a0398526479dbbcdb80fafabceb",
},
"subjective/alpaca_eval": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alpaca_eval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/alpaca_eval.zip",
"md5": "d7399d63cb46c82f089447160ef49b6a",
},
"subjective/arena_hard": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arena_hard.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arena_hard.zip",
"md5": "02cd09a482cb0f0cd9d2c2afe7a1697f",
},
"subjective/mtbench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench.zip",
"md5": "d1afc0787aeac7f1f24872742e161069",
},
"subjective/fofo": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/fofo.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/fofo.zip",
"md5": "8a302712e425e27e4292a9369df5b9d3",
},
"subjective/followbench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/followbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/followbench.zip",
"md5": "da7a831817c969da15d1e78d4a245d8a",
},
"subjective/mtbench101": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench101.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mtbench101.zip",
"md5": "5d80257bc9929ebe5cfbf6d11184b04c",
},
"subjective/WildBench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/wildbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/wildbench.zip",
"md5": "b06252857f1f8f44a17b1bfca4888ff4",
},
"/ruler/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ruler.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ruler.zip",
"md5": "c60bdfff3d02358067104cc1dea7c0f7",
},
"/scicode": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/scicode.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/scicode.zip",
"md5": "9c6c64b8c70edc418f713419ea39989c",
},
"/commonsenseqa": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/commonsenseqa.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/commonsenseqa.zip",
"md5": "c4a82fc07c81ae1462605f5d7fd2bb2e",
},
"FewCLUE": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/FewCLUE.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/FewCLUE.zip",
"md5": "7976e2bb0e9d885ffd3c55f7c5d4021e",
},
"/race": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/race.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/race.zip",
"md5": "b758251764a264746cf45749c02363f9",
},
"/ARC": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ARC.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/ARC.zip",
"md5": "d720629b69f1a51cfe78bf65b00b44f6",
},
"/SuperGLUE": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SuperGLUE.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SuperGLUE.zip",
"md5": "b60904915b0b61d1a04ea52280169936",
},
"SQuAD2.0": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SQuAD2.0.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/SQuAD2.0.zip",
"md5": "1321cbf9349e1102a57d31d1b2bfdd7e",
},
"mmlu_pro": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu_pro.zip",
"md5": "e3200c7380f4cea5f13c768f2815fabb",
},
"/Longbench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/Longbench.zip",
"md5": "ab0cb9e520ae5cfb899bf38b564249bb",
},
"/needlebench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/needlebench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/needlebench.zip",
"md5": "dad5c903ebfea16eaf186b8997aeedad",
},
"/teval": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/teval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/teval.zip",
"md5": "7628ab5891a26bf96ca17becfd044867",
},
"/code_generation_lite": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/code_generation_lite.zip",
"md5": "60103a18ca63b05ea06e98d24170f23d",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/code_generation_lite.zip",
"md5": "ebcf8db56f5c817ca8202a542be30cb4",
},
"/execution-v2": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/execution-v2.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/execution-v2.zip",
"md5": "019ef1a0686ee6ca34f51c8af104fcd9",
},
"/test_generation": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/test_generation.zip",
"md5": "918a6ea2b1eee6f2b1314db3c21cb4c7",
},
"/aime": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/aime.zip",
"md5": "fbe2d0577fc210962a549f8cea1a00c8",
},
"/cmo": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/cmo.zip",
"md5": "fad52c81290506a8ca74f46b5400d8fc",
},
},
"/nq-open": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/nq-open.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/nq-open.zip",
"md5": "a340521e5c9ec591227dcb367f718b25",
},
"/winogrande": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/winogrande.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/winogrande.zip",
"md5": "9e949a75eacc26ed4fd2b9aa870b495b",
},
"/triviaqa": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/triviaqa.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/triviaqa.zip",
"md5": "e6a118d744236814926b2ec7ec66c034",
},
"/GAOKAO-BENCH": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/GAOKAO-BENCH.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/GAOKAO-BENCH.zip",
"md5": "ba3c71b8b9db96d2a0664b977c4f9784",
},
"/WikiBench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/WikiBench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/WikiBench.zip",
"md5": "6dac1d1a3133fe1effff185cbf71d928",
},
"/babilong": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/babilong.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/babilong.zip",
"md5": "e400864c31bc58d29eaa3e199751f99b",
},
"/korbench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/korbench.zip",
"md5": "9107597d137e7362eaf7d218ddef7a6d",
},
"subjective/judgerbench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/judgerbench.zip",
"md5": "60d605883aa8cac9755819140ab42c6b"
},
"/arc_prize_public_evaluation": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arc_prize_public_evaluation.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/arc_prize_public_evaluation.zip",
"md5": "367a33977651496efddba7670009807e"
},
"P-MMEval": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/pmmeval.zip",
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/pmmeval.zip",
"md5": "09e401e6229a50647b9e13c429e634d1",
},
"LiveMathBench": {
'url': "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LiveMathBench.zip",
'url':
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/LiveMathBench.zip",
"md5": "d0781f9185c9bb50e81e6e3ca8c59013",
},
"bigcodebench": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bigcodebench.zip",
"md5": "2c1c7956ca49a1124617e8c037ec57d8"
"url":
"http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/bigcodebench.zip",
"md5": "270f399f4142b74f47ecff116cc3b21d"
}
}

View File

@@ -0,0 +1,417 @@
import json
import os
import os.path as osp
import re
from opencompass.utils.abbr import (dataset_abbr_from_cfg,
deal_with_judge_model_abbr,
model_abbr_from_cfg)
def save_to_station(cfg, args):
if args.station_path is not None:
station_path = args.station_path
else:
station_path = cfg.get('station_path')
work_dict = cfg['work_dir']
# objective dataset processing
if 'judge_models' not in cfg.keys():
model_list = [model_abbr_from_cfg(model) for model in cfg['models']]
dataset_list = [
dataset_abbr_from_cfg(dataset) for dataset in cfg['datasets']
]
rs_exist_results = []
if 'rs_exist_results' in cfg.keys():
rs_exist_results = cfg['rs_exist_results']
for dataset in dataset_list:
result_path = osp.join(station_path, dataset)
if not osp.exists(result_path):
os.makedirs(result_path)
for model in model_list:
if ([model, dataset] in rs_exist_results
and not args.station_overwrite):
continue
result_file_name = model + '.json'
if osp.exists(osp.join(
result_path,
result_file_name)) and not args.station_overwrite:
print('result of {} with {} already exists'.format(
dataset, model))
continue
else:
# get result dict
local_result_path = osp.join(work_dict, 'results', model)
local_result_json = osp.join(local_result_path,
dataset + '.json')
if not osp.exists(local_result_json):
if args.mode == 'viz':
continue
raise ValueError(
'invalid file: {}'.format(local_result_json))
with open(local_result_json, 'r') as f:
this_result = json.load(f)
f.close()
# get prediction list
local_prediction_path = osp.join(work_dict, 'predictions',
model)
local_prediction_regex = \
rf'^{re.escape(dataset)}(?:_\d+)?\.json$'
local_prediction_json = find_files_by_regex(
local_prediction_path, local_prediction_regex)
if not check_filenames(
dataset,
local_prediction_json) and args.mode != 'viz':
raise ValueError('invalid filelist: {}'.format(
local_prediction_json))
this_prediction = []
for prediction_json in local_prediction_json:
with open(
osp.join(local_prediction_path,
prediction_json), 'r') as f:
this_prediction_load_json = json.load(f)
f.close()
for prekey in this_prediction_load_json.keys():
this_prediction.append(
this_prediction_load_json[prekey])
# get config dict
model_cfg = [
i for i in cfg['models']
if model_abbr_from_cfg(i) == model
][0]
dataset_cfg = [
i for i in cfg['datasets']
if dataset_abbr_from_cfg(i) == dataset
][0]
this_cfg = {'models': model_cfg, 'datasets': dataset_cfg}
# dict combine
data_model_results = {
'predictions': this_prediction,
'results': this_result,
'cfg': this_cfg
}
with open(osp.join(result_path, result_file_name),
'w') as f:
json.dump(data_model_results,
f,
ensure_ascii=False,
indent=4)
f.close()
print(
'successfully saved result of {} with {} to the station'
.format(dataset, model))
return True
# subjective processing
else:
model_list = [model for model in cfg['models']]
judge_list = [judge_model for judge_model in cfg['judge_models']]
model_pair_list = [[
deal_with_judge_model_abbr(model, judge_model)
for judge_model in judge_list
] for model in model_list]
dataset_list = [[
dataset_abbr_from_cfg(dataset),
[dataset_abbr_from_cfg(base) for base in dataset['base_models']]
] if 'base_models' in dataset.keys() else
[dataset_abbr_from_cfg(dataset), ['']]
for dataset in cfg['datasets']]
rs_exist_results = []
if 'rs_exist_results' in cfg.keys():
rs_exist_results = cfg['rs_exist_results']
for pair_of_dataset_and_base in dataset_list:
dataset, base_list = pair_of_dataset_and_base[
0], pair_of_dataset_and_base[1]
result_path = osp.join(station_path, dataset)
if not osp.exists(result_path):
os.makedirs(result_path)
for base_model in base_list:
base_model_name = base_model
if base_model_name != '':
base_model_name += '_'
for model_pair_sub_list in model_pair_list:
for model_pair in model_pair_sub_list:
model = model_abbr_from_cfg(model_pair[0])
model_result = model_abbr_from_cfg(model_pair)
if ([model, dataset] in rs_exist_results
and not args.station_overwrite):
continue
result_file_name = (base_model_name + model_result +
'.json')
if osp.exists(osp.join(result_path, result_file_name)
) and not args.station_overwrite:
print('{} at {} already exists'.format(
result_file_name, result_path))
continue
else:
# get result dict
local_result_path = osp.join(
work_dict, 'results',
base_model_name + model_result)
local_result_json = osp.join(
local_result_path, dataset + '.json')
if not osp.exists(local_result_json):
if args.mode == 'viz':
continue
raise ValueError('invalid file: {}'.format(
local_result_json))
with open(local_result_json, 'r') as f:
this_result = json.load(f)
f.close()
# get prediction list
local_prediction_path = osp.join(
work_dict, 'predictions', model)
local_prediction_regex = \
rf'^{re.escape(dataset)}(?:_\d+)?\.json$'
local_prediction_json = find_files_by_regex(
local_prediction_path, local_prediction_regex)
if not check_filenames(dataset,
local_prediction_json
) and args.mode != 'viz':
raise ValueError('invalid filelist: {}'.format(
local_prediction_json))
this_prediction = []
for prediction_json in local_prediction_json:
with open(
osp.join(local_prediction_path,
prediction_json), 'r') as f:
this_prediction_load_json = json.load(f)
f.close()
for prekey in this_prediction_load_json.keys():
this_prediction.append(
this_prediction_load_json[prekey])
# get config dict
model_cfg = [
i for i in cfg['models']
if model_abbr_from_cfg(i) == model
][0]
dataset_cfg = [
i for i in cfg['datasets']
if dataset_abbr_from_cfg(i) == dataset
][0]
judge_model_cfg = [
i for i in cfg['judge_models']
if 'judged-by--' + model_abbr_from_cfg(i) ==
model_abbr_from_cfg(model_pair[1])
]
this_cfg = {
'models': model_cfg,
'datasets': dataset_cfg,
'judge_models': judge_model_cfg
}
# dict combine
data_model_results = {
'predictions': this_prediction,
'results': this_result,
'cfg': this_cfg
}
with open(osp.join(result_path, result_file_name),
'w') as f:
json.dump(data_model_results,
f,
ensure_ascii=False,
indent=4)
f.close()
print('successfully saved result: {} at {} to the '
'station'.format(result_file_name,
result_path))
return True
def read_from_station(cfg, args):
assert args.station_path is not None or cfg.get('station_path') is not None
if args.station_path is not None:
station_path = args.station_path
else:
station_path = cfg.get('station_path')
# objective check
if 'judge_models' not in cfg.keys():
model_list = [model_abbr_from_cfg(model) for model in cfg['models']]
dataset_list = [
dataset_abbr_from_cfg(dataset) for dataset in cfg['datasets']
]
existing_results_list = []
result_local_path = osp.join(cfg['work_dir'], 'results')
if not osp.exists(result_local_path):
os.makedirs(result_local_path)
for dataset in dataset_list:
for model in model_list:
result_file_path = osp.join(station_path, dataset,
model + '.json')
if not osp.exists(result_file_path):
print('did not find result file: {} with {} at station'.
format(model, dataset))
continue
else:
print('found result file: {} with {} at station'.format(
model, dataset))
with open(result_file_path, 'r') as f:
download_json = json.load(f)
f.close()
existing_results_list.append({
'combination': [model, dataset],
'file':
download_json
})
# save results to local
for i in existing_results_list:
this_result = i['file']['results']
this_result_local_path = osp.join(result_local_path,
i['combination'][0])
if not osp.exists(this_result_local_path):
os.makedirs(this_result_local_path)
this_result_local_file_path = osp.join(
this_result_local_path, i['combination'][1] + '.json')
if osp.exists(this_result_local_file_path):
continue
with open(this_result_local_file_path, 'w') as f:
json.dump(this_result, f, ensure_ascii=False, indent=4)
f.close()
return existing_results_list
# subjective check
else:
model_list = [model for model in cfg['models']]
judge_list = [judge_model for judge_model in cfg['judge_models']]
model_pair_list = [[
deal_with_judge_model_abbr(model, judge_model)
for judge_model in judge_list
] for model in model_list]
dataset_list = [[
dataset_abbr_from_cfg(dataset),
[dataset_abbr_from_cfg(base) for base in dataset['base_models']]
] if 'base_models' in dataset.keys() else
[dataset_abbr_from_cfg(dataset), ['']]
for dataset in cfg['datasets']]
existing_results_list = []
result_local_path = osp.join(cfg['work_dir'], 'results')
if not osp.exists(result_local_path):
os.makedirs(result_local_path)
for pair_of_dataset_and_base in dataset_list:
dataset, base_list = pair_of_dataset_and_base[
0], pair_of_dataset_and_base[1]
for model_pair_sub_list in model_pair_list:
result_file_path_list_origin = []
for model_pair in model_pair_sub_list:
model_result = model_abbr_from_cfg(model_pair)
for base_model in base_list:
base_model_name = base_model
if base_model_name != '':
base_model_name += '_'
result_file_path_list_origin.append(
osp.join(station_path, dataset,
base_model_name + model_result + '.json'))
result_file_path_list = [
result_file_path
for result_file_path in result_file_path_list_origin
if osp.exists(result_file_path)
]
model = model_abbr_from_cfg(model_pair_sub_list[0][0])
# save all parts of results to local
for result_file_path in result_file_path_list:
with open(result_file_path, 'r') as f:
this_result = json.load(f)['results']
f.close()
this_result_local_path = osp.join(
result_local_path,
osp.splitext(osp.basename(result_file_path))[0])
if not osp.exists(this_result_local_path):
os.makedirs(this_result_local_path)
this_result_local_file_path = osp.join(
this_result_local_path, dataset + '.json')
if osp.exists(this_result_local_file_path):
continue
with open(this_result_local_file_path, 'w') as f:
json.dump(this_result, f, ensure_ascii=False, indent=4)
f.close()
# check whether complete
if len(result_file_path_list) == len(
result_file_path_list_origin):
print('found complete results of {} with {} at station'.
format(model, dataset))
existing_results_list.append({
'combination': [model, dataset],
'file':
result_file_path_list
})
else:
print('results of {} with {} at station are not complete'.
format(model, dataset))
return existing_results_list
def find_files_by_regex(directory, pattern):
regex = re.compile(pattern)
matched_files = []
for filename in os.listdir(directory):
if regex.match(filename):
matched_files.append(filename)
return matched_files
def check_filenames(x, filenames):
if not filenames:
return False
single_pattern = re.compile(rf'^{re.escape(x)}\.json$')
numbered_pattern = re.compile(rf'^{re.escape(x)}_(\d+)\.json$')
is_single = all(single_pattern.match(name) for name in filenames)
is_numbered = all(numbered_pattern.match(name) for name in filenames)
if not (is_single or is_numbered):
return False
if is_single:
return len(filenames) == 1
if is_numbered:
numbers = []
for name in filenames:
match = numbered_pattern.match(name)
if match:
numbers.append(int(match.group(1)))
if sorted(numbers) != list(range(len(numbers))):
return False
return True
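
The helpers above validate prediction shards: `check_filenames` accepts either a single `<dataset>.json` file or a gapless numbered series `<dataset>_0.json ... <dataset>_n.json`, and rejects gaps or mixed forms. Each station entry written by `save_to_station` is one JSON per model under a per-dataset folder, holding `predictions`, `results`, and `cfg`; a sketch of reading such an entry back, with an illustrative path and abbreviations:

```python
# Sketch of reading one station entry back; the station path and the model and
# dataset abbreviations are illustrative, matching the layout written above.
import json
import os.path as osp

station_path = '/your_path'
entry = osp.join(station_path, 'gsm8k', 'internlm3-8b-instruct.json')
with open(entry, 'r') as f:
    payload = json.load(f)

print(sorted(payload))      # ['cfg', 'predictions', 'results']
print(payload['results'])   # per-dataset metric dict for this model
```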

View File

@@ -37,7 +37,7 @@ rouge_score
sacrebleu
scikit_learn==1.5.0
seaborn
sentence_transformers==2.2.2
sentence_transformers
tabulate
tiktoken
timeout_decorator