# Configure Datasets
This tutorial mainly focuses on selecting datasets supported by OpenCompass and preparing their config files. Please make sure you have downloaded the datasets by following the steps in [Dataset Preparation](../get_started/installation.md#dataset-preparation).
## Directory Structure of Dataset Configuration Files
First, let's introduce the structure under the `configs/datasets` directory in OpenCompass, as shown below:
```
configs/datasets/
├── agieval
├── apps
├── ARC_c
├── ...
├── CLUE_afqmc # dataset
│   ├── CLUE_afqmc_gen_901306.py # different config versions
│   ├── CLUE_afqmc_gen.py
│   ├── CLUE_afqmc_ppl_378c5b.py
│   ├── CLUE_afqmc_ppl_6507d7.py
│   ├── CLUE_afqmc_ppl_7b0c1e.py
│   └── CLUE_afqmc_ppl.py
├── ...
├── XLSum
├── Xsum
└── z_bench
```
The `configs/datasets` directory has a flat structure: every dataset gets its own folder directly under it, and each folder contains multiple configuration files for that dataset.
Dataset configuration files are named following the pattern `{dataset name}_{evaluation method}_{prompt version number}.py`. For example, `CLUE_afqmc/CLUE_afqmc_gen_db509b.py` is a configuration for the `CLUE_afqmc` dataset (under the Chinese general-ability category); its evaluation method is `gen`, i.e., generative evaluation, and its prompt version number is `db509b`. Similarly, `CLUE_afqmc_ppl_00b348.py` uses the evaluation method `ppl`, i.e., discriminative evaluation, with prompt version number `00b348`.

In addition, files without a version number, such as `CLUE_afqmc_gen.py`, point to the latest prompt configuration file for that evaluation method, which usually contains the most accurate prompt.
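For instance, based on the directory listing above, a versionless file such as `CLUE_afqmc_gen.py` typically just re-exports the datasets from the latest versioned config (a sketch; the exact version suffix it imports from may differ):

```python
from mmengine.config import read_base

with read_base():
    from .CLUE_afqmc_gen_901306 import afqmc_datasets  # noqa: F401, F403
```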
## Dataset Selection
In each dataset configuration file, the dataset is defined in a `{}_datasets` variable, such as `afqmc_datasets` in `CLUE_afqmc/CLUE_afqmc_gen_db509b.py`:
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
path="./data/CLUE/AFQMC/dev.json",
reader_cfg=afqmc_reader_cfg,
infer_cfg=afqmc_infer_cfg,
eval_cfg=afqmc_eval_cfg,
),
]
```
Similarly, `cmnli_datasets` is defined in `CLUE_cmnli/CLUE_cmnli_ppl_b78ad4.py`:
```python
cmnli_datasets = [
    dict(
        type=HFDataset,
        abbr='cmnli',
        path='json',
        split='train',
        data_files='./data/CLUE/cmnli/cmnli_public/dev.json',
        reader_cfg=cmnli_reader_cfg,
        infer_cfg=cmnli_infer_cfg,
        eval_cfg=cmnli_eval_cfg)
]
```
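Note that these two examples load data differently: `AFQMCDatasetV2` reads the local JSON file at `path` directly, while `HFDataset` wraps the HuggingFace `datasets` loader, so `path='json'` selects the generic JSON loading script and `data_files` points at the local file.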
Taking these two datasets as an example: to evaluate both at the same time, create a new configuration file under the `configs` directory. We use the import mechanism of the `mmengine` configuration system to build the dataset part of the evaluation script, as shown below:
```python
from mmengine.config import read_base

with read_base():
    from .datasets.CLUE_afqmc.CLUE_afqmc_gen_db509b import afqmc_datasets
    from .datasets.CLUE_cmnli.CLUE_cmnli_ppl_b78ad4 import cmnli_datasets

datasets = []
datasets += afqmc_datasets
datasets += cmnli_datasets
```
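Assuming this file is saved as, say, `configs/eval_clue_demo.py` (a file name chosen here for illustration), the evaluation can then be launched with `python run.py configs/eval_clue_demo.py`.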
Users can choose configuration files for different abilities, datasets, and evaluation methods to assemble the dataset part of the evaluation script according to their needs.
For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.
### Multiple Evaluations on the Dataset
In the dataset configuration, you can set the parameter `n` to perform multiple evaluations on the same dataset and return the average metrics, for example:
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
Additionally, for binary evaluation metrics (such as accuracy, pass rate, etc.), you can also set the parameter `k` in conjunction with `n` to perform [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@k is:
```{math}
\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```
where $n$ is the total number of evaluation runs and $c$ is the number of those runs that passed (i.e., were correct).
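As a reference, the formula can be computed per problem with a few lines of plain Python (the function below is an illustrative sketch, not part of OpenCompass's API):

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Hypergeometric G-Pass@k_tau for a single problem.

    n:   total number of evaluation runs
    c:   number of correct (passing) runs among the n
    k:   number of runs drawn without replacement
    tau: minimum fraction of the k drawn runs that must be correct
    """
    # Probability that at least ceil(tau * k) of the k drawn runs are correct.
    return sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(ceil(tau * k), min(c, k) + 1)
    ) / comb(n, k)


# e.g. with n=12 runs of which c=6 were correct: g_pass_at_k(12, 6, 4, 0.5)
# gives the probability that a random subset of 4 runs has at least 2 correct.
```

An example dataset configuration is as follows: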
```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # return results for G-Pass@2 and G-Pass@4
        n=12,      # 12 evaluation runs
        ...
    )
]
```