update eng doc for multi-run and g-pass

jnanliu 2025-02-25 09:15:08 +00:00
parent 8ebb8a5d11
commit c1fe59d015
2 changed files with 36 additions and 1 deletions


@ -81,3 +81,38 @@ datasets += cmnli_datasets
Users can combine configuration files for different abilities, datasets, and evaluation methods to build the dataset portion of the evaluation script according to their needs.
For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.
### Multiple Evaluations on the Dataset
In the dataset configuration, you can set the parameter `n` to perform multiple evaluations on the same dataset and return the average metrics, for example:
```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
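As a rough illustration of what "average metrics" means here, the sketch below averages a per-run accuracy over `n` runs. The function name `mean_accuracy` and the input layout (one list of pass/fail flags per run) are assumptions made for this example; this is not OpenCompass's internal implementation.

```python
from statistics import mean


def mean_accuracy(runs: list[list[bool]]) -> float:
    """Average per-run accuracy (in percent) over all runs.

    `runs[i][q]` is True when question `q` was answered correctly in run `i`.
    """
    if not runs:
        raise ValueError("at least one run is required")
    # Accuracy of each individual run, then the mean across runs.
    per_run = [100.0 * sum(run) / len(run) for run in runs]
    return mean(per_run)


# Three hypothetical runs over the same four questions.
example_runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
print(mean_accuracy(example_runs))
```

Averaging over several runs mainly reduces the variance that sampling-based decoding introduces into a single-run score.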
> [!TIP]
> Additionally, for binary evaluation metrics (such as accuracy or pass-rate), you can set the parameter `k` together with `n` to run a [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. G-Pass@$k$ is defined as: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations, $c$ is the number of those $n$ runs that passed or were correct, and $\tau \in (0, 1]$ is the threshold on the fraction of the $k$ sampled runs that must be correct. An example configuration is as follows:
```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,  # 12 evaluations
        ...
    )
]
```
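To make the G-Pass@$k$ formula concrete, here is a minimal sketch that evaluates it directly with `math.comb`. The function names `g_pass_at_k` and `mean_g_pass_at_k` and their signatures are hypothetical, chosen for this example; the evaluator that ships with OpenCompass may be organized differently.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """G-Pass@k_tau for one question.

    n: number of runs, c: number of correct runs among them,
    k: number of runs sampled, tau: required fraction of correct samples.
    """
    if not (0 <= c <= n and 1 <= k <= n and 0 < tau <= 1):
        raise ValueError("need 0 <= c <= n, 1 <= k <= n and 0 < tau <= 1")
    j_min = ceil(tau * k)
    # Hypergeometric tail: probability that at least j_min of k runs drawn
    # without replacement from the n runs are correct. The upper limit is
    # capped at min(c, k) because a sample of k runs holds at most k correct ones.
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(j_min, min(c, k) + 1))
    return hits / comb(n, k)


def mean_g_pass_at_k(correct_counts: list[int], n: int, k: int, tau: float) -> float:
    """Dataset-level score: the expectation over questions (a plain mean)."""
    return sum(g_pass_at_k(n, c, k, tau) for c in correct_counts) / len(correct_counts)


# One question answered correctly in 6 of 12 runs, scored with k=4.
print(g_pass_at_k(n=12, c=6, k=4, tau=0.5))  # at least 2 of 4 samples correct
print(g_pass_at_k(n=12, c=6, k=4, tau=1.0))  # all 4 samples correct
```

When $\tau$ is small enough that $\lceil \tau \cdot k \rceil = 1$, the sum covers every sample containing at least one correct run, so the value equals $1 - {n - c \choose k} / {n \choose k}$, the usual Pass@$k$ estimate.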


@ -101,7 +101,7 @@ afqmc_datasets = [
```
> [!TIP]
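> Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` together with `n` to run a [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. G-Pass@$k$ is computed as: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left\[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations and $c$ is the number of runs that passed or were correct out of the $n$ runs. An example configuration is as follows:
> Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` together with `n` to run a [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. G-Pass@$k$ is computed as: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations and $c$ is the number of runs that passed or were correct out of the $n$ runs. An example configuration is as follows: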
```python
aime2024_datasets = [