mirror of https://github.com/open-compass/opencompass.git, synced 2025-05-30 16:03:24 +08:00
update eng doc for multi-run and g-pass
parent 8ebb8a5d11
commit c1fe59d015
@ -81,3 +81,38 @@ datasets += cmnli_datasets
Users can combine configuration files for different abilities, datasets, and evaluation methods to build the dataset portion of the evaluation script according to their needs.

For information on how to start an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.

### Multiple Evaluations on the Dataset

In the dataset configuration, you can set the parameter `n` to evaluate the same dataset `n` times and report the averaged metrics, for example:

```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```
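
To make the averaging concrete, here is a minimal sketch of what "run `n` times and average" could look like for an accuracy metric. It is not OpenCompass's implementation: `average_accuracy` and the per-run lists of booleans are hypothetical names and data used only for illustration.

```python
from statistics import mean


def average_accuracy(runs):
    """Average accuracy (in percent) over repeated evaluation runs.

    `runs` holds one list per run; each inner list marks whether each
    sample was answered correctly in that run. (Hypothetical helper.)
    """
    if not runs:
        raise ValueError("at least one run is required")
    per_run = [100.0 * sum(run) / len(run) for run in runs]
    return mean(per_run)


# Hypothetical results: n=3 runs over the same 4 samples.
runs = [
    [True, True, False, True],   # 75% correct
    [True, False, False, True],  # 50% correct
    [True, True, True, True],    # 100% correct
]
print(average_accuracy(runs))  # 75.0
```

Averaging per-run scores like this smooths out sampling noise when the model decodes non-deterministically.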

> [!TIP]
> Additionally, for binary evaluation metrics (such as accuracy or pass rate), you can set the parameter `k` together with `n` to run [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@$k$ is: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations, $c$ is the number of those $n$ runs that passed or were correct, and $\tau$ is the threshold on the fraction of the $k$ sampled runs that must be correct. An example configuration is as follows:

```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,  # 12 evaluations
        ...
    )
]
```
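
The G-Pass@$k$ formula can be evaluated directly with binomial coefficients. The sketch below follows the formula as written and is not taken from the OpenCompass source. The function names `g_pass_at_k` and `dataset_g_pass_at_k` and the sample counts are assumptions made for illustration. The sum's upper limit is capped at `min(c, k)`, since no more than `k` of the drawn runs can be correct.

```python
from math import ceil, comb


def g_pass_at_k(n, c, k, tau):
    """G-Pass@k_tau for one sample with n runs, c of them correct.

    This is the probability that at least ceil(tau * k) of k runs drawn
    without replacement from the n runs are correct (hypergeometric tail).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    threshold = ceil(tau * k)
    # j cannot exceed the k draws, so cap the formula's upper limit c at k.
    hits = sum(
        comb(c, j) * comb(n - c, k - j)
        for j in range(threshold, min(c, k) + 1)
    )
    return hits / comb(n, k)


def dataset_g_pass_at_k(correct_counts, n, k, tau):
    """Mean over samples, i.e. the expectation over the dataset."""
    total = sum(g_pass_at_k(n, c, k, tau) for c in correct_counts)
    return total / len(correct_counts)


# Hypothetical correct counts for three questions, each run n=12 times.
counts = [12, 3, 0]
print(dataset_g_pass_at_k(counts, n=12, k=4, tau=1.0))
```

With a threshold of one correct run (for example `tau=0.25` with `k=4`), the per-sample value reduces to the familiar Pass@$k$, $1 - {n - c \choose k} / {n \choose k}$. Because `tau * k` is a floating-point product, values of `tau` that are not exactly representable may need care before `ceil` is applied.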
@ -101,7 +101,7 @@ afqmc_datasets = [
```
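
> [!TIP]
> Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` in conjunction with `n` for [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@$k$ is: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left\[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations, and $c$ is the number of the $n$ runs that passed or were correct. An example configuration is as follows:
> Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can also set the parameter `k` in conjunction with `n` for [G-Pass@$k$](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@$k$ is: $$\text{G-Pass@}k_\tau=\mathbb{E}_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],$$ where $n$ is the number of evaluations, and $c$ is the number of the $n$ runs that passed or were correct. An example configuration is as follows:
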
```python
aime2024_datasets = [