Mirror of https://github.com/open-compass/opencompass.git (synced 2025-05-30 16:03:24 +08:00)
[Feature] Support Dataset Repeat and G-Pass Compute for Each Evaluator (#1886)
* support dataset repeat and g-pass compute for each evaluator
* fix pre-commit errors
* delete print
* delete gpassk_evaluator and fix potential errors
* change `repeat` to `n`
* fix `repeat` to `n` in openicl_eval
* update doc for multi-run and g-pass
* update latex equation in doc
* update eng doc for multi-run and g-pass
* update datasets.md
* fix multi-line equation
* fix multi-line equation in zh_cn user_guides
* modify pre-commit-zh-cn
* recover pre-commit and edit math expr in doc
* del [TIP]
* del cite tag in doc
* del extract_model param in livemathbench config
This commit is contained in:
parent
6042b88e58
commit
73c80953c6
@@ -81,3 +81,43 @@ datasets += cmnli_datasets

Users can select configuration files for different abilities, different datasets, and different evaluation methods as needed to build the dataset part of the evaluation script.

For information on how to launch an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.

### Multiple Evaluations on the Dataset

In the dataset configuration, you can set the parameter `n` to evaluate the same dataset multiple times and return the averaged metrics, for example:

```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```

Additionally, for binary evaluation metrics (such as accuracy, pass-rate, etc.), you can set the parameter `k` together with `n` to perform [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. The formula for G-Pass@k is:

```{math}
\text{G-Pass@}k_\tau=E_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```

where $n$ is the number of evaluations and $c$ is the number of runs among the $n$ that passed or were correct. An example configuration is as follows:

```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,  # 12 evaluations
        ...
    )
]
```
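For intuition, the inner sum above is the hypergeometric probability that at least $\lceil \tau \cdot k \rceil$ of $k$ runs sampled from the $n$ recorded runs are correct. A minimal sketch of that per-example computation (it mirrors the helper introduced in this commit; `scipy` is assumed, and the numbers are illustrative):

```python
import numpy as np
from scipy.stats import hypergeom


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """G-Pass@k_tau for one example: probability that at least
    ceil(tau * k) of k runs drawn from n runs (c of them correct)
    are correct."""
    m = max(int(np.ceil(k * tau)), 1)
    if m > min(c, k) or k > n or c < 0 or n <= 0:
        return 0.0
    # Survival function of the hypergeometric distribution:
    # population n, c successes, k draws -> P(X >= m).
    return hypergeom.sf(m - 1, n, c, k)


# e.g. 8 correct runs out of n=12; chance that >= 2 of k=4 sampled runs pass
print(g_pass_at_k(n=12, c=8, k=4, tau=0.5))
```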
@@ -81,3 +81,42 @@ datasets += cmnli_datasets

Users can select configuration files for different abilities, different datasets, and different evaluation methods as needed to build the dataset part of the evaluation script.

For how to launch an evaluation task and how to evaluate self-built datasets, please refer to the relevant documents.

### Multiple Evaluations on the Dataset

In the dataset configuration, you can set the parameter `n` to evaluate the same dataset multiple times; the averaged metrics are returned at the end, for example:

```python
afqmc_datasets = [
    dict(
        abbr="afqmc-dev",
        type=AFQMCDatasetV2,
        path="./data/CLUE/AFQMC/dev.json",
        n=10,  # Perform 10 evaluations
        reader_cfg=afqmc_reader_cfg,
        infer_cfg=afqmc_infer_cfg,
        eval_cfg=afqmc_eval_cfg,
    ),
]
```

In addition, for binary evaluation metrics (e.g. accuracy, pass-rate), you can set the parameter `k` together with `n` to perform [G-Pass@k](http://arxiv.org/abs/2412.13147) evaluation. G-Pass@k is computed as:

```{math}
\text{G-Pass@}k_\tau=E_{\text{Data}}\left[ \sum_{j=\lceil \tau \cdot k \rceil}^c \frac{{c \choose j} \cdot {n - c \choose k - j}}{{n \choose k}} \right],
```

where $n$ is the number of evaluations and $c$ is the number of runs among the $n$ that passed or were correct. An example configuration:

```python
aime2024_datasets = [
    dict(
        abbr='aime2024',
        type=Aime2024Dataset,
        path='opencompass/aime2024',
        k=[2, 4],  # Return results for G-Pass@2 and G-Pass@4
        n=12,  # 12 evaluations
        ...
    )
]
```
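A practical note on reading the results (based on the evaluator implementation introduced in this commit): when `n > 1`, each numeric metric is averaged over the runs and reported under a key with an `({n} runs average)` suffix, so a result dict can look like the following (the value is illustrative, not a real benchmark result):

```python
{
    'accuracy (10 runs average)': 67.3,  # illustrative value only
}
```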
@@ -9,7 +9,7 @@ livemathbench_dataset = dict(
     type=LiveMathBenchDataset,
     path='',
-    k=16,
-    replication=3,
+    n=48,
     dataset_splits=['CNMO', 'CCEE', 'AMC', 'WLPMC'],
     dataset_languages=['cn', 'en'],
     cot=True,
@@ -38,13 +38,7 @@ livemathbench_dataset = dict(
         evaluator=dict(
             type=LiveMathBenchEvaluator,
             model_name='',
-            url=[],
-            use_extract_model=False,
-            extract_url=[],
-            extract_model_name='',
-            k=[4, 8, 16],
-            replication=3,
-            thresholds=[0.0, 0.25, 0.5, 0.75, 1.0]
+            url=[]
         )
     )
 )
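Note: the replacement value follows the relation previously hard-coded in `GPassKEvaluator` (`self.n = max(k) * replication`), i.e. 16 × 3 = 48, so the total number of generations per example is unchanged.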
@@ -9,7 +9,7 @@ livemathbench_dataset = dict(
     type=LiveMathBenchDataset,
     path='',
-    k=1,
-    replication=1,
+    n=1,
     dataset_splits=['CNMO', 'CCEE', 'AMC', 'WLPMC'],
     dataset_languages=['cn', 'en'],
     cot=True,
@@ -38,13 +38,7 @@ livemathbench_dataset = dict(
         evaluator=dict(
             type=LiveMathBenchEvaluator,
             model_name='',
-            url=[],
-            use_extract_model=False,
-            extract_url=[],
-            extract_model_name='',
-            k=[1],
-            replication=1,
-            thresholds=[0.0]
+            url=[]
         )
     )
 )
@@ -1,4 +1,5 @@
-from typing import Dict, Optional, Union
+from copy import deepcopy
+from typing import Dict, List, Optional, Union
 
 from datasets import Dataset, DatasetDict
 
@@ -7,8 +8,39 @@ from opencompass.openicl import DatasetReader
 
 class BaseDataset:
 
-    def __init__(self, reader_cfg: Optional[Dict] = {}, **kwargs):
-        self.dataset = self.load(**kwargs)
+    def __init__(self,
+                 reader_cfg: Optional[Dict] = {},
+                 k: Union[int, List[int]] = 1,
+                 n: int = 1,
+                 **kwargs):
+        abbr = kwargs.pop('abbr', 'dataset')
+        dataset = self.load(**kwargs)
+        # maybe duplicate
+        assert (max(k) if isinstance(k, List) else
+                k) <= n, 'Maximum value of `k` must be less than or equal to `n`'
+        if isinstance(dataset, Dataset):
+            examples = []
+            for idx, example in enumerate(dataset):
+                if 'subdivision' not in example:
+                    example['subdivision'] = abbr
+                if 'idx' not in example:
+                    example['idx'] = idx
+                examples.append(example)
+            examples = sum([deepcopy(examples) for _ in range(n)], [])
+            self.dataset = Dataset.from_list(examples)
+        else:
+            self.dataset = DatasetDict()
+            for key in dataset:
+                examples = []
+                for idx, example in enumerate(dataset[key]):
+                    if 'subdivision' not in example:
+                        example['subdivision'] = f'{abbr}_{key}'
+                    if 'idx' not in example:
+                        example['idx'] = idx
+                    examples.append(example)
+                print(abbr, key, len(examples))
+                examples = sum([deepcopy(examples) for _ in range(n)], [])
+                self.dataset[key] = Dataset.from_list(examples)
         self._init_reader(**reader_cfg)
 
     def _init_reader(self, **kwargs):
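To make the duplication above concrete, here is a small standalone sketch (the example data is illustrative; the HuggingFace `datasets` package is assumed): each example is tagged with `subdivision` and `idx`, and the whole list is then repeated `n` times, so downstream code can regroup the `n` replications of each example by `subdivision_idx`.

```python
from copy import deepcopy
from datasets import Dataset

examples = [{'idx': 0, 'question': 'q0'}, {'idx': 1, 'question': 'q1'}]
n = 3
# Repeat the full example list n times, as BaseDataset.__init__ does.
repeated = sum([deepcopy(examples) for _ in range(n)], [])
ds = Dataset.from_list(repeated)
print(len(ds))    # 6 == len(examples) * n
print(ds['idx'])  # [0, 1, 0, 1, 0, 1]
```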
@@ -1,11 +1,9 @@
 import os
 import warnings
-from collections import OrderedDict
 from concurrent.futures import ThreadPoolExecutor, as_completed
-from copy import deepcopy
 from functools import partial
 from itertools import product
-from typing import Any, Callable, Dict, List, Union
+from typing import Any, Callable, Dict, List
 
 import jsonlines
 import mmengine
@@ -14,7 +12,7 @@ from datasets import Dataset, load_dataset
 
 from opencompass.datasets.math import MATHAgentEvaluator, math_postprocess_v2
 from opencompass.models import OpenAISDK
-from opencompass.openicl.icl_evaluator import GPassKEvaluator
+from opencompass.openicl.icl_evaluator import BaseEvaluator
 from opencompass.openicl.icl_inferencer.icl_base_inferencer import \
     dump_results_dict
 from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET, MODELS
@@ -31,8 +29,6 @@ class LiveMathBenchDataset(BaseDataset):
 
     @staticmethod
     def load(path: str,
-             k: Union[int, List[int]],
-             replication: int,
              dataset_splits: List[str] = [
                  'CNMO',
                  'CCEE',
@@ -104,17 +100,13 @@
                          ('' if 'options' not in example else
                           ' '.join(example['options']))),
                 })
-                max_k = k if isinstance(k, int) else max(k)
-                for idx in range(max_k * replication):
-                    duplicated_example = deepcopy(example)
-                    duplicated_example.update({'replication_idx': idx})
-                    dataset.append(duplicated_example)
+                dataset.append(example)
 
         return Dataset.from_list(dataset)
 
 
 @ICL_EVALUATORS.register_module()
-class LiveMathBenchEvaluator(GPassKEvaluator):
+class LiveMathBenchEvaluator(BaseEvaluator):
     api_meta_template = dict(round=[
         dict(role='HUMAN', api_role='HUMAN'),
         dict(role='BOT', api_role='BOT', generate=True),
@@ -125,11 +117,8 @@ class LiveMathBenchEvaluator(GPassKEvaluator):
                 url,
                 use_extract_model=False,
                 extract_url=[],
-                extract_model_name='',
-                k: Union[int, List[int]] = 16,
-                replication: int = 3,
-                thresholds: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0]):
-        super().__init__(k, replication, thresholds)
+                extract_model_name=''):
+        super().__init__()
 
         if isinstance(url, str):
             url = [url]
@@ -310,55 +299,18 @@
     def preprocess(self, predictions, references, test_set):
         return self.judge(predictions, references, test_set)
 
-    def group(self, predictions, labels, test_set):
-        example2replications = {}
-        for example, label, prediction in zip(test_set, labels, predictions):
-            example_abbr = f"{example['subdivision']}_{example['idx']}"
-            if example_abbr not in example2replications:
-                example2replications[example_abbr] = []
-            example.update({'prediction': prediction, 'label': label})
-            example2replications[example_abbr].append(example)
-        for _, replications in example2replications.items():
-            assert len(replications) == self.n, print(len(replications),
-                                                      self.n)
-        return example2replications
+    def score(self, predictions, references, test_set) -> Dict[str, Any]:
+        labels = self.preprocess(predictions, references, test_set)
+        results = {'accuracy': 100 * np.mean(labels), 'details': []}
 
-    def reduce(self, details) -> Dict[str, Any]:
-        """Aggregate the overall metrics.
+        for pred, ref, label in zip(predictions, references, labels):
+            results['details'].append({
+                'pred': pred,
+                'ref': ref,
+                'correct': label
+            })
 
-        Return:
-            A dict contains overall metrics, like:
-            {'details': details for each example, 'G-Pass@16': xxx}
-        """
-        g_passk_details = OrderedDict()
-        g_passk_details['details'] = details
-
-        all_dataset = set([detail['subdivision'] for detail in details])
-
-        for k in self.k:
-            for subdivision in sorted(list(all_dataset)):
-                for threshold in self.thresholds:
-                    g_passk_details[
                        f'{subdivision}/G-Pass@{k}_{threshold}'] = \
-                        100. * np.mean(
-                            [
-                                detail[f'G-Pass@{k}_{threshold}']
-                                for detail in details
-                                if detail['subdivision'] == subdivision
-                            ])
-                g_passk_details[f'{subdivision}/mG-Pass@{k}'] = 100. * np.mean(
-                    [
-                        detail[f'mG-Pass@{k}'] for detail in details
-                        if detail['subdivision'] == subdivision
-                    ])
-
-            for threshold in self.thresholds:
-                g_passk_details[f'G-Pass@{k}_{threshold}'] = 100. * np.mean(
-                    [detail[f'G-Pass@{k}_{threshold}'] for detail in details])
-            g_passk_details[f'mG-Pass@{k}'] = 100. * np.mean(
-                [detail[f'mG-Pass@{k}'] for detail in details])
-
-        return g_passk_details
+        return results
 
 
 class LiveMathBenchOutputHandler:
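With this change, per-run scoring and G-Pass@k aggregation are decoupled: an evaluator only needs to return per-example `details` carrying a `correct` (or `is_correct`) label, and `BaseEvaluator.evaluate` regroups the `n` replications and computes G-Pass@k itself. A minimal sketch of an evaluator satisfying that contract (a hypothetical example for illustration, not code from this commit):

```python
from typing import Any, Dict, List

from opencompass.openicl.icl_evaluator import BaseEvaluator


class ExactMatchEvaluator(BaseEvaluator):
    """Hypothetical evaluator: exact-match accuracy per run, emitting
    per-example 'correct' labels that evaluate() can aggregate."""

    def score(self, predictions: List[str],
              references: List[str]) -> Dict[str, Any]:
        details = []
        for pred, ref in zip(predictions, references):
            details.append({
                'pred': pred,
                'ref': ref,
                # 'correct' is the label BaseEvaluator.evaluate looks for
                'correct': pred.strip() == ref.strip(),
            })
        correct = sum(d['correct'] for d in details)
        return {
            'accuracy': 100 * correct / max(len(details), 1),
            'details': details,
        }
```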
@@ -4,7 +4,6 @@ from .icl_base_evaluator import BaseEvaluator  # noqa
 from .icl_bpc_evaluator import BPCEvaluator  # noqa
 from .icl_circular_evaluator import CircularEvaluator  # noqa
 from .icl_em_evaluator import EMEvaluator  # noqa
-from .icl_gpassk_evaluator import GPassKEvaluator  # noqa
 from .icl_hf_evaluator import *  # noqa
 from .icl_jieba_rouge_evaluator import JiebaRougeEvaluator  # noqa
 from .icl_misc_evaluator import AverageInferencePPLEvaluator  # noqa
@@ -1,4 +1,39 @@
 """Base Evaluator."""
+from collections import OrderedDict
+from copy import deepcopy
+from typing import Any, Dict, Iterable, List, Union
+
+import numpy as np
+from datasets import Dataset
+from scipy.stats import hypergeom
+
+
+def compute_pass_at_k(n, c, k):
+    if n - c < k:
+        return 1.0
+    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
+
+
+def _compute_g_pass_at_k(n, c, k, m):
+    if m > min(c, k) or k > n or c < 0 or n <= 0 or m < 0:
+        return 0.0
+    return hypergeom.sf(m - 1, n, c, k)
+
+
+def compute_g_pass_at_k(n, c, k, t):
+    m = max(int(np.ceil(k * t)), 1)
+    return _compute_g_pass_at_k(n, c, k, m)
+
+
+def compute_mg_pass_at_k(n, c, k):
+    l, r = int(np.ceil(k * 0.5)), k
+
+    mg_pass_at_k = 0.0
+    for i in range(l + 1, r + 1):
+        mg_pass_at_k += _compute_g_pass_at_k(n, c, k, i)
+    mg_pass_at_k = 2 * mg_pass_at_k / k
+
+    return mg_pass_at_k
+
+
 class BaseEvaluator:
@@ -6,6 +41,130 @@ class BaseEvaluator:
 
     def __init__(self) -> None:
         pass
 
+    @property
+    def output_dir(self):
+        # please see opencompass/opencompass/tasks/openicl_eval.py Line 197-200
+        return self._out_dir
+
+    def group(self, n: int, details: List[Dict[str, Any]],
+              test_set: Dataset) -> Dict[str, Any]:
+        example2replications = {}
+        for detail, example in zip(details, test_set):
+            example_abbr = f"{example['subdivision']}_{example['idx']}"
+            if example_abbr not in example2replications:
+                example2replications[example_abbr] = []
+            example.update({'detail': detail})
+            example2replications[example_abbr].append(example)
+        for _, replications in example2replications.items():
+            assert len(replications) == n, print(len(replications), n)
+        return example2replications
+
+    def reduce(self, details: List[Dict[str, Any]]) -> Dict[str, Any]:
+        g_passk_details = OrderedDict()
+        all_subdivisions = set(
+            [detail['example_abbr'].split('_')[0] for detail in details])
+        all_metrics = list(details[0].keys())
+
+        for subdivision in sorted(list(all_subdivisions)):
+            for metric in all_metrics:
+                if metric in ['predictions', 'example_abbr']:
+                    continue
+                g_passk_details[f'{subdivision}/{metric}'] = 100 * np.mean([
+                    detail[metric] for detail in details
+                    if detail['example_abbr'].split('_')[0] == subdivision
+                ])
+
+        for metric in all_metrics:
+            if metric in ['predictions', 'example_abbr']:
+                continue
+            g_passk_details[metric] = 100. * np.mean(
+                [detail[metric] for detail in details])
+        return g_passk_details
+
+    def evaluate(self, k: Union[int, List[int]], n: int,
+                 original_dataset: Dataset, **score_kwargs):
+        real_size = len(original_dataset) // n
+        all_details = []
+        all_results = []
+        for i in range(n):
+
+            def select_fn(i, real_size, x):
+                if isinstance(x, Dataset):
+                    return x.select(range(i * real_size, (i + 1) * real_size))
+                elif isinstance(x, Iterable):
+                    return x[i * real_size:(i + 1) * real_size]
+                else:
+                    return x
+
+            results = self.score(
+                **{
+                    key: select_fn(i, real_size, value)
+                    for key, value in score_kwargs.items()
+                })
+            details = results.pop('details', None)
+            if details is not None:
+                if isinstance(details, Dict):
+                    details = list(details.values())
+                all_details.extend(details)
+            all_results.append(results)
+
+        eval_results = {}
+        for single_results in all_results:
+            for key in single_results:
+                if key not in eval_results:
+                    eval_results[key] = []
+                eval_results[key].append(single_results[key])
+        for key in deepcopy(eval_results):
+            if isinstance(eval_results[key][0], float) or isinstance(
+                    eval_results[key][0], int):
+                if n > 1:
+                    eval_results[key + f' ({n} runs average)'] = np.mean(
+                        eval_results[key])
+                    eval_results.pop(key)
+                else:
+                    eval_results[key] = np.mean(eval_results[key])
+            else:
+                eval_results[key] = eval_results[key][0]
+
+        grouped_examples = self.group(n, all_details, original_dataset)
+        can_calculate = False
+        if len(all_details) != 0:
+            eval_details = []
+            for example_abbr, examples in grouped_examples.items():
+                detail = {'predictions': [], 'example_abbr': example_abbr}
+
+                c = 0
+                for example in examples:
+                    detail['predictions'].append(example['detail'])
+                    # only compute G-Pass@k when details have correct labels
+                    if example['detail'].get('correct', None) is not None:
+                        can_calculate = True
+                        c += int(example['detail']['correct'])
+                    elif example['detail'].get('is_correct', None) is not None:
+                        can_calculate = True
+                        c += int(example['detail']['is_correct'])
+
+                if can_calculate and n > 1 and k > 1:
+                    thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
+                    for _k in ([k] if isinstance(k, int) else k):
+                        for threshold in thresholds:
+                            g_pass = compute_g_pass_at_k(n=n,
+                                                         c=c,
+                                                         k=_k,
+                                                         t=threshold)
+                            detail[f'G-Pass@{_k}_{threshold}'] = g_pass
+                        detail[f'mG-Pass@{_k}'] = compute_mg_pass_at_k(n=n,
+                                                                       c=c,
+                                                                       k=_k)
+
+                eval_details.append(detail)
+
+            if can_calculate and n > 1 and k > 1:
+                eval_results.update(self.reduce(eval_details))
+            eval_results['details'] = eval_details
+
+        return eval_results
+
     def score(self):
         raise NotImplementedError("Method hasn't been implemented yet")
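For reference, `compute_mg_pass_at_k` above corresponds to the following closed form (derived from the code; $X$ is the hypergeometric number of correct runs among $k$ drawn from the $n$ recorded runs, $c$ of which are correct):

```{math}
\text{mG-Pass@}k = \frac{2}{k} \sum_{m=\lceil 0.5 \cdot k \rceil + 1}^{k} P(X \ge m), \quad X \sim \mathrm{Hypergeometric}(n, c, k).
```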
@@ -1,163 +0,0 @@
-from abc import abstractmethod
-from typing import Any, Dict, List, Union
-
-import numpy as np
-from scipy.stats import hypergeom
-
-from opencompass.registry import ICL_EVALUATORS
-
-from .icl_base_evaluator import BaseEvaluator
-
-
-def compute_pass_at_k(n, c, k):
-    if n - c < k:
-        return 1.0
-    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
-
-
-def _compute_g_pass_at_k(n, c, k, m):
-    if m > min(c, k) or k > n or c < 0 or n <= 0 or m < 0:
-        return 0.0
-    return hypergeom.sf(m - 1, n, c, k)
-
-
-def compute_g_pass_at_k(n, c, k, t):
-    m = max(int(np.ceil(k * t)), 1)
-    return _compute_g_pass_at_k(n, c, k, m)
-
-
-def compute_mg_pass_at_k(n, c, k):
-    l, r = int(np.ceil(k * 0.5)), k
-
-    mg_pass_at_k = 0.0
-    for i in range(l + 1, r + 1):
-        mg_pass_at_k += _compute_g_pass_at_k(n, c, k, i)
-    mg_pass_at_k = 2 * mg_pass_at_k / k
-
-    return mg_pass_at_k
-
-
-@ICL_EVALUATORS.register_module()
-class GPassKEvaluator(BaseEvaluator):
-    """Evaluator for computing the G-Pass@k Metric.
-
-    This evaluator performs the following steps:
-    1. Invokes task-specific `preprocess` on predictions to
-       assign a consistency label to each prediction and its
-       corresponding reference.
-    2. Calculates metrics for each input example based on
-       these labels.
-    3. Aggregates the overall metrics through a task-specific
-       `postprocess`.
-
-    Args:
-        k (int or list of int): Number of predictions to be
-            considered in G-Pass@k. It can be a single integer
-            (e.g., `k=16` computes G-Pass@16) or a list of
-            integers (e.g., `[4, 8, 16]` computes G-Pass@4,
-            G-Pass@8, and G-Pass@16).
-
-        replication (int): Controls the number of generations
-            used to estimate G-Pass@k. The total number of
-            generations is determined by multiplying the
-            maximum of `k` with `replication`. This parameter
-            should be a single integer.
-
-        thresholds (list of float): A list of floating-point
-            numbers that define the thresholds for the G-Pass@k
-            metric.
-    """
-
-    def __init__(
-            self,
-            k: Union[int, List[int]] = 16,
-            replication: int = 3,
-            thresholds: List[float] = [0.0, 0.25, 0.5, 0.75, 1.0]) -> None:
-        super().__init__()
-
-        if isinstance(k, int):
-            k = [k]
-
-        self.k = k
-        self.replication = replication
-        self.n = max(k) * replication
-        self.thresholds = thresholds
-
-    @property
-    def output_dir(self):
-        # please see opencompass/opencompass/tasks/openicl_eval.py Line 197-200
-        return self._out_dir
-
-    @abstractmethod
-    def preprocess(self, predictions, references, test_set) -> None:
-        """Perform operations on predictions before computing metrics, for
-        example, do answer_extraction and model_judge in mathematical reasoning
-        task.
-
-        Return:
-            labels: A list contains the label which indicates whether
-            prediction is consistency with reference at each position.
-        """
-        raise NotImplementedError
-
-    @abstractmethod
-    def group(self, predictions, labels, test_set) -> Dict[str, Any]:
-        """Group the predictions and references.
-
-        Return:
-            A dict contains the grouped predictions and references.
-        """
-        raise NotImplementedError
-
-    @abstractmethod
-    def reduce(self, details) -> Dict[str, Any]:
-        """Aggregate the overall metrics.
-
-        Return:
-            A dict contains overall metrics, like:
-            {'details': details for each example, 'G-Pass@16': xxx}
-        """
-        raise NotImplementedError
-
-    def score(self, predictions, references, test_set) -> Dict[str, Any]:
-        """Compute G-Pass@k metrics.
-
-        Return:
-            A dict contains metrics for each dataset sample and
-            overall metrics reduced by `self.reduce`, like:
-            {'details': details for each example, 'G-Pass@16': xxx}
-        """
-        labels = self.preprocess(predictions, references, test_set)
-        grouped_examples = self.group(predictions, labels, test_set)
-
-        details = []
-        total_pass_num, count = 0, 0
-        for example_abbr, examples in grouped_examples.items():
-            detail = {
-                k: v
-                for k, v in examples[0].items()
-                if k not in ['prediction', 'label']
-            }
-            detail.update({
-                'predictions': [{
-                    'prediction': example['prediction'],
-                    'label': example['label']
-                } for example in examples],
-            })
-
-            current_example_labels = [e['label'] for e in examples]
-            c = int(np.sum(current_example_labels))
-
-            for k in self.k:
-                for threshold in self.thresholds:
-                    detail[f'G-Pass@{k}_{threshold}'] = compute_g_pass_at_k(
-                        n=self.n, c=c, k=k, t=threshold)
-                detail[f'mG-Pass@{k}'] = compute_mg_pass_at_k(n=self.n,
-                                                              c=c,
-                                                              k=k)
-            count += self.n
-            total_pass_num += c
-
-            details.append(detail)
-
-        return self.reduce(details)
@@ -240,7 +240,10 @@ class OpenICLEvalTask(BaseTask):
                 k: preds[k]
                 for k in signature(icl_evaluator.score).parameters
             }
-            result = icl_evaluator.score(**preds)
+            k = self.dataset_cfg.get('k', 1)
+            n = self.dataset_cfg.get('n', 1)
+            result = icl_evaluator.evaluate(k, n, copy.deepcopy(test_set),
+                                            **preds)
 
             # Get model postprocess result
             model_details = None
@@ -248,7 +251,9 @@ class OpenICLEvalTask(BaseTask):
             if 'model_postprocessor' in self.eval_cfg:
                 model_preds = copy.deepcopy(preds)
                 model_preds['predictions'] = model_pred_strs
-                model_result = icl_evaluator.score(**model_preds)
+                model_result = icl_evaluator.evaluate(k, n,
+                                                      copy.deepcopy(test_set),
+                                                      **model_preds)
                 for key in model_result:
                     if key == 'details':
                         model_details = model_result[key]
@@ -9,7 +9,6 @@ def build_dataset_from_cfg(dataset_cfg: ConfigDict):
     dataset_cfg = copy.deepcopy(dataset_cfg)
     dataset_cfg.pop('infer_cfg', None)
    dataset_cfg.pop('eval_cfg', None)
-    dataset_cfg.pop('abbr', None)
     return LOAD_DATASET.build(dataset_cfg)