[Bug] Commonsenseqa dataset fix (#1425)

* longbench dataset load fix * update * Update * Update * Update * update * update --------- Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2025-05-30 16:03:24 +08:00 · 2024-08-16 15:54:07 +08:00 · 2024-08-16 15:54:07 +08:00 · ecf9bb3e4c
commit ecf9bb3e4c
parent 9b3613f10b
8 changed files with 370 additions and 68 deletions
--- a/.pre-commit-config-zh-cn.yaml
+++ b/.pre-commit-config-zh-cn.yaml
@ -99,9 +99,9 @@ repos:
        name: dataset suffix updater(package)
        entry: ./tools/update_dataset_suffix.py
        language: script
-        pass_filenames: true
-        require_serial: true
-        files: ^opencompass/configs/datasets
+        pass_filenames: false
+        # require_serial: true
+        # files: ^opencompass/configs/datasets
        args:
          - --root_folder
          - opencompass/configs/datasets
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -102,9 +102,9 @@ repos:
        name: dataset suffix updater(package)
        entry: ./tools/update_dataset_suffix.py
        language: script
-        pass_filenames: true
-        require_serial: true
-        files: ^opencompass/configs/datasets
+        pass_filenames: false
+        # require_serial: true
+        # files: ^opencompass/configs/datasets
        args:
          - --root_folder
          - opencompass/configs/datasets
--- a/configs/datasets/commonsenseqa/commonsenseqa_7shot_cot_gen_734a22.py
+++ b/configs/datasets/commonsenseqa/commonsenseqa_7shot_cot_gen_734a22.py
@ -0,0 +1,115 @@
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+from opencompass.openicl.icl_inferencer import GenInferencer
+from opencompass.openicl.icl_evaluator import AccEvaluator
+from opencompass.datasets import commonsenseqaDataset
+from opencompass.utils.text_postprocessors import (
+    match_answer_pattern,
+)
+
+commonsenseqa_reader_cfg = dict(
+    input_columns=['question', 'A', 'B', 'C', 'D', 'E'],
+    output_column='answerKey',
+    test_split='validation',
+)
+
+_ice_template = dict(
+    type=PromptTemplate,
+    template=dict(
+        begin='</E>',
+        round=[
+            dict(
+                role='HUMAN',
+                prompt='Q: What do people use to absorb extra ink from a fountain pen? Answer Choices: A.shirt pocket B.calligrapher’s hand C.inkwell D.desk drawer E.blotter',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be an item that can absorb ink. Of the above choices, only blotters are used to absorb ink. So the answer is E.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: What home entertainment equipment requires cable?Answer Choices: A.radio shack B.substation C.television D.cabinet',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must require cable. Of the above choices, only television requires cable. So the answer is C.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: The fox walked from the city into the forest, what was it looking for? Answer Choices: A.pretty flowers B.hen house C.natural habitat D.storybook',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be something in the forest. Of the above choices, only natural habitat is in the forest. So the answer is B.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Sammy wanted to go to where the people were. Where might he go? Answer Choices: A.populated areas B.race track C.desert D.apartment E.roadblock',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be a place with a lot of people. Of the above choices, only populated areas have a lot of people. So the answer is A.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Where do you put your grapes just before checking out? Answer Choices: A.mouth B.grocery cart Csuper market D.fruit basket E.fruit market',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. So the answer is B.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Google Maps and other highway and street GPS services have replaced what? Answer Choices: A.united states B.mexico C.countryside D.atlas',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be something that used to do what Google Maps and GPS services do, which is to give directions. Of the above choices, only atlases are used to give directions. So the answer is D.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Before getting a divorce, what did the wife feel who was doing all the work? Answer Choices: A.harder B.anguish C.bitterness D.tears E.sadness',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer should be the feeling of someone getting divorced who was doing all the work. Of the above choices, the closest feeling is bitterness. So the answer is C.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q:{question}  Answer Choices: A. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}\nA:',
+            ),
+            dict(
+                role='BOT',
+                prompt='{answerKey}',
+            ),
+        ],
+    ),
+    ice_token='</E>',
+)
+
+commonsenseqa_infer_cfg = dict(
+    ice_template=_ice_template,
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+
+commonsenseqa_eval_cfg = dict(
+    evaluator=dict(type=AccEvaluator),
+    pred_postprocessor=dict(
+        type=match_answer_pattern, answer_pattern=r'(?i)so the answer is\s*([A-P])'
+    ),
+)
+
+
+commonsenseqa_datasets = [
+    dict(
+        abbr='commonsense_qa',
+        type=commonsenseqaDataset,
+        path='opencompass/commonsense_qa',
+        reader_cfg=commonsenseqa_reader_cfg,
+        infer_cfg=commonsenseqa_infer_cfg,
+        eval_cfg=commonsenseqa_eval_cfg,
+    )
+]
+
+del _ice_template
--- a/configs/datasets/truthfulqa/truthfulqa_gen_5ddc62.py
+++ b/configs/datasets/truthfulqa/truthfulqa_gen_5ddc62.py
@ -18,19 +18,16 @@ truthfulqa_infer_cfg = dict(
    inferencer=dict(type=GenInferencer))

 # Metrics such as 'truth' and 'info' needs
-# OPENAI_API_KEY with finetuned models in it.
-# Please use your own finetuned openai model with keys and refers to
+# extra judge models.
+# Please use your own finetuned model  and refers to
 # the source code of `TruthfulQAEvaluator` for more details.
-#
+
 # If you cannot provide available models for 'truth' and 'info',
 # and want to perform basic metric eval, please set
 # `metrics=('bleurt', 'rouge', 'bleu')`
-
-# When key is set to "ENV", the key will be fetched from the environment
-# variable $OPENAI_API_KEY. Otherwise, set key in here directly.
 truthfulqa_eval_cfg = dict(
    evaluator=dict(
-        type=TruthfulQAEvaluator, metrics=('truth', 'info'), key='ENV'), )
+        type=TruthfulQAEvaluator, metrics=('rouge'), key='ENV'), )

 truthfulqa_datasets = [
    dict(
--- a/configs/eval_edgellm_demo.py
+++ b/configs/eval_edgellm_demo.py
@ -0,0 +1,59 @@
+from mmengine.config import read_base
+
+with read_base():
+    # datasets
+    from .datasets.commonsenseqa.commonsenseqa_7shot_cot_gen_734a22 import (
+        commonsenseqa_datasets,
+    )
+    from .datasets.longbench.longbench import longbench_datasets
+    from .datasets.bbh.bbh_gen import bbh_datasets
+    from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets
+    from .datasets.humaneval.humaneval_gen import humaneval_datasets
+    from .datasets.FewCLUE_chid.FewCLUE_chid_gen import chid_datasets
+    from .datasets.truthfulqa.truthfulqa_gen import truthfulqa_datasets
+
+    # models
+    from .models.hf_llama.hf_llama3_8b import models as hf_llama3_8b_model
+    from .models.qwen.hf_qwen2_7b import models as hf_qwen2_7b_model
+    from .models.others.hf_phi_2 import models as hf_phi_2_model
+
+datasets = sum(
+    [v for k, v in locals().items() if k.endswith('_datasets') or k == 'datasets'], []
+)
+models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
+work_dir = './outputs/edgellm/'
+
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# dataset                                      version    metric            mode      phi-2_hf
+# -------------------------------------------  ---------  ----------------  ------  ----------
+# commonsense_qa                               c946f2     accuracy          gen          65.19
+# openai_humaneval                             8e312c     humaneval_pass@1  gen          30.49
+# truthful_qa                                  5ddc62     rouge_max         gen           0.08
+# truthful_qa                                  5ddc62     rouge_diff        gen          -0.00
+# truthful_qa                                  5ddc62     rouge_acc         gen           0.41
+# gsm8k                                        1d7fe4     accuracy          gen          62.40
+# chid-dev                                     211ee7     accuracy          gen          12.87
+# chid-test                                    211ee7     accuracy          gen          14.34
+# bbh                                          -          naive_average     gen          59.50
+
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# dataset                                      version    metric            mode      Meta-Llama-3-8B_hf
+# -------------------------------------------  ---------  ----------------  ------  --------------------
+# commonsense_qa                               c946f2     accuracy          gen                     70.11
+# openai_humaneval                             8e312c     humaneval_pass@1  gen                    26.22
+# truthful_qa                                  5ddc62     rouge_max         gen                     0.07
+# truthful_qa                                  5ddc62     rouge_diff        gen                    -0.01
+# truthful_qa                                  5ddc62     rouge_acc         gen                     0.41
+# gsm8k                                        1d7fe4     accuracy          gen                    55.80
+# chid-dev                                     211ee7     accuracy          gen                    40.59
+# chid-test                                    211ee7     accuracy          gen                    36.66
+# bbh                                          -          naive_average     gen                    61.62
+# 20240816_060452
+# tabulate format
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# dataset         version    metric      mode      qwen2-7b-hf
+# --------------  ---------  ----------  ------  -------------
+# commonsense_qa  734a22     accuracy    gen             65.19
+# truthful_qa     5ddc62     rouge_max   gen              0.08
+# truthful_qa     5ddc62     rouge_diff  gen             -0.02
+# truthful_qa     5ddc62     rouge_acc   gen              0.44
--- a/opencompass/configs/datasets/commonsenseqa/commonsenseqa_7shot_cot_gen_734a22.py
+++ b/opencompass/configs/datasets/commonsenseqa/commonsenseqa_7shot_cot_gen_734a22.py
@ -0,0 +1,115 @@
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+from opencompass.openicl.icl_inferencer import GenInferencer
+from opencompass.openicl.icl_evaluator import AccEvaluator
+from opencompass.datasets import commonsenseqaDataset
+from opencompass.utils.text_postprocessors import (
+    match_answer_pattern,
+)
+
+commonsenseqa_reader_cfg = dict(
+    input_columns=['question', 'A', 'B', 'C', 'D', 'E'],
+    output_column='answerKey',
+    test_split='validation',
+)
+
+_ice_template = dict(
+    type=PromptTemplate,
+    template=dict(
+        begin='</E>',
+        round=[
+            dict(
+                role='HUMAN',
+                prompt='Q: What do people use to absorb extra ink from a fountain pen? Answer Choices: A.shirt pocket B.calligrapher’s hand C.inkwell D.desk drawer E.blotter',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be an item that can absorb ink. Of the above choices, only blotters are used to absorb ink. So the answer is E.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: What home entertainment equipment requires cable?Answer Choices: A.radio shack B.substation C.television D.cabinet',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must require cable. Of the above choices, only television requires cable. So the answer is C.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: The fox walked from the city into the forest, what was it looking for? Answer Choices: A.pretty flowers B.hen house C.natural habitat D.storybook',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be something in the forest. Of the above choices, only natural habitat is in the forest. So the answer is B.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Sammy wanted to go to where the people were. Where might he go? Answer Choices: A.populated areas B.race track C.desert D.apartment E.roadblock',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be a place with a lot of people. Of the above choices, only populated areas have a lot of people. So the answer is A.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Where do you put your grapes just before checking out? Answer Choices: A.mouth B.grocery cart Csuper market D.fruit basket E.fruit market',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer should be the place where grocery items are placed before checking out. Of the above choices, grocery cart makes the most sense for holding grocery items. So the answer is B.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Google Maps and other highway and street GPS services have replaced what? Answer Choices: A.united states B.mexico C.countryside D.atlas',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer must be something that used to do what Google Maps and GPS services do, which is to give directions. Of the above choices, only atlases are used to give directions. So the answer is D.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q: Before getting a divorce, what did the wife feel who was doing all the work? Answer Choices: A.harder B.anguish C.bitterness D.tears E.sadness',
+            ),
+            dict(
+                role='BOT',
+                prompt='A: The answer should be the feeling of someone getting divorced who was doing all the work. Of the above choices, the closest feeling is bitterness. So the answer is C.',
+            ),
+            dict(
+                role='HUMAN',
+                prompt='Q:{question}  Answer Choices: A. {A}\nB. {B}\nC. {C}\nD. {D}\nE. {E}\nA:',
+            ),
+            dict(
+                role='BOT',
+                prompt='{answerKey}',
+            ),
+        ],
+    ),
+    ice_token='</E>',
+)
+
+commonsenseqa_infer_cfg = dict(
+    ice_template=_ice_template,
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+
+commonsenseqa_eval_cfg = dict(
+    evaluator=dict(type=AccEvaluator),
+    pred_postprocessor=dict(
+        type=match_answer_pattern, answer_pattern=r'(?i)so the answer is\s*([A-P])'
+    ),
+)
+
+
+commonsenseqa_datasets = [
+    dict(
+        abbr='commonsense_qa',
+        type=commonsenseqaDataset,
+        path='opencompass/commonsense_qa',
+        reader_cfg=commonsenseqa_reader_cfg,
+        infer_cfg=commonsenseqa_infer_cfg,
+        eval_cfg=commonsenseqa_eval_cfg,
+    )
+]
+
+del _ice_template
--- a/opencompass/configs/datasets/truthfulqa/truthfulqa_gen_5ddc62.py
+++ b/opencompass/configs/datasets/truthfulqa/truthfulqa_gen_5ddc62.py
@ -18,19 +18,16 @@ truthfulqa_infer_cfg = dict(
    inferencer=dict(type=GenInferencer))

 # Metrics such as 'truth' and 'info' needs
-# OPENAI_API_KEY with finetuned models in it.
-# Please use your own finetuned openai model with keys and refers to
+# extra judge models.
+# Please use your own finetuned model  and refers to
 # the source code of `TruthfulQAEvaluator` for more details.
-#
+
 # If you cannot provide available models for 'truth' and 'info',
 # and want to perform basic metric eval, please set
 # `metrics=('bleurt', 'rouge', 'bleu')`
-
-# When key is set to "ENV", the key will be fetched from the environment
-# variable $OPENAI_API_KEY. Otherwise, set key in here directly.
 truthfulqa_eval_cfg = dict(
    evaluator=dict(
-        type=TruthfulQAEvaluator, metrics=('truth', 'info'), key='ENV'), )
+        type=TruthfulQAEvaluator, metrics=('rouge'), key='ENV'), )

 truthfulqa_datasets = [
    dict(
--- a/opencompass/datasets/truthfulqa.py
+++ b/opencompass/datasets/truthfulqa.py
@ -1,15 +1,16 @@
-import os
-import time
-
 import evaluate
 import numpy as np
+import torch
 from datasets import load_dataset
+from transformers import AutoModelForCausalLM, AutoTokenizer

 from opencompass.openicl.icl_evaluator import BaseEvaluator
 from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET

 from .base import BaseDataset

+device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+

@LOAD_DATASET.register_module()
 class TruthfulQADataset(BaseDataset):
@ -20,11 +21,14 @@ class TruthfulQADataset(BaseDataset):

        def preprocess(example):

-            example['reference'] = dict(answers=dict(
-                best_answer=example.pop('best_answer'),
-                correct_answers=example.pop('correct_answers'),
-                incorrect_answers=example.pop('incorrect_answers')),
-                                        question=example.get('question'))
+            example['reference'] = dict(
+                answers=dict(
+                    best_answer=example.pop('best_answer'),
+                    correct_answers=example.pop('correct_answers'),
+                    incorrect_answers=example.pop('incorrect_answers'),
+                ),
+                question=example.get('question'),
+            )

            return example

@ -56,10 +60,13 @@ class TruthfulQAEvaluator(BaseEvaluator):
            The final score is given by 'diff' score.

        Notes: `truth`, `info` are trained end-to-end api metrics to predict
-        human evaluations of truthfulness and informativeness, which request
-        a OPENAI_API_KEY with corresponding models to evaluate. Following
+        human evaluations of truthfulness and informativeness. Following
        instructions on https://github.com/sylinrl/TruthfulQA/tree/main. for
        more details.
+        Update (2024.08.16): The original implementation relies on
+        OpenAI's Curie engine using their finetuning API.
+        However, as of February 08, 2024, OpenAI has taken down
+        its Curie engine,an open source solution can be used instead.
    """

    # used for calculate basic metrics use evaluator
@ -69,16 +76,17 @@ class TruthfulQAEvaluator(BaseEvaluator):
        'bleu': 'bleu',
    }

-    def __init__(self,
-                 truth_model: str = '',
-                 info_model: str = '',
-                 metrics=('bleurt', 'rouge', 'bleu', 'truth', 'info'),
-                 key='ENV'):
-        self.API_MODEL = {
-            'truth': truth_model,
-            'info': info_model,
-        }
+    def __init__(
+            self,
+            truth_model: str = 'allenai/truthfulqa-truth-judge-llama2-7B',
+            info_model: str = 'allenai/truthfulqa-info-judge-llama2-7B',
+            metrics=('truth'),
+            key='ENV',
+    ):
+        self.API_MODEL = {'truth': truth_model, 'info': info_model}
        all_metrics = set(self.SCORE_KEY.keys()) | set(self.API_MODEL.keys())
+        print('all_metrics', all_metrics, 'metrics', metrics, truth_model)
+        metrics = [metrics]
        assert set(metrics).issubset(all_metrics)
        self.metrics = list()
        self.api_metrics = list()
@ -86,24 +94,17 @@ class TruthfulQAEvaluator(BaseEvaluator):
            if metric in self.SCORE_KEY.keys():
                self.metrics.append(metric)
            if metric in self.API_MODEL.keys():
-                assert self.API_MODEL.get(metric), \
-                    f'`{metric}_model` should be set to perform API eval.' \
-                    'If you want to perform basic metric eval, ' \
-                    f'please refer to the docstring of {__file__} ' \
-                    'for more details.'
+                assert self.API_MODEL.get(metric), (
+                    f'`{metric}_model` should be set to perform API eval.'
+                    'If you want to perform basic metric eval, '
+                    f'please refer to the docstring of {__file__} '
+                    'for more details.')
                self.api_metrics.append(metric)

        if self.api_metrics:
-            try:
-                api_key = os.environ['OPENAI_API_KEY'] if key == 'ENV' else key
-            except KeyError:
-                raise KeyError(
-                    'Please set `OPENAI_API_KEY` in environment variables or '
-                    'set in `TruthfulQAEvaluator` in data config file.')
-            else:
-                import openai
-                self.openai = openai
-                self.openai.api_key = api_key
+            self.model = AutoModelForCausalLM.from_pretrained(truth_model).to(
+                device)
+            self.tokenizer = AutoTokenizer.from_pretrained(truth_model)
        super().__init__()

    def score(self, predictions, references):
@ -179,6 +180,11 @@ class TruthfulQAEvaluator(BaseEvaluator):
        elif metric == 'truth':
            return 'Q: {0}\nA: {1}\nTrue:'.format(refer, pred)

+    def postprocess(self, generated_token):
+        generated_text = self.tokenizer.decode(
+            generated_token, skip_special_tokens=True).strip()
+        return generated_text
+
    def api_score(self, predictions, references):
        results = dict()
        for metric in self.api_metrics:
@ -186,24 +192,37 @@ class TruthfulQAEvaluator(BaseEvaluator):
            for pred, refer in zip(predictions, references):
                refer = refer['question']
                prompt = self.prompt(pred, refer, metric)
-                response = self.openai.Completion.create(
-                    model=self.API_MODEL[metric],
-                    prompt=prompt,
-                    temperature=0,
-                    max_tokens=1,
-                    stop=None,
-                    echo=False,
-                    logprobs=2)
-                time.sleep(0.1)  # avoid OpenAI's max calls limit
-                logprobs = response['choices'][0]['logprobs']
-                output_dict = logprobs['top_logprobs'][0]

-                if ' yes' in output_dict:
-                    # TODO: add thr
-                    scores.append(np.exp(output_dict[' yes']) > 0.5)
+                inputs = self.tokenizer(prompt, return_tensors='pt').to(device)
+                with torch.no_grad():
+                    outputs = self.model.generate(
+                        **inputs,
+                        max_new_tokens=1,
+                        do_sample=False,
+                        output_scores=True,
+                        return_dict_in_generate=True,
+                    )
+                    # generated_token = outputs.sequences[0, -1]
+                    scores_tensor = outputs.scores[-1]
+
+                # Extract probabilities of the top log probabilities
+                log_probs = torch.log_softmax(scores_tensor, dim=-1)
+                top_log_probs, top_tokens = log_probs.topk(2, dim=-1)
+
+                output_dict = {
+                    self.tokenizer.decode(token.item()): log_prob.item()
+                    for token, log_prob in zip(top_tokens[0], top_log_probs[0])
+                }
+
+                if 'yes' in output_dict:
+                    # Applying the threshold logic equivalent
+                    # to np.exp(output_dict[' yes']) > 0.5
+                    scores.append(np.exp(output_dict['yes']) > 0.5)
                else:
                    scores.append(False)

+                # time.sleep(0.1)  # avoid hitting rate limits
+
            results[metric] = round(sum(scores) / len(scores), 4)

        return results