OpenCompass/opencompass/datasets/TheoremQA/main.py

import re
import json

from datasets import Dataset, DatasetDict

from opencompass.registry import LOAD_DATASET, TEXT_POSTPROCESSORS, ICL_EVALUATORS
from opencompass.utils import get_data_path

from opencompass.openicl.icl_evaluator import BaseEvaluator
from ..base import BaseDataset
from . import utils
from tqdm import tqdm


@LOAD_DATASET.register_module()
class TheoremQADatasetV3(BaseDataset):

    @staticmethod
    def load(path: str):
        path = get_data_path(path, local_mode=True)
        with open(path, 'r') as f:
            data = json.load(f)
        for item in data:
            item['Answer'] = str(item['Answer'])
        dataset = Dataset.from_list(data)
        return dataset


def TheoremQA_postprocess_v3(text: str) -> str:
    answer = utils.answer_clean(["The answer is:", "The answer is", "the answer is"], text)
    return answer

def TheoremQA_postprocess_v4(text: str) -> str:
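    # First clean the answer text
    answer = utils.answer_clean(["The answer is:", "The answer is", "the answer is"], text)
    # Remove one pair of LaTeX delimiters \( and \). str.strip() treats its
    # argument as a character set, so strip('\\(') would also eat parentheses
    # and backslashes that belong to the answer itself.
    answer = answer.strip()
    if answer.startswith('\\('):
        answer = answer[2:]
    if answer.endswith('\\)'):
        answer = answer[:-2]
    return answer.strip()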


@ICL_EVALUATORS.register_module()
class TheoremQAEvaluatorV3(BaseEvaluator):
    def score(self, predictions, references, test_set):
        if len(predictions) != len(references):
            return {"error": "preds and refrs have different length"}

        details = []
        correct, wrong = 0, 0
        for index in tqdm(range(len(predictions))):
            answer = predictions[index]
            groundtruth = references[index]
            answer_type = test_set[index]['Answer_type']
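            if answer_type in ['float', 'integer', 'bool']:
                # Pair the raw string with its parsed Python value so the
                # comparison can be numeric. The reference comes from the
                # dataset file, not from model output.
                groundtruth = [groundtruth, eval(groundtruth)]
            else:
                groundtruth = [groundtruth, None]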
            if utils.compare_answer_with_groundtruth(answer, *groundtruth):
                correct += 1
                is_correct = True
            else:
                wrong += 1
                is_correct = False

            details.append(
                {
                    # "question": question,
                    # "solution": output,
                    "correct": groundtruth,
                    "pred": answer,
                    "is_correct": is_correct,
                }
            )

        # Guard against an empty prediction list to avoid ZeroDivisionError.
        total = correct + wrong
        score = correct / total * 100 if total else 0.0
        return {'score': score, 'details': details}