OpenCompass/opencompass/datasets/OpenHuEval/HuSimpleQA.py

import json
import os

from datasets import Dataset, DatasetDict

from ..base import BaseDataset

JUDGE_PROMPT = {
    'system_prompt':
    """
- Role: Hungarian General Knowledge Quiz Review Expert
- Background: Hungarian General Knowledge Quiz involves many aspects of \
Hungarian history, culture, geography, politics, people, etc., \
and a professional review expert is needed to ensure the accuracy \
of the answers.
- Goals: According to the given questions and reference answers, \
accurately judge whether the answers entered by the user are correct, \
and give the evaluation results: "CORRECT", "INCORRECT", "NOT_ATTEMPTED".
- Evaluation criteria:
    - "Correct":
        - Predict answer fully contain the important information \
in the gold_target.
        - Predict answer do not contain any information that contradicts \
the gold_target.
        - Only semantic meaning matters; capitalization, punctuation, \
grammar, and order don't matter.
        - Hedging and guessing are permissible, provided that the \
gold_target is fully included and the response contains \
            no incorrect information or contradictions.
        - The answer must be consistent with the scope of the question. \
For example, if the question asks “on which day was someone born,”\
the answer must specify the exact date, such as “January 3, 1997.”

    - "Not attempted":
        - Questions that the user has not attempted to answer should \
be marked as "NOT_ATTEMPTED".
        - The important information in the gold_target is not included \
in the answer.
        - No statements in the answer contradict the gold_target.

- Workflow:
    1. Receive questions, reference answers, and user answers.
    2. Compare the reference answers and user answers to determine \
whether they are consistent.
    3. Based on the judgment results, output the corresponding \
evaluation results.
- Constraints:
    - For grading questions where the gold_target is a number, \
the predicted_answer needs to be correct to the last significant figure \
in the gold answer. For example, consider a question \
“Hány látogató érkezett Magyarországra 2024-ben?” with gold_target “440k”.
        - predicted_answers “440k”, “444k”, and “435k” are all CORRECT.
        - predicted_answers “400k” and “413k” are INCORRECT.
        - predicted_answers “körülbelül 400k” and “több mint 300k” \
are considered NOT_ATTEMPTED because they neither confirm nor contradict \
the gold_target.
    - The gold_target may contain more information than the question. \
In such cases, the predicted_answer only needs to contain the information \
that is in the question.
        - For example, consider the question “Where was The Martian filmed \
to represent the NASA Johnson Space Center?” with the gold_target \
“Budapesti Bálna (HU)”. Either “Budapesti Bálna” or Bálna (építmény) \
would be considered a CORRECT answer.
    - Do not punish predicted_answers if they omit information that \
would be clearly inferred from the question.
        - For example, consider the question “Melyik városban található \
a Magyar Nemzeti Bank székhelye?” with the gold_target \
“Budapest, Magyarország”. The predicted_answer “Budapest” would be \
considered CORRECT, even though it does not include “Magyarország”.
        - Consider the question “Milyen díjat nyert a ‘Napsugár’ \
című magyar film a 2018-as Velencei Filmfesztiválon?” with the gold_target \
“Legjobb Forgatókönyv Díja”. The predicted_answer “Legjobb Forgatókönyv” \
would be considered CORRECT, because “díj” is presumed in the question.
        - For the question “Mi Dominik Szoboszlai magassága méterben?” \
with the gold_target “1,85 m”, the predicted_answer “1,85” would be \
considered CORRECT, because meters is specified in the question.
        - For the question “Mi Magyarország első női miniszterelnökének neve?”\
with the gold_target “Wittner Mária”, the predicted_answer “Mária” \
would be considered CORRECT, because the last name is presumed.
    - Do not punish for typos in people’s names if it’s clearly the same name.
        - For example, if the gold_target is “Nagy László”, you can consider \
the following predicted_answers as correct: “László Nagy”, “NAGY LÁSZLÓ”, \
or “nagy lászló”.

Example1:
Input:
```
{
    "question": "Melyik törvény foglalkozik a találmányok szabadalmi \
oltalmával az 1969-es jogalkotásban?",
    "gold_target": "1969. évi II. törvény",
    "predicted_answer": "Nem áll rendelkezésre internetes keresés, \
így nem tudom megválaszolni a kérdést. Azonban 1969-ben valóban elfogadták \
a szabadalmi védelmi törvényt."
}
```
Output:
```
{
    "evaluation": "NOT_ATTEMPTED"
}
```

Example2:
Input:
```
{
    "question": "Melyik városban született Tisza Lajos 1832. \
szeptember 12-én?",
    "gold_target": "Nagyváradon",
    "predicted_answer": "Nagyváradon született Tisza Lajos 1832. \
szeptember 12-én."
}
```
Output: # Although the answer is long, it accurately answers the question
```
{
    "evaluation": "CORRECT"
}
```

Example3:
```
Input:
{
    "question": "Melyik olimpián vett részt Bogár Pál kosárlabda-játékosként?",
    "gold_target": "1952-es Helsinki nyári olimpia",
    "predicted_answer": "1952 Helsinki olimpián."
}
```
Output: # The descriptions are slightly different, but they all refer to \
the same Olympic Games, so they are considered correct
```
{
    "evaluation": "CORRECT"
}
```

Example4:
Input:
```
{
    "question": "Melyik labdarúgócsapat kötődik Budapest XIX. kerületéhez, \
amely 14-szeres magyar bajnok?",
    "gold_target": "Budapest Honvéd FC",
    "predicted_answer": "Ferencváros"
}
```
Output: #Although Ferencváros is a very famous football club in Hungary, \
it has no connection with the 19th district of Budapest and its number of \
championships does not match the description in the question.
```
{
    "evaluation": "INCORRECT"
}
```

Example5:
Input:
```
{
    "question": "Milyen biztosítás bevezetését szabályozta egy 1952-es \
törvényerejű rendelet Magyarországon?",
    "gold_target": "kötelező tűz- és jégbiztosítás",
    "predicted_answer": "Kötelező tűzbiztosítás"
}
```
Output: # The predicted_answer does not include all correct answers
```
{
    "evaluation": "INCORRECT"
}
```
""",
    'user_prompt':
    """Please strictly follow the above example and requirements, \
evaluate the following answer. Input:
```
{
    "question": {question},
    "gold_target": {answer},
    "predicted_answer": {prediction}
}
```
Please respond strictly in JSON format. Do not include any additional text \
outside the JSON structure.
Output:

Please provide your evaluation results in the following json format by \
filling in the placeholders in []:
```
{
    "evaluation": ["CORRECT"/"INCORRECT"/"NOT_ATTEMPTED" ]
}
```"""
}


class HuSimpleQADataset(BaseDataset):

    @staticmethod
    def load(filepath, *args, **kwargs):
        assert os.path.isfile(filepath)
        assert filepath.endswith('.jsonl')
        dataset = DatasetDict()
        with open(filepath, 'r', encoding='utf-8') as fp:
            objs = [json.loads(line) for line in fp.readlines()]
        raw_data = []
        for obj in objs:
            question = obj['question']
            answer = obj['answer']
            user_prompt = JUDGE_PROMPT['user_prompt']
            user_prompt = user_prompt.replace('{question}', question)
            user_prompt = user_prompt.replace('{answer}', answer)
            raw_data.append(
                dict(question=question,
                     prompt=JUDGE_PROMPT['system_prompt'] + user_prompt,
                     references=obj))
        dataset = Dataset.from_list(raw_data)
        return dataset