[Feature] Add IFEval (#813)

* [Feature] Add IFEval
* [Doc] add introduction of IFEval
2025-05-30 16:03:24 +08:00 · 2024-01-23 20:07:49 +08:00 · 2024-01-23 20:07:49 +08:00 · e059a5c2bf
commit e059a5c2bf
parent 3d9bb4aed7
9 changed files with 2189 additions and 0 deletions
--- a/configs/datasets/IFEval/IFEval.md
+++ b/configs/datasets/IFEval/IFEval.md
@@ -0,0 +1,53 @@
+# IFEval
+## Introduction
+The following introduction comes from the abstract of [Instruction-Following Evaluation for Large Language Models](https://arxiv.org/abs/2311.07911):
+
+```
+IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. 
+```
+
+## Official link
+
+### Paper
+
+[Instruction-Following Evaluation for Large Language Models](https://arxiv.org/abs/2311.07911)
+
+### Repository
+
+[instruction_following_eval](https://github.com/google-research/google-research/tree/master/instruction_following_eval)
+
+## Examples
+Input example I:
+```
+Write a 300+ word summary of the wikipedia page \"https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli\". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.
+```
+Output example I (from GPT-4):
+```
+Raymond III was the Count of Tripoli from 1152 to 1187 and Prince of Galilee and Tiberias in the Kingdom of Jerusalem. He was born in 1140 to Raymond II of Tripoli and Hodierna of Jerusalem. His mother was the daughter of Baldwin II of Jerusalem. \n\n*Early Life and Succession*\n\nRaymond III was only a child when his father was murdered. His mother Hodierna was regent until Raymond came of age. In 1155 Raymond married Eschiva the daughter of Walter I of Beirut. They had three children: Raymond IV Bertrand and a daughter who married Guy of Lusignan. \n\n*Reign*\n\nRaymond III's reign was marked by a series of conflicts and alliances with the Muslim world. He was involved in the defense of the Kingdom of Jerusalem against Nur ad-Din and later Saladin. He was also a key figure in the internal politics of the kingdom. He was a supporter of the queen mother Amalric of Jerusalem and opposed the succession of Guy of Lusignan. \n\n*Later Life and Death*\n\nIn 1187 Raymond III was part of the disastrous Battle of Hattin where the Christian forces were defeated by Saladin. He was one of the few to escape the battlefield but died later that year. His son Raymond IV succeeded him as Count of Tripoli. \n\nRaymond III's life and reign were marked by the complex politics and warfare of the Crusader states. He was a key figure in the defense of the Kingdom of Jerusalem and his death marked a significant loss for the Christian forces in the region. His legacy is a testament to the turbulent times in which he lived and the challenges faced by the Crusader states in their relations with the Muslim world.
+```
+Input example II:
+```
+"I am planning a trip to Japan, and I would like thee to write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response."
+```
+Output example II (from GPT-4):
+```
+Hark! Hearken to the tale of thy journey to the land of the rising sun, Japan. \n\nUpon the morrow of thy arrival, thou shalt find thyself in Tokyo, the city of lights and towers. In the morn, thou shalt explore the grandeur of the Imperial Palace, a sight to behold. As the sun reaches its zenith, thou shalt partake in a feast of sushi, a delicacy of the land. As the day wanes, thou shalt wander through the electric streets of Akihabara, a realm of technology and animation.\n\nOn the second day, thou shalt journey to the ancient city of Kyoto. In the soft light of dawn, thou shalt visit the golden pavilion of Kinkaku-ji, a vision of beauty and tranquility. As the day unfolds, thou shalt wander through the bamboo grove of Arashiyama, a place of peace and serenity. As the sun sets, thou shalt partake in a traditional tea ceremony, a ritual of grace and harmony.\n\nOn the third day, thou shalt travel to the city of Hiroshima. In the morning, thou shalt pay thy respects at the Peace Memorial Park, a place of remembrance and hope. As the day progresses, thou shalt journey to the island of Miyajima, home to the floating torii gate, a sight of wonder and awe. As the day ends, thou shalt feast on okonomiyaki, a dish of delight and satisfaction.\n\nOn the fourth day, thou shalt return to Tokyo. In the morning, thou shalt ascend the Tokyo Skytree, a tower of height and grandeur. As the day unfolds, thou shalt explore the bustling streets of Shibuya, a realm of fashion and culture. As the sun sets, thou shalt dine in the vibrant district of Roppongi, a place of gastronomy and nightlife.\n\nOn the fifth and final day, thou shalt bid farewell to the land of the rising sun. With memories of beauty and wonder, thou shalt embark on thy journey home, carrying with thee the spirit of Japan. \n\nThus ends the tale of thy journey, a tale of exploration and discovery. May thy travels be filled with joy and adventure, and may the land of the rising sun forever hold a place in thy heart.
+```
+## Evaluation results
+
+```
+dataset    version    metric      mode      internlm2-chat-20b-hf
+---------  ---------  ----------  ------  -----------------------
+IFEval     3321a3     strict_acc  gen                        0.33
+IFEval     3321a3     loose_acc   gen                        0.4
+```
+
+## Reference
+```
+@article{zhou2023instruction,
+  title={Instruction-Following Evaluation for Large Language Models},
+  author={Zhou, Jeffrey and Lu, Tianjian and Mishra, Swaroop and Brahma, Siddhartha and Basu, Sujoy and Luan, Yi and Zhou, Denny and Hou, Le},
+  journal={arXiv preprint arXiv:2311.07911},
+  year={2023}
+}
+```
--- a/configs/datasets/IFEval/IFEval_gen.py
+++ b/configs/datasets/IFEval/IFEval_gen.py
@@ -0,0 +1,4 @@
+from mmengine.config import read_base
+
+with read_base():
+    from .IFEval_gen_3321a3 import ifeval_datasets  # noqa: F401, F403
--- a/configs/datasets/IFEval/IFEval_gen_3321a3.py
+++ b/configs/datasets/IFEval/IFEval_gen_3321a3.py
@@ -0,0 +1,33 @@
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+from opencompass.openicl.icl_inferencer import GenInferencer
+from opencompass.datasets import IFEvalDataset, IFEvaluator
+
+ifeval_reader_cfg = dict(
+    input_columns=['prompt'], output_column='reference')
+
+ifeval_infer_cfg = dict(
+    prompt_template=dict(
+        type=PromptTemplate,
+        template=dict(round=[
+            dict(
+                role='HUMAN',
+                prompt='{prompt}'),
+        ])),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer, max_out_len=1025))
+
+ifeval_eval_cfg = dict(
+    evaluator=dict(type=IFEvaluator),
+    pred_role='BOT',
+)
+
+ifeval_datasets = [
+    dict(
+        abbr='IFEval',
+        type=IFEvalDataset,
+        path='./data_1/ifeval/input_data.jsonl',
+        reader_cfg=ifeval_reader_cfg,
+        infer_cfg=ifeval_infer_cfg,
+        eval_cfg=ifeval_eval_cfg)
+]
--- a/opencompass/datasets/IFEval/evaluation_main.py
+++ b/opencompass/datasets/IFEval/evaluation_main.py
@@ -0,0 +1,143 @@
+# flake8: noqa
+# yapf: disable
+
+# Copyright 2023 The Google Research Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import dataclasses
+from typing import Dict, Optional, Union
+
+from absl import flags
+
+import opencompass.datasets.IFEval.instructions_registry as instructions_registry
+
+_INPUT_DATA = flags.DEFINE_string('input_data',
+                                  None,
+                                  'path to input data',
+                                  required=True)
+
+_INPUT_RESPONSE_DATA = flags.DEFINE_string('input_response_data',
+                                           None,
+                                           'path to input response data',
+                                           required=False)
+
+_OUTPUT_DIR = flags.DEFINE_string(
+    'output_dir',
+    None,
+    'Output directory for inference and eval results.',
+    required=True,
+)
+
+
+@dataclasses.dataclass
+class InputExample:
+    key: int
+    instruction_id_list: list[str]
+    prompt: str
+    kwargs: list[Dict[str, Optional[Union[str, int]]]]
+
+
+@dataclasses.dataclass
+class OutputExample:
+    instruction_id_list: list[str]
+    prompt: str
+    response: str
+    follow_all_instructions: bool
+    follow_instruction_list: list[bool]
+
+
+def test_instruction_following_strict(
+    inp,
+    response,
+):
+    """Tests response to see if instructions are followed."""
+    instruction_list = inp.instruction_id_list
+    is_following_list = []
+
+    for index, instruction_id in enumerate(instruction_list):
+        instruction_cls = instructions_registry.INSTRUCTION_DICT[
+            instruction_id]
+        instruction = instruction_cls(instruction_id)
+        print(index)
+        print(inp)
+        instruction.build_description(**inp.kwargs[index])
+        args = instruction.get_instruction_args()
+        if args and 'prompt' in args:
+            instruction.build_description(prompt=inp.prompt)
+
+        if response.strip() and instruction.check_following(response):
+            is_following_list.append(True)
+        else:
+            is_following_list.append(False)
+
+    return OutputExample(
+        instruction_id_list=inp.instruction_id_list,
+        prompt=inp.prompt,
+        response=response,
+        follow_all_instructions=all(is_following_list),
+        follow_instruction_list=is_following_list,
+    )
+
+
+def test_instruction_following_loose(
+    inp,
+    response,
+):
+    """Tests response for an upper bound for following instructions."""
+    r = response.split('\n')
+    response_remove_first = '\n'.join(r[1:]).strip()
+    response_remove_last = '\n'.join(r[:-1]).strip()
+    response_remove_both = '\n'.join(r[1:-1]).strip()
+    revised_response = response.replace('*', '')
+    revised_response_remove_first = response_remove_first.replace('*', '')
+    revised_response_remove_last = response_remove_last.replace('*', '')
+    revised_response_remove_both = response_remove_both.replace('*', '')
+    all_responses = [
+        response,
+        revised_response,
+        response_remove_first,
+        response_remove_last,
+        response_remove_both,
+        revised_response_remove_first,
+        revised_response_remove_last,
+        revised_response_remove_both,
+    ]
+    instruction_list = inp.instruction_id_list
+    is_following_list = []
+
+    for index, instruction_id in enumerate(instruction_list):
+        instruction_cls = instructions_registry.INSTRUCTION_DICT[
+            instruction_id]
+        instruction = instruction_cls(instruction_id)
+
+        instruction.build_description(**inp.kwargs[index])
+        args = instruction.get_instruction_args()
+        if args and 'prompt' in args:
+            instruction.build_description(prompt=inp.prompt)
+
+        is_following = False
+        for r in all_responses:
+            if r.strip() and instruction.check_following(r):
+                is_following = True
+                break
+
+        is_following_list.append(is_following)
+
+    return OutputExample(
+        instruction_id_list=inp.instruction_id_list,
+        prompt=inp.prompt,
+        response=response,
+        follow_all_instructions=all(is_following_list),
+        follow_instruction_list=is_following_list,
+    )
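Note on the loose check: `test_instruction_following_loose` counts an instruction as followed if any of eight variants of the response passes. The variants come from optionally dropping the first line, the last line, or both, each with and without `*` characters removed. The following is a minimal sketch of that expansion, not the committed code: `loose_variants` is a hypothetical helper, the variants are listed in a different order than in the diff (the order does not matter because any single match suffices), and `str.isupper` stands in for a real checker from `instructions_registry`.

```python
def loose_variants(response: str) -> list[str]:
    """Return the eight response variants tried by the loose check."""
    lines = response.split("\n")
    trimmed = [
        response,                        # untouched response
        "\n".join(lines[1:]).strip(),    # first line removed
        "\n".join(lines[:-1]).strip(),   # last line removed
        "\n".join(lines[1:-1]).strip(),  # first and last lines removed
    ]
    # Same four variants again, with markdown asterisks stripped.
    return trimmed + [text.replace("*", "") for text in trimmed]


response = "Sure, here it is:\n*HELLO WORLD*\nHope this helps!"
variants = loose_variants(response)

# Stand-in for an "all capital letters" instruction checker (assumption).
strict_ok = bool(response.strip()) and response.isupper()
loose_ok = any(v.strip() and v.isupper() for v in variants)
print(strict_ok, loose_ok)
```

Here the strict check fails because of the surrounding chatter, while the loose check passes once the first and last lines are dropped. This is why `loose_acc` is an upper bound on `strict_acc`.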
--- a/opencompass/datasets/IFEval/ifeval.py
+++ b/opencompass/datasets/IFEval/ifeval.py
@@ -0,0 +1,53 @@
+import json
+
+from datasets import Dataset
+
+from opencompass.openicl.icl_evaluator import BaseEvaluator
+from opencompass.registry import LOAD_DATASET
+
+from ..base import BaseDataset
+from .evaluation_main import (InputExample, test_instruction_following_loose,
+                              test_instruction_following_strict)
+
+
+@LOAD_DATASET.register_module()
+class IFEvalDataset(BaseDataset):
+
+    @staticmethod
+    def load(path):
+        datasets = []
+        with open(path, 'r', encoding='utf-8') as file:
+            for line in file:
+                tmp = json.loads(line.strip())
+                dataset = dict(prompt=tmp['prompt'], reference=tmp)
+                datasets.append(dataset)
+        return Dataset.from_list(datasets)
+
+
+class IFEvaluator(BaseEvaluator):
+
+    def score(self, predictions, references):
+        results = []
+        for pred, refer in zip(predictions, references):
+            print(refer)
+            example = InputExample(
+                key=refer['key'],
+                instruction_id_list=refer['instruction_id_list'],
+                prompt=refer['prompt'],
+                kwargs=refer['kwargs'])
+            for kwarg in example.kwargs:
+                for k in list(kwarg.keys()):
+                    if kwarg[k] is None:
+                        kwarg.pop(k, None)
+            result = dict(
+                strict=test_instruction_following_strict(example, pred),
+                loose=test_instruction_following_loose(example, pred),
+            )
+            results.append(result)
+        strict = sum(
+            [result['strict'].follow_all_instructions
+             for result in results]) / len(results)
+        loose = sum(
+            [result['loose'].follow_all_instructions
+             for result in results]) / len(results)
+        return dict(strict_acc=strict, loose_acc=loose)
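Note on the metric: both `strict_acc` and `loose_acc` are prompt-level accuracies. A prompt counts as correct only if every instruction attached to it is followed, and the score is the fraction of such prompts. The sketch below illustrates that arithmetic and the `None` stripping applied to `kwargs`. The helper names `drop_none` and `prompt_level_accuracy` are hypothetical, and the empty-input guard is an addition: the committed `score` divides by `len(results)` directly and would raise `ZeroDivisionError` on empty input.

```python
def drop_none(kwargs_list: list[dict]) -> list[dict]:
    """Remove None-valued entries so they are not passed to build_description."""
    return [{k: v for k, v in kw.items() if v is not None}
            for kw in kwargs_list]


def prompt_level_accuracy(follow_lists: list[list[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed."""
    if not follow_lists:  # guard added for the sketch only
        return 0.0
    return sum(all(flags) for flags in follow_lists) / len(follow_lists)


# One inner list per prompt, one flag per instruction in that prompt.
per_prompt = [[True, True], [True, False], [True], [False]]
print(prompt_level_accuracy(per_prompt))  # 2 of 4 prompts fully followed

print(drop_none([{"num_words": 300, "relation": None}, {}]))
```

A stricter or more lenient aggregate (for example instruction-level accuracy) would need the per-instruction `follow_instruction_list` values, which `OutputExample` already carries.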
--- a/opencompass/datasets/IFEval/instructions.py
+++ b/opencompass/datasets/IFEval/instructions.py
--- a/opencompass/datasets/IFEval/instructions_registry.py
+++ b/opencompass/datasets/IFEval/instructions_registry.py
@@ -0,0 +1,190 @@
+# Copyright 2023 The Google Research Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Registry of all instructions."""
+import opencompass.datasets.IFEval.instructions as instructions
+
+_KEYWORD = 'keywords:'
+
+_LANGUAGE = 'language:'
+
+_LENGTH = 'length_constraints:'
+
+_CONTENT = 'detectable_content:'
+
+_FORMAT = 'detectable_format:'
+
+_MULTITURN = 'multi-turn:'
+
+_COMBINATION = 'combination:'
+
+_STARTEND = 'startend:'
+
+_CHANGE_CASES = 'change_case:'
+
+_PUNCTUATION = 'punctuation:'
+
+INSTRUCTION_DICT = {
+    _KEYWORD + 'existence':
+    instructions.KeywordChecker,
+    _KEYWORD + 'frequency':
+    instructions.KeywordFrequencyChecker,
+    # TODO(jeffreyzhou): make a proper set of sentences to choose from
+    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
+    _KEYWORD + 'forbidden_words':
+    instructions.ForbiddenWords,
+    _KEYWORD + 'letter_frequency':
+    instructions.LetterFrequencyChecker,
+    _LANGUAGE + 'response_language':
+    instructions.ResponseLanguageChecker,
+    _LENGTH + 'number_sentences':
+    instructions.NumberOfSentences,
+    _LENGTH + 'number_paragraphs':
+    instructions.ParagraphChecker,
+    _LENGTH + 'number_words':
+    instructions.NumberOfWords,
+    _LENGTH + 'nth_paragraph_first_word':
+    instructions.ParagraphFirstWordCheck,
+    _CONTENT + 'number_placeholders':
+    instructions.PlaceholderChecker,
+    _CONTENT + 'postscript':
+    instructions.PostscriptChecker,
+    _FORMAT + 'number_bullet_lists':
+    instructions.BulletListChecker,
+    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
+    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
+    _FORMAT + 'constrained_response':
+    instructions.ConstrainedResponseChecker,
+    _FORMAT + 'number_highlighted_sections':
+    (instructions.HighlightSectionChecker),
+    _FORMAT + 'multiple_sections':
+    instructions.SectionChecker,
+    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
+    # _FORMAT + "rephrase": instructions.RephraseChecker,
+    _FORMAT + 'json_format':
+    instructions.JsonFormat,
+    _FORMAT + 'title':
+    instructions.TitleChecker,
+    # TODO(tianjianlu): Re-enable with specific prompts.
+    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
+    _COMBINATION + 'two_responses':
+    instructions.TwoResponsesChecker,
+    _COMBINATION + 'repeat_prompt':
+    instructions.RepeatPromptThenAnswer,
+    _STARTEND + 'end_checker':
+    instructions.EndChecker,
+    _CHANGE_CASES + 'capital_word_frequency':
+    instructions.CapitalWordFrequencyChecker,
+    _CHANGE_CASES + 'english_capital':
+    instructions.CapitalLettersEnglishChecker,
+    _CHANGE_CASES + 'english_lowercase':
+    instructions.LowercaseLettersEnglishChecker,
+    _PUNCTUATION + 'no_comma':
+    instructions.CommaChecker,
+    _STARTEND + 'quotation':
+    instructions.QuotationChecker,
+}
+
+INSTRUCTION_CONFLICTS = {
+    _KEYWORD + 'existence': {_KEYWORD + 'existence'},
+    _KEYWORD + 'frequency': {_KEYWORD + 'frequency'},
+    # TODO(jeffreyzhou): make a proper set of sentences to choose from
+    # _KEYWORD + "key_sentences": instructions.KeySentenceChecker,
+    _KEYWORD + 'forbidden_words': {_KEYWORD + 'forbidden_words'},
+    _KEYWORD + 'letter_frequency': {_KEYWORD + 'letter_frequency'},
+    _LANGUAGE + 'response_language': {
+        _LANGUAGE + 'response_language',
+        _FORMAT + 'multiple_sections',
+        _KEYWORD + 'existence',
+        _KEYWORD + 'frequency',
+        _KEYWORD + 'forbidden_words',
+        _STARTEND + 'end_checker',
+        _CHANGE_CASES + 'english_capital',
+        _CHANGE_CASES + 'english_lowercase',
+    },
+    _LENGTH + 'number_sentences': {_LENGTH + 'number_sentences'},
+    _LENGTH + 'number_paragraphs': {
+        _LENGTH + 'number_paragraphs',
+        _LENGTH + 'nth_paragraph_first_word',
+        _LENGTH + 'number_sentences',
+        _LENGTH + 'nth_paragraph_first_word',
+    },
+    _LENGTH + 'number_words': {_LENGTH + 'number_words'},
+    _LENGTH + 'nth_paragraph_first_word': {
+        _LENGTH + 'nth_paragraph_first_word',
+        _LENGTH + 'number_paragraphs',
+    },
+    _CONTENT + 'number_placeholders': {_CONTENT + 'number_placeholders'},
+    _CONTENT + 'postscript': {_CONTENT + 'postscript'},
+    _FORMAT + 'number_bullet_lists': {_FORMAT + 'number_bullet_lists'},
+    # TODO(jeffreyzhou): Pre-create paragraph or use prompt to replace
+    # _CONTENT + "rephrase_paragraph": instructions.RephraseParagraph,
+    _FORMAT + 'constrained_response':
+    set(INSTRUCTION_DICT.keys()),
+    _FORMAT + 'number_highlighted_sections':
+    {_FORMAT + 'number_highlighted_sections'},
+    _FORMAT + 'multiple_sections': {
+        _FORMAT + 'multiple_sections',
+        _LANGUAGE + 'response_language',
+        _FORMAT + 'number_highlighted_sections',
+    },
+    # TODO(tianjianlu): Re-enable rephrasing with preprocessing the message.
+    # _FORMAT + "rephrase": instructions.RephraseChecker,
+    _FORMAT + 'json_format':
+    set(INSTRUCTION_DICT.keys()).difference(
+        {_KEYWORD + 'forbidden_words', _KEYWORD + 'existence'}),
+    _FORMAT + 'title': {_FORMAT + 'title'},
+    # TODO(tianjianlu): Re-enable with specific prompts.
+    # _MULTITURN + "constrained_start": instructions.ConstrainedStartChecker,
+    _COMBINATION + 'two_responses':
+    set(INSTRUCTION_DICT.keys()).difference({
+        _KEYWORD + 'forbidden_words', _KEYWORD + 'existence',
+        _LANGUAGE + 'response_language', _FORMAT + 'title',
+        _PUNCTUATION + 'no_comma'
+    }),
+    _COMBINATION + 'repeat_prompt':
+    set(INSTRUCTION_DICT.keys()).difference(
+        {_KEYWORD + 'existence', _FORMAT + 'title',
+         _PUNCTUATION + 'no_comma'}),
+    _STARTEND + 'end_checker': {_STARTEND + 'end_checker'},
+    _CHANGE_CASES + 'capital_word_frequency': {
+        _CHANGE_CASES + 'capital_word_frequency',
+        _CHANGE_CASES + 'english_lowercase',
+        _CHANGE_CASES + 'english_capital',
+    },
+    _CHANGE_CASES + 'english_capital': {_CHANGE_CASES + 'english_capital'},
+    _CHANGE_CASES + 'english_lowercase': {
+        _CHANGE_CASES + 'english_lowercase',
+        _CHANGE_CASES + 'english_capital',
+    },
+    _PUNCTUATION + 'no_comma': {_PUNCTUATION + 'no_comma'},
+    _STARTEND + 'quotation': {_STARTEND + 'quotation', _FORMAT + 'title'},
+}
+
+
+def conflict_make(conflicts):
+    """Makes sure if A conflicts with B, B will conflict with A.
+
+    Args:
+      conflicts: Dictionary of potential conflicts where key is instruction id
+        and value is set of instruction ids that it conflicts with.
+
+    Returns:
+      Revised version of the dictionary. All instructions conflict with
+      themselves. If A conflicts with B, B will conflict with A.
+    """
+    for key in conflicts:
+        for k in conflicts[key]:
+            conflicts[k].add(key)
+        conflicts[key].add(key)
+    return conflicts
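Note on `conflict_make`: the function closes the conflict table under symmetry and reflexivity, so that a one-directional entry such as "A conflicts with B" also yields "B conflicts with A". The sketch below shows the same idea on a toy table. It is an illustration under assumptions rather than the committed code: `make_symmetric` is a hypothetical name, it returns a new dictionary instead of mutating its argument, and it uses `setdefault` so that an id appearing only as a value does not raise `KeyError` (the committed version assumes every referenced id is also a key).

```python
def make_symmetric(conflicts: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return a copy where conflicts are mutual and every id conflicts with itself."""
    result = {key: set(values) for key, values in conflicts.items()}
    for key, values in conflicts.items():
        result[key].add(key)  # reflexive: A conflicts with A
        for other in values:
            # symmetric: if A lists B, make sure B lists A
            result.setdefault(other, set()).add(key)
    return result


toy = {"a": {"b"}, "b": set(), "c": {"c"}}
closed = make_symmetric(toy)
print(closed["b"] == {"a", "b"})  # "b" now conflicts with "a" and with itself
```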
--- a/opencompass/datasets/IFEval/instructions_util.py
+++ b/opencompass/datasets/IFEval/instructions_util.py
--- a/opencompass/datasets/__init__.py
+++ b/opencompass/datasets/__init__.py
@@ -47,6 +47,7 @@ from .huggingface import *  # noqa: F401, F403
 from .humaneval import *  # noqa: F401, F403
 from .humanevalx import *  # noqa: F401, F403
 from .hungarian_math import *  # noqa: F401, F403
+from .IFEval import *  # noqa: F401, F403
 from .infinitebench import *  # noqa: F401, F403
 from .iwslt2017 import *  # noqa: F401, F403
 from .jigsawmultilingual import *  # noqa: F401, F403