[Feature] MuSR Dataset Evaluation (#1689)

* MuSR Dataset Evaluation
* MuSR Dataset Evaluation: add an assertion and a Readme.md
2025-05-30 16:03:24 +08:00 · 2024-11-14 20:42:12 +08:00 · 2024-11-14 20:42:12 +08:00 · e9e4b69ddb
commit e9e4b69ddb
parent d415439f9b
13 changed files with 1539 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
 - **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](configs/eval_musr.py) and give it a spin! 🔥🔥🔥
 - **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](configs/eval_babilong.py) and give it a try! 🔥🔥🔥
 - **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
 - **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) with multiple backend(huggingface/vllm/lmdeploy). Feel free to give them a try! 🔥🔥🔥
--- a/configs/eval_musr.py
+++ b/configs/eval_musr.py
@@ -0,0 +1,44 @@
 from mmengine.config import read_base
 import os.path as osp
 with read_base():
    from opencompass.configs.datasets.musr.musr_gen import musr_datasets
    # from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import (
        models as lmdeploy_internlm2_5_7b_chat_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
        models as lmdeploy_qwen2_5_14b_instruct_model,
    )
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import (
        models as lmdeploy_yi_1_5_9b_chat_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import (
        models as lmdeploy_qwen2_5_32b_instruct_model,
    )
    from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import (
        models as lmdeploy_glm4_9b_chat_model,
    )
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import (
        models as lmdeploy_llama3_1_8b_instruct_model,
    )
    from opencompass.configs.models.mistral.lmdeploy_ministral_8b_instruct_2410 import (
        models as lmdeploy_ministral_8b_instruct_2410_model,
    )
    from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import (
        models as lmdeploy_gemma_9b_it_model,
    )
    from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import (
        models as lmdeploy_gemma_27b_it_model,
    )
    from opencompass.configs.summarizers.groups.musr_average import summarizer
 datasets = [*musr_datasets]
 models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
 base_exp_dir = 'outputs/musr/'
 work_dir = osp.join(base_exp_dir, 'musr_eval')
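The `models = sum([...], [])` line relies on each imported `*_model` name being a list of model configs: `sum` with an empty list as its start value concatenates those lists. A minimal standalone sketch of the idiom follows. The dictionary `namespace` stands in for `locals()`, and the placeholder dicts are made up for illustration rather than taken from real OpenCompass model configs.

```python
# Placeholder lists standing in for the imported `*_model` variables.
# The dict contents are illustrative only, not real OpenCompass configs.
namespace = {
    'a_chat_model': [dict(abbr='model-a')],
    'b_instruct_model': [dict(abbr='model-b'), dict(abbr='model-b-long')],
    'summarizer': dict(dataset_abbrs=[]),  # ignored: name lacks the suffix
}

# Same pattern as the config: keep values whose name ends with '_model',
# then concatenate the lists by summing them onto an empty list.
models = sum([v for k, v in namespace.items() if k.endswith('_model')], [])

print([m['abbr'] for m in models])
```

Any imported name that does not end in `_model` (such as `summarizer` or `musr_datasets`) is left out of the resulting list.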
--- a/opencompass/configs/datasets/musr/README.md
+++ b/opencompass/configs/datasets/musr/README.md
@@ -0,0 +1,75 @@
 # MuSR: Multistep Soft Reasoning Dataset
 MuSR (Multistep Soft Reasoning) is a dataset designed to evaluate large language models (LLMs) on complex reasoning tasks embedded in natural language narratives. Created to challenge state-of-the-art models such as GPT-4, MuSR emphasizes nuanced reasoning across several domains, including social and physical reasoning, commonsense reasoning, and planning. Its tasks are framed within realistic scenarios: murder mysteries, object placements, and team allocations.
 ## Overview
 ### Purpose
 Current large language models can perform complex tasks through prompting techniques like chain-of-thought reasoning. However, robust multistep reasoning remains challenging. MuSR addresses these limitations by evaluating LLM performance on tasks involving multistep reasoning in three domains:
 - **Murder Mysteries**: Requires social and physical deductive reasoning.
 - **Object Placements**: Tests observational and theory-of-mind reasoning.
 - **Team Allocations**: Focuses on social reasoning and constraint satisfaction.
 ### Dataset Construction
 MuSR instances are generated using a neurosymbolic synthetic-to-natural narrative generation algorithm. This approach allows for the creation of complex reasoning instances that combine structured reasoning trees with natural language narratives, challenging both direct and nuanced inference capabilities in LLMs.
 MuSR's dataset consists of:
 - **Murder Mysteries**: Scenarios with suspects, motives, and opportunities requiring deductive inference.
 - **Object Placements**: Scenarios where individuals' observations inform reasoning about object locations.
 - **Team Allocations**: Scenarios that simulate social relationships and teamwork for optimal task assignments.
 ### Dataset Access
 The MuSR dataset is publicly available, with instructions provided in the [GitHub project](https://github.com/Zayne-Sprague/MuSR). You can download the dataset and use the predefined prompts or create your own configurations.
 ### Evaluation
 1. Install dependencies and configure the environment.
 2. Run evaluations using `opencompass configs/eval_musr.py` to assess LLM performance.
 3. Analyze results against human performance benchmarks.
 ### Example Command
 ```bash
 opencompass configs/eval_musr.py
 ```
 ## Baselines and Results
 MuSR includes baseline results for multiple LLMs evaluated with chain-of-thought and advanced reasoning strategies. These benchmarks assess model accuracy on reasoning tasks across the three domains.
 | Domain           | Baseline Accuracy (GPT-4) | Human Performance |
 |------------------|---------------------------|--------------------|
 | Murder Mystery   | 80.4%                     | 94.1%             |
 | Object Placement | 60.9%                     | 95.0%             |
 | Team Allocation  | 68.4%                     | 100%              |

 OpenCompass results for the models listed in `configs/eval_musr.py`:

 | dataset | version | metric | mode | internlm2_5-7b-chat-turbomind | qwen2.5-7b-instruct-turbomind | qwen2.5-14b-instruct-turbomind | yi-1.5-9b-chat-turbomind | qwen2.5-32b-instruct-turbomind | glm-4-9b-chat-turbomind | llama-3_1-8b-instruct-turbomind | ministral-8B-instruct-2410-turbomind | gemma-2-9b-it-turbomind | gemma-2-27b-it-turbomind |
 |----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----|
 | musr_murder_mysteries | a5ce30 | accuracy | gen | 59.20 | 63.20 | 76.00 | 68.80 | 78.80 | 71.20 | 73.60 | 73.60 | 74.80 | 77.20 |
 | musr_object_placements | a5ce30 | accuracy | gen | 54.69 | 56.25 | 57.42 | 52.73 | 66.02 | 49.22 | 57.42 | 60.94 | 60.94 | 62.11 |
 | musr_team_allocation | a5ce30 | accuracy | gen | 39.20 | 32.40 | 55.60 | 40.00 | 67.60 | 50.40 | 46.00 | 36.40 | 40.80 | 41.20 |
 | musr_average | - | naive_average | gen | 51.03 | 50.62 | 63.01 | 53.84 | 70.81 | 56.94 | 59.01 | 56.98 | 58.85 | 60.17 |
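
 The `musr_average` row uses the `naive_average` metric, and its values are consistent with an unweighted mean of the three subset accuracies. The short check below reproduces the first two model columns from the numbers in the table above. The helper name `naive_average` is only illustrative and is not taken from the OpenCompass code.

```python
def naive_average(scores):
    """Unweighted mean of subset accuracies, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Accuracies copied from the table: murder mysteries, object placements,
# team allocation.
subset_scores = {
    'internlm2_5-7b-chat-turbomind': [59.20, 54.69, 39.20],
    'qwen2.5-7b-instruct-turbomind': [63.20, 56.25, 32.40],
}

for model, scores in subset_scores.items():
    print(model, naive_average(scores))
```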
 ## Citation
 If you use MuSR in your research, please cite:
 ```bibtex
@misc{sprague2024musrtestinglimitschainofthought,
      title={MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning}, 
      author={Zayne Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
      year={2024},
      eprint={2310.16049},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2310.16049}, 
 }
 ```
 ## Details
 For further details, please refer to the MuSR paper [here](https://arxiv.org/abs/2310.16049).
--- a/opencompass/configs/datasets/musr/musr_gen.py
+++ b/opencompass/configs/datasets/musr/musr_gen.py
@@ -0,0 +1,135 @@
 from opencompass.datasets import MusrDataset, MusrEvaluator
 from opencompass.openicl import PromptTemplate, ZeroRetriever, GenInferencer
 DATASET_CONFIGS = {
    'murder_mysteries': {
        'abbr': 'musr_murder_mysteries',
        'name': 'murder_mysteries',
        'path': 'opencompass/musr',  
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
    'object_placements': {
        'abbr': 'musr_object_placements',
        'name': 'object_placements',
        'path': 'opencompass/musr',
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
    'team_allocation': {
        'abbr': 'musr_team_allocation',
        'name': 'team_allocation',
        'path': 'opencompass/musr',
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
 }
 musr_datasets = []
 for config in DATASET_CONFIGS.values():
    dataset = dict(
        abbr=config['abbr'],
        type=MusrDataset,
        path=config['path'],
        name=config['name'],
        reader_cfg=config['reader_cfg'],
        infer_cfg=config['infer_cfg'],
        eval_cfg=config['eval_cfg'],
    )
    musr_datasets.append(dataset)
--- a/opencompass/configs/summarizers/groups/musr_average.py
+++ b/opencompass/configs/summarizers/groups/musr_average.py
@@ -0,0 +1,19 @@
 summarizer = dict(
    dataset_abbrs=[
        'musr_murder_mysteries',
        'musr_object_placements',
        'musr_team_allocation',
        'musr_average'
    ],
    summary_groups=[
        {
            'name': 'musr_average',
            'subsets': [
                'musr_murder_mysteries',
                'musr_object_placements',
                'musr_team_allocation',
            ],
        }
    ],
 )
--- a/opencompass/datasets/__init__.py
+++ b/opencompass/datasets/__init__.py
@@ -87,6 +87,7 @@ from .mmlu_pro import *  # noqa: F401, F403
 from .MMLUArabic import *  # noqa: F401, F403
 from .mmmlu import *  # noqa: F401, F403
 from .multirc import *  # noqa: F401, F403
 from .musr import *  # noqa: F401, F403
 from .narrativeqa import *  # noqa: F401, F403
 from .natural_question import *  # noqa: F401, F403
 from .natural_question_cn import *  # noqa: F401, F403
--- a/opencompass/datasets/musr/__init__.py
+++ b/opencompass/datasets/musr/__init__.py
@@ -0,0 +1 @@
 from .musr import *  # noqa: F401, F403
--- a/opencompass/datasets/musr/murder_mystery_solved_ex.py
+++ b/opencompass/datasets/musr/murder_mystery_solved_ex.py
@@ -0,0 +1,81 @@
 # flake8: noqa: E501
 story = """
 In the smoke-filled haze of a thriving jazz club, Alice met her explosive end, leaving Detective Winston to sift through the suspects: Eugene, the shy pianist, and Gabrielle, the sassy club singer.
 While seated at his desk at the precinct, Winston received a phone call from a certain observant local bartender, tipping off the police about a harsh row peaking in a nearby jazz club. He signaled to his partner as they promptly dispatched to the scene, already ringing with sirens and a restless crowd.
 With the police line restraining the horde, the jazz club was undergoing a full round-up as Winston approached the informative bartender. The bartender was engrossed in his account to the officers about a raucous, punch throwing fight Eugene was part of, to his best recollection. Winston remembered Eugene, a jazz fanatic—lurking around the jazz corners more often than anyone else could recount.
 In the heart of the upheaval, lay a woman sprawled on the floor, later identified as Alice, a frequent face at the jazz scene and a financial analyst deeply engrossed in financial transactions. In public, Alice had made her concerns known about her discovery of fraudulent transactions at the bank, promising to report the same to the authorities. Eugene, remembered conspicuously for being a bank teller at the same bank Alice worked at, suddenly seemed closely linked.
 Eugene’s arrest was far from hushed, with the local news broadcasting the progressing drama live, catching sight of Eugene curtailed in handcuffs. Concurrently, it was ascertained—Eugene was a member of the jazz club. This evidence backed by a jazz club membership card retrieved from his wallet during the arrest.
 Just a few steps away, he noticed a man in a suit, the bouncer, a calm figure amid the bedlam. In their conversation, the bouncer corroborated that he had indeed seen Eugene involved in a heated scuffle, landing a few punches. The whisperings were starting to gain momentum, since Eugene was believed to be on the losing end of a lawsuit—a battle courtesy of Alice charging Eugene with the financial fraud she had publicly vowed to expose.
 Eugene was known for his frequent presence at the jazz club and on top of that, was an actual member. Therefore, it was hardly a leap to presume Alice meeting her untimely end at the club was no mere happenstance. The jazz club, despite its dim lights and pulsating music, was a public place easily accessible to all, including potential suspects like Eugene and, sadly, the ill-starred Alice.
 Det. Winston knew he was now tasked with a cryptic puzzle. A bank teller, embroiled in suspected fraud and a lawsuit, a jazz club murder scene and a local financial analyst—all woven into a ghastly murder mystery. He sighed in distaste as Eugene was escorted away—a man still oblivious to the chain of events waiting for him. But Winston knew, the night had only just begun for him.
 Winston stared down at the crumpled microphone on the floor. He picked it up gingerly, turning it in his hand. The club was in disarray, debris scattered like confetti. The lab boys were still picking pieces of the grenade apart.
 "Gabrielle's microphone," the coroner confirmed, barely looking up from his task.
 "Give him the once-over for evidence," Winston said, handing the microphone to a nearby officer.
 Leaving the club behind him, Winston sighed heavily. The world of jazz had taken a dark turn that night. Alice, the acclaimed critic with her sarcastic wit and keen critical eye, had been last seen alive here. Her purse lay in the club untouched, a testament to the abruptness of the event.
 Gabrielle had been working as a war correspondent. Winston had read her articles. They were richly detailed, passionate, and highlighted the harsh reality of war zones. Gabrielle hadn't been shy about sharing her experiences or publicly criticizing the military in her pieces. She boldly interviewed military personnel and spent extended periods in conflict zones.
 Alice, though, never missed a chance to pick apart Gabrielle's articles. The vitriolic snippets in Alice’s column were regular features and Gabrielle's staunch defense of her articles, her work in the jazz scene, did little against Alice's respected reputation.
 The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene.
 Winston cast his gaze over the club once more—a hub of pulsating rhythms now eerily silent.
 A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor.
 With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale.
 Who is the most likely murderer?
 Pick one of the following choices:
 1 - Eugene
 2 - Gabrielle
 You must pick one option. Before selecting a choice, explain your reasoning step by step. The murderer needs to have a means (access to weapon), motive (reason to kill the victim), and opportunity (access to crime scene) in order to have killed the victim. Innocent suspects may have two of these proven, but not all three. An innocent suspect may be suspicious for some other reason, but they will not have all of motive, means, and opportunity established.
 If you believe that both suspects have motive, means, and opportunity, you should make an educated guess pick the one for whom these are best established. If you believe that neither suspect has all three established, then choose the suspect where these are most clearly established. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
 """.strip()
 reasoning = """
 Let's break this down step-by-step by first deducing which of the two suspects has a means, motive, and opportunity.
 We will start with Eugene.
 Eugene was being sued by Alice for fraudulent transactions.  The charge was also very public.  Both of these facts point to Eugene having a strong motive.
 Because Eugene has a jazz club membership, and we can deduce that the jazz club membership belongs to the same club Alice was murdered in, we can assume Eugene has an opportunity to commit the crime.
 Although we know Eugene is aggressive because he was throwing punches in the story, we do not know if he has access to the murder weapon.  Because he does not have access to a grenade, he does not have a means.
 Let's review Gabrielle next.
 Gabrielle's purse was found at the scene of the crime, and we can then assume she had the opportunity to kill Alice.
 Because Gabrielle has been in conflict zones with military personnel, it's possible that she has access to a grenade.  We can say that Gabrielle has a potential means to kill the victim.
 Finally, it appears that Gabrielle and Alice had a rivalry over journalism, which could have boiled up into physical action.  Because of this, we can say that Gabrielle has a potential motive to kill the victim.
 Now, reviewing the evidence, we see that:
 Eugene has a motive and opportunity but no means.
 Gabrielle has a motive, means, and opportunity.
 Therefore, Gabrielle is the most likely murderer.
 ANSWER: 2
 """.strip()
 murder_mystery_solved_ex = f'{story}\n\n{reasoning}'
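The prompt asks the model to finish with `ANSWER: (your answer here, including the choice number)`, and the solved example accordingly ends with `ANSWER: 2`. One way such a final line could be turned into a choice number is sketched below. This is an assumption for illustration, not the `MusrEvaluator` implementation, and `extract_choice` is a hypothetical helper.

```python
import re
from typing import Optional


def extract_choice(response: str) -> Optional[int]:
    """Return the first integer after the last 'ANSWER:' marker, if any."""
    marker = 'ANSWER:'
    pos = response.rfind(marker)
    if pos == -1:
        return None  # the model never produced the marker
    match = re.search(r'\d+', response[pos + len(marker):])
    return int(match.group()) if match else None


print(extract_choice('Gabrielle has all three.\nANSWER: 2'))
print(extract_choice('ANSWER: 1 - Eugene'))
print(extract_choice('No final answer given.'))
```

Searching from the last marker means an `ANSWER:` string quoted earlier in the reasoning does not override the final one.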
--- a/opencompass/datasets/musr/musr.py
+++ b/opencompass/datasets/musr/musr.py
@@ -0,0 +1,309 @@
 # flake8: noqa: E501
 import json
 import os.path as osp
 from datasets import Dataset
 from opencompass.datasets.base import BaseDataset
 from opencompass.openicl import BaseEvaluator
 from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET
 from opencompass.utils import get_data_path
 from .murder_mystery_solved_ex import murder_mystery_solved_ex
 from .object_placements_solved_ex import object_placements_solved_ex
 from .team_allocation_solved_ex import team_allocation_solved_ex
 from .tree import LogicTree
 DATASET_CONFIGS = {
    'murder_mysteries': {
        'file_name':
        'murder_mysteries.json',
        'ex':
        murder_mystery_solved_ex,  # write user example here
        'system_prompt':
        'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('Before selecting a choice, explain your reasoning step by step. '
         'The murderer needs to have a means (access to weapon), motive (reason to kill the victim), '
         'and opportunity (access to crime scene) in order to have killed the victim. '
         'Innocent suspects may have two of these proven, but not all three. '
         'An innocent suspect may be suspicious for some other reason, but they will not have all of motive, '
         'means, and opportunity established.\n\n'
         'If you believe that both suspects have motive, means, and opportunity, you should make an educated guess '
         'and pick the one for whom these are best established. If you believe that neither suspect has all '
         'three established, then choose the suspect where these are most clearly established.'
         ),
        'hint_before_question':
        False,
        'answer_index_modifier':
        1
    },
    'object_placements': {
        'file_name':
        'object_placements.json',
        'ex':
        object_placements_solved_ex,
        'skip_ablated':
        True,
        'ablation_depth_modifier':
        2,
        'system_prompt':
        'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('Based on this story, we want to identify where someone believes that a certain object is at the end of '
         'the story. In order to do that, you need to read the story and keep track of where they think the object '
         'is at each point. When an object is moved, the person may observe its new location if they saw it move.\n\n'
         'To see where an object ends up, they must be able to see the location that it moves to and not be too '
         'distracted by what they are doing. If they do not observe the object moving, then they will still believe '
         'it to be in the last location where they observed it.'),
        'hint_before_question':
        True,
        'answer_index_modifier':
        1
    },
    'team_allocation': {
        'file_name':
        'team_allocation.json',
        'ex':
        team_allocation_solved_ex,
        'system_prompt':
        'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('The story should allow you to determine how good each person is at a skill. Roughly, each person is '
         'either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks '
         'that uses their skills as well as possible. In addition, one task will have to have two people assigned '
         'to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the '
         'overall quality of the assignment.\n\n'
         'When two people need to work on a task and one is bad at it, they don\'t necessarily benefit from the '
         'other person being good, unless they work well together.\n\n'
         'With different strengths, weaknesses, and interpersonal dynamics at play, you should allocate your team '
         'to find the single assignment to ensure that the tasks overall are completed as effectively as possible.'
         ),
        'hint_before_question':
        False,
        'answer_index_modifier':
        1
    }
 }
@LOAD_DATASET.register_module()
 class MusrDataset(BaseDataset):
    """MuSR.
    Args:
        path (str): path to dataset
        name (str): name of dataset
        self_consistency_n (int): Number of prompt copies generated per question for self-consistency sampling
        exclude_contrastive_examples (bool): Whether to exclude contrastive examples
        reverse_contrastive_sample (bool): Whether to reverse the selection of contrastive samples
        skip_ablated (bool): Whether to skip ablated samples
        offset (int): Starting offset for the dataset
        sample_size (int): Sample size, None indicates using the entire dataset.
    """
    @staticmethod
    def load(path,
             name,
             self_consistency_n=1,
             exclude_contrastive_examples=False,
             reverse_contrastive_sample=False,
             skip_ablated=False,
             randomize=False,
             offset=0,
             sample_size=None,
             **kwargs):
        """Load the dataset and flatten fields while constructing prompts,
        taking self_consistency_n and ablations into account."""
        if name not in DATASET_CONFIGS:
            raise ValueError(
                f'Dataset name {name} not supported. Must be one of {list(DATASET_CONFIGS.keys())}'
            )
        config = DATASET_CONFIGS[name]
        path = get_data_path(path)
        file_path = osp.join(path, config['file_name'])
        with open(file_path, 'r', encoding='utf-8') as f:
            dataset = json.load(f)
        filtered_dataset = []
        hashes_done = []
        for example in dataset:
            if exclude_contrastive_examples and example['questions'][0].get('intermediate_data') and \
               len(example['questions'][0].get('intermediate_data')) > 0 and \
               example['questions'][0]['intermediate_data'][0].get('story_hash_id'):
                story_hash = example['questions'][0]['intermediate_data'][0][
                    'story_hash_id']
                if story_hash in hashes_done:
                    if reverse_contrastive_sample:
                        filtered_dataset.append(example)
                    else:
                        continue
                elif not reverse_contrastive_sample:
                    filtered_dataset.append(example)
                hashes_done.append(story_hash)
            else:
                filtered_dataset.append(example)
        filtered_dataset = filtered_dataset[
            offset:offset +
            min(len(filtered_dataset), sample_size) if sample_size else None]
        ablations = [
            # {'prompt': 'regular', 'name': 'regular'},
            # {'prompt': 'cot', 'name': 'cot'},
            {
                'prompt': 'cot+',
                'name': 'cot+'
            },
        ]
        # create prompts
        flattened_data = []
        for example in filtered_dataset:
            context = example['context']
            questions = example['questions']
            for question in questions:
                choices_list = question['choices']
                choices_str = '\n'.join([
                    f'{idx + 1} - {choice}'
                    for idx, choice in enumerate(choices_list)
                ])
                gold_answer = question['answer'] + config.get(
                    'answer_index_modifier', 1)
                for ablation in ablations:
                    prompt_style = ablation.get('prompt', 'cot+')
                    ablation_name = ablation.get('name', 'cot+')
                    for scidx in range(self_consistency_n):
                        ex_str = ''
                        if ablation.get('use_example') and config.get('ex'):
                            ex_str = (
                                'Here is an example of solving the task:\n\n' +
                                config.get('ex') +
                                '\n\nThis is the end of the example. The real task is below.\n\n---\n\n'
                            )
                        if prompt_style == 'regular':
                            prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                     f'Pick one of the following choices:\n{choices_str}\n\n' \
                                     'You must pick one option. Finally, the last thing you generate should be "ANSWER: (your answer here, include the choice number)"'
                        elif prompt_style == 'cot':
                            prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                     f'Pick one of the following choices:\n{choices_str}\n\n' \
                                     'You must pick one option. Explain your reasoning step by step before you answer. ' \
                                     'Finally, the last thing you generate should be "ANSWER: (your answer here, include the choice number)"'
                        elif prompt_style == 'cot+':
                            if config.get('hint_before_question'):
                                prompt = f'{ex_str}{context}\n\n{config["hint"]}\n\n{question["question"]}\n\n' \
                                         f'Pick one of the following choices:\n{choices_str}\n\n' \
                                         'You must pick one option. Explain your reasoning step by step before you answer. ' \
                                         'Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"'
                            else:
                                prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                         f'Pick one of the following choices:\n{choices_str}\n\n' \
                                         f'You must pick one option. {config["hint"]} Explain your reasoning step by step before you answer. ' \
                                         'Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"'
                        else:
                            if len(question['intermediate_trees']
                                   ) == 0 or config.get('skip_ablated', False):
                                continue
                            prompt = f'{ex_str}Answer the following questions given the list of facts per answer choice.\n\n'
                            for c, t in zip(choices_str.split('\n'),
                                            question['intermediate_trees']):
                                # extract facts from intermediate_trees
                                facts = list(
                                    set([
                                        x.value for x in
                                        LogicTree.from_json(t).get_facts(
                                            include_cs=ablation.get(
                                                'include_cs', False),
                                            include_deductions_from_level=-1,
                                            no_facts_after_depth=ablation.get(
                                                'no_facts_after_depth', 3) +
                                            config.get(
                                                'ablation_depth_modifier', 0))
                                    ]))
                                if config.get('allow_sorted_facts', True):
                                    facts = sorted(facts)
                                facts_str = '\n'.join(
                                    [f'- {fact}' for fact in facts])
                                prompt += f'Facts for Choice {c}:\n{facts_str}\n\n'
                            prompt += f'Given the list of facts per answer choice, answer the following question\n\n' \
                                      f'{question["question"]}\n\n' \
                                      f'Pick one of the following choices:\n{choices_str}\n\n' \
                                      'You must pick one option. After you have found the answer, say it in this format "ANSWER: (your answer here, include the choice number)"'
                        flattened_example = {
                            'context':
                            context,
                            'question_text':
                            question['question'],
                            'question':
                            question,
                            'answer':
                            question['answer'],
                            'choices':
                            choices_list,
                            'choices_str':
                            choices_str,
                            'intermediate_trees':
                            question.get('intermediate_trees', []),
                            'intermediate_data':
                            question.get('intermediate_data', []),
                            'prompt':
                            prompt,
                            'system_prompt':
                            config.get('system_prompt', ''),
                            'gold_answer':
                            gold_answer,
                            'scidx':
                            scidx,  # self-consistency index
                            'self_consistency_n':
                            self_consistency_n,
                            'ablation_name':
                            ablation_name,
                        }
                        flattened_data.append(flattened_example)
        dataset = Dataset.from_list(flattened_data)
        return dataset
@ICL_EVALUATORS.register_module()
 class MusrEvaluator(BaseEvaluator):
    def __init__(self, answer_index_modifier=1, self_consistency_n=1):
        self.answer_index_modifier = answer_index_modifier
        self.self_consistency_n = self_consistency_n
    def score(self, predictions, references):
        import re
        correct = 0
        assert len(predictions) == len(
            references
        ), 'Predictions and references must have the same length!'
        total = len(predictions)
        for pred, ref in zip(predictions, references):
            # Use the first line that contains the 'ANSWER:' marker.
            answer_line = [
                line for line in pred.split('\n') if 'ANSWER:' in line
            ]
            if answer_line:
                answer = answer_line[0].split('ANSWER:')[-1].strip()
                # The first integer after the marker is the predicted choice.
                match = re.search(r'\d+', answer)
                if match and int(match.group()) == ref:
                    correct += 1
        accuracy = 100 * correct / total if total > 0 else 0
        return {'accuracy': accuracy}
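The answer parsing in `MusrEvaluator.score` can be tried in isolation. The following is a minimal standalone sketch, not part of the commit: the helper names `extract_choice` and `accuracy` are hypothetical, and it assumes references are integers, as the comparison in `score` suggests.

```python
import re
from typing import List, Optional


def extract_choice(pred: str) -> Optional[int]:
    """Return the first integer after 'ANSWER:' on the first line that has it."""
    for line in pred.split('\n'):
        if 'ANSWER:' in line:
            answer = line.split('ANSWER:')[-1].strip()
            match = re.search(r'\d+', answer)
            return int(match.group()) if match else None
    return None


def accuracy(predictions: List[str], references: List[int]) -> float:
    """Percentage of predictions whose extracted choice equals the reference."""
    if not predictions:
        return 0
    correct = sum(
        extract_choice(p) == r for p, r in zip(predictions, references))
    return 100 * correct / len(predictions)
```

A response with no `ANSWER:` line, or with no digit after the marker, yields `None` and is counted as incorrect, which mirrors how the evaluator skips such predictions.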
--- a/opencompass/datasets/musr/object_placements_solved_ex.py
+++ b/opencompass/datasets/musr/object_placements_solved_ex.py
@ -0,0 +1,53 @@
 # flake8: noqa: E501
 story = '''
 Petra, the dedicated housewife, felt a thrill at the thought of her surprise anniversary dinner for her husband, Daniel. She had been skillfully maneuvering around Daniel's eagerness to pitch in without disappointing him or giving up her surprise.
 Daniel, ever-the-observant-husband, noted Petra's unusual enthusiasm about the day's menu. Despite not knowing the details, he appreciated her effort and wanted to help—silently, he decided to deploy his best skill—patiently awaiting his moment to help, maybe when Petra asked for something from the pantry. Amidst the excitement, there was Clara, their maid—ever diligent and efficient, trying to keep the surroundings perfect for this special evening.
 Tucked away, under the counter, was Petra's secret recipe book, her culinary treasure. Her solace in confusing times, her secret weapon during their flavorful adventures. While toward the back of the pantry, was the glass jar of Petra's favorite spice blends—something that Daniel was well aware of, in case an opportunity arose for him to assist or distract when Petra might need it.
 All three residents of the home were aware of each item's location. The secret recipe book under the counter, the glass jar in the pantry, and the anxious excitement that filled the air—a fragrance even more intoxicating than the delicious smells that would soon fill the kitchen.
 With tact and secrecy, Petra relocated her cherished recipe book from its hidden spot under the counter to its temporary home on the kitchen table. The pages were swiftly opened to reveal her secret recipes which she was eager to start preparing for the long-awaited anniversary surprise. While Petra was engrossed in her preparations, Clara continued her sweeping routine in the kitchen. Clara's steady broom strokes on the wooden floor echoed a little in the otherwise busy and humming kitchen. In the background, beyond the kitchen door, Daniel could be seen in the dining room, meticulously setting the table for the anticipated special dinner.
 The placement of the rooms allowed Clara to easily notice Petra's movements in her peripheral vision while she was executing her chores. Every move Petra made was observed in Clara's line of sight. Simultaneously, separated by the walls, Daniel was diligently arranging the tableware in the dining room which was separate from Petra's bustling kitchen.
 Hoping to spruce up the setting, Daniel delicately relocated a glass jar filled with decorative pebbles to the center of the dining table. His subtle contribution for the evening - a perfectly presentable table for their special anniversary dinner. Amidst the flurry of the special day's preparations, Clara diligently carried on with her duties in the upstairs bathroom, unseen from the dining room. Meanwhile, Petra was wholly engrossed in the allure of a new recipe in her cherished, hidden book which lay opened on the kitchen island, away from prying eyes of the dining room.
 In the middle of her usual tidying, Clara spotted Petra's treasured recipe book on the kitchen table. Ensuring it stayed clandestine, Clara carefully transferred it back to its usual hideaway spot beneath the counter. In the midst of the anniversary excitement, Clara deftly transferred Petra's secret weapon back to its hiding place when Daniel stepped out into the garage to retrieve extra utensils. Performing her duty with a sense of urgency, she made sure to move quietly to not disturb Petra, who was engrossed in the process of boiling a massive pot of pasta water on the stove.
 Despite the commotion and fervor in the kitchen, the hubbub did not stretch as far as the garage, which remained undisturbed by the domestic activity occurring in the main part of the house. Meanwhile, in the kitchen, Petra was oblivious to Clara's subtle maneuver while she busied herself at the stove, focused on making sure the large pot of water reached the perfect boil.
 In the end, the careful orchestration of duties by each individual within the house concluded in a harmonious anniversary celebration. The marks of a successful evening consisted of a delectable meal, a serene atmosphere, and the memory of a smooth, incident-free evening where everyone played their role to perfection.
 Based on this story, we want to identify where someone believes that a certain object is at the end of the story. In order to do that, you need to read the story and keep track of where they think the object is at each point. When an object is moved, the person may observe its new location if they saw it move.
 To see where an object ends up, they must be able to see the location that it moves to and not be too distracted by what they are doing. If they do not observe the object moving, then they will still believe it to be in the last location where they observed it.
 Which location is the most likely place Clara would look to find the glass jar given the story?
 Pick one of the following choices:
 1 - dining table
 2 - kitchen table
 3 - pantry
 4 - under counter
 You must pick one option. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
 '''.strip()
 reasoning = '''
 Let's solve this by thinking step-by-step. We want to know where Clara will check to find the glass jar, so let's track where Clara sees the glass jar throughout the story.
 At the beginning of the story, it is stated that "All three residents of the home were aware of each item's location... the glass jar in the pantry." From this, we can conclude that the first place in the story where Clara sees the glass jar is in the pantry.
 Throughout the story, the glass jar only moves once to the dining table. However, while Daniel was moving the glass jar, Clara was upstairs in the restroom carrying out her duties. It's highly unlikely that she saw Daniel move the glass jar, so we can assume that she still believes it to be in the pantry.
 Clara does go to the kitchen in the story and moves a recipe book from the kitchen table, but because it's the kitchen table and not the dining room table, we can assume she hasn't seen the glass jar there.
 Now, given the story and evidence, we can assume that Clara believes the glass jar to be in the pantry.
 ANSWER: 3
 '''.strip()
 object_placements_solved_ex = f'{story}\n\n{reasoning}'
--- a/opencompass/datasets/musr/team_allocation_solved_ex.py
+++ b/opencompass/datasets/musr/team_allocation_solved_ex.py
@ -0,0 +1,72 @@
 # flake8: noqa: E501
 story = '''
 In the quaint community of Midvale, the local school stood as a beacon of enlightenment, nurturing the minds of the next generation. The teachers, the lifeblood of this institution, were tasked with the noble duty of education, while the unsung heroes—the maintenance crew—ensured the smooth functioning of the school's infrastructure. Amidst this, three town residents, Angela, Greg, and Travis, found themselves at a juncture of life where they were presented with the opportunity to serve in one of these crucial roles. The challenge now lay in the hands of the manager, who had to assign them to either teaching or maintenance, a decision that would set the course for their contributions to the school.
 Angela was a fiercely independent woman, beset with a unique set of strengths and weaknesses. She was a woman of very few words, often finding it hard to articulate her thoughts and explain things clearly. Venturing into education seemed a maze with her apathetic attitude towards learning. She was also seen to be disinterested in reading and the literary field as a whole. This was a juxtaposition to her inability to contribute to maintenance duties because of her fear of tools and machinery, a sinister remnant of a past accident that still haunted her. The basic handyman skills, which most locals learned growing up, were also absent from her repertoire.
 Angela's interactions with Greg and Travis further complicated the equation. On one hand, Greg and Angela had a habit of arguing constantly over trivial matters, which once culminated in their failure to complete a shared basic training exercise adequately. On the other hand, Angela and Travis simply had nothing in common. Their conversations were often fraught with awkward silences, indicative of their lack of shared interests. This lack of coordination was epitomized during a recent team-building exercise when their team finished last.
 Greg was the blue-collar type with a broad frame and muscular build. He had a work ethic that never shied away from toiling through the day to get things done. Growing up, he often helped his father with simple home repairs and minor renovations, learning the ropes of basic handiwork. Additionally, Greg had fortified his skills while refurbishing an old shed with Travis, a testament to their compatible personalities. However, his dislike for education was well known throughout town, further amplified by his lack of patience, especially with children.
 Travis, the third cog in the wheel, was a man of many peculiarities. His stage fright was almost legendary and made it nearly impossible for him to stand in front of a crowd. Often, the mere thought of it could unnerve him. His physical constitution was lightweight and fragile, and long hours of manual labor made him weary. He also had a revulsion towards dirt that he complained about at every opportune moment. Like the others, studying did not appeal to him much, so much so that he had stopped reading completely after leaving school prematurely.
 The manager understood well that a team’s success depends heavily on the contribution and compatibility of each member. He observed, analyzed, and considered. Now, it was up to him to assign roles to Angela, Greg, and Travis. The school needed educators and maintenance staff, and each had to play their part perfectly.
 Given the story, how would you uniquely allocate each person to make sure both tasks are accomplished efficiently?
 Pick one of the following choices:
 1 - Teaching: Travis, Maintenance: Angela and Greg
 2 - Teaching: Greg, Maintenance: Angela and Travis
 3 - Teaching: Angela, Maintenance: Greg and Travis
 You must pick one option. The story should allow you to determine how good each person is at a skill. Roughly, each person is either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks that uses their skills as well as possible. In addition, one task will have to have two people assigned to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the overall quality of the assignment.
 When two people need to work on a task and one is bad at it, they don’t necessarily benefit from the other person being good, unless they work well together.
 With different strengths, weaknesses, and interpersonal dynamics at play, you should allocate your team to find the single assignment to ensure that the tasks overall are completed as effectively as possible.
 Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
 '''.strip()
 reasoning = '''
 Let's solve this by thinking step-by-step. First, we will figure out each person's skill level for each task. Then, we can measure how well they all work together in pairs. From this, we can find the most efficient assignment that maximizes the scores.
 Let's start with Angela. Angela can't articulate her thoughts, and she seems unprepared for teaching. So, let's assume her skill level is 1 for teaching. She also is bad at maintenance due to her fear of tools and machinery. So, let's assume her skill level is 1 for maintenance as well.
 Now, let's look at Greg. Greg has a dislike for education and a lack of patience, so let's assume his skill level for teaching is 1. However, Greg has helped with home repairs and minor renovations, so let's assume his maintenance skill level is 2.
 Finally, let's look at Travis. Travis has extreme stage fright, which will make it difficult to teach, so let's assume his teaching skill level is 1. He also has a lightweight and fragile frame as well as hates dirt, so let's assume his maintenance skill level is 1.
 Now, let's look at the relationships and how people work together.
 Angela and Greg do not get along; they are constantly arguing, so let's assume their ability to work together is 1.
 Angela and Travis aren't much better. They both have nothing in common, and they couldn't do a team-building exercise previously, so let's assume their ability to work together is 1.
 Finally, Greg and Travis have worked together, and their personalities seem to meld, so let's assume they work well together with a score of 3.
 Let's summarize and figure out the best assignment.
 Angela is bad at teaching. (1)
 Angela is bad at maintenance. (1)
 Angela does not work well with Greg. (1)
 Angela does not work well with Travis. (1)
 Greg is bad at teaching. (1)
 Greg is okay with maintenance. (2)
 Greg and Travis work well together. (3)
 Travis is bad at teaching. (1)
 Travis is bad at maintenance. (1)
 Now, let's find the best assignment.
 Option 1: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
 Option 2: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly together (1) = 4
 Option 3: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7
 So, from this, we can see Option 3 has the maximum score.
 ANSWER: 3
 '''.strip()
 team_allocation_solved_ex = f'{story}\n\n{reasoning}'
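The arithmetic in the solved example above can be checked mechanically. This is an illustrative sketch only, not part of the dataset code: the dictionaries and the `option_score` helper are hypothetical names, and the 1/2/3 values are the scores assumed in the reasoning text (1 = bad, 2 = okay, 3 = great).

```python
# Skill scores assumed in the worked reasoning.
skill = {
    ('Angela', 'teaching'): 1, ('Angela', 'maintenance'): 1,
    ('Greg', 'teaching'): 1, ('Greg', 'maintenance'): 2,
    ('Travis', 'teaching'): 1, ('Travis', 'maintenance'): 1,
}

# Pairwise teamwork scores assumed in the worked reasoning.
teamwork = {
    frozenset({'Angela', 'Greg'}): 1,
    frozenset({'Angela', 'Travis'}): 1,
    frozenset({'Greg', 'Travis'}): 3,
}

# Choice number -> (teacher, pair assigned to maintenance).
options = {
    1: ('Travis', ('Angela', 'Greg')),
    2: ('Greg', ('Angela', 'Travis')),
    3: ('Angela', ('Greg', 'Travis')),
}


def option_score(teacher, maintainers):
    """Sum the individual skill scores and the pair's teamwork score."""
    first, second = maintainers
    return (skill[(teacher, 'teaching')]
            + skill[(first, 'maintenance')]
            + skill[(second, 'maintenance')]
            + teamwork[frozenset(maintainers)])


scores = {choice: option_score(*assignment)
          for choice, assignment in options.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Summing the four terms per option reproduces the totals in the reasoning, with Option 3 scoring highest.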
--- a/opencompass/datasets/musr/tree.py
+++ b/opencompass/datasets/musr/tree.py
@ -0,0 +1,739 @@
 # flake8: noqa: E501
 """WARNING (or more like an aggressive note).
 A lot of functionality was implemented here for earlier experiments, most of which is not used.  We have left it here
 for backwards compatibility with the current dataset as well as because why not.
 ALSO NOTE:
 This file was created to have no dependencies on anything in the repo for a reason.  You can copy this file into your
 own project and use the classes to parse/visualize/edit the logic trees in the dataset or create your own.
 FINAL NOTE:
 See examples of how to create LogicNodes and LogicTrees in the __main__ part of the file.
 """
 import random
 from copy import deepcopy
 from enum import Enum
 from typing import Any, Dict, List
 import numpy as np
 class LogicNodeOperatorType:
    """How should the deduction combine the nodes (choose will randomly sample
    and/or when populate is called)"""
    AND = 'and'
    OR = 'or'
    CHOOSE = 'choose'
 class LogicNodeFactType:
    """Is a node explicit (mentioned in the story) or commonsense knowledge
    (left unsaid)"""
    EXPLICIT = 'explicit'
    COMMONSENSE = 'commonsense'
 class LogicNodeConstraints:
    """Useful for things like children = ['X is the murderer', 'Y is the murderer', 'Z is the murderer']; we no longer use this structure, though."""
    ONLY_ONE_CAN_BE_TRUE = 'Only one child can be true'
 class LogicNodeDeductionType:
    """What type of deduction should be used here (not used currently)"""
    SYLLOGISM = 'syllogism'
    TEMPORAL = 'temporal'
    SPATIAL = 'spatial'
    CHOOSE = 'choose'
 class LogicNode:
    """A LogicNode is a tree primitive.
    It is either a deduction or a leaf fact.  Leaf facts are the ones that we
    use in story generation (if they are explicit facts and not commonsense).
    """
    value: str
    children: List['LogicNode']
    fact_type: str
    operator: str
    constraints: List[str]
    deduction_type: str
    prunable: bool
    can_be_leaf: bool
    def __init__(
        self,
        value: str = '',
        children: List['LogicNode'] = None,
        operator: str = LogicNodeOperatorType.OR,
        fact_type: str = LogicNodeFactType.EXPLICIT,
        constraints: List[str] = (),
        deduction_type: str = None,
        prunable: bool = True,
        can_be_leaf: bool = False,
        frozen: bool = False,
    ):
        """
        :param value: Content for this specific node (also the deduction of the children).
        :param children: The children for this node.
        :param operator: Should the children be "And"ed or "Or"ed to create the deduction (the content of this node).
        :param fact_type: Explicit or commonsense
        :param constraints: Not used anymore (see LogicNodeConstraints)
        :param deduction_type: Not used anymore (see LogicNodeDeductionType)
        :param prunable: Can this node be removed from the tree (we don't prune in our datasets)
        :param can_be_leaf: Can this node be a leaf node (usually false for nodes that you are injecting manually)
        :param frozen: Should we add/prune children in the populate function (if frozen, no children will be added or removed, but the children may have children appended/pruned from them).
        """
        self.value = value
        if children is None:
            children = []
        self.children = children
        self.operator = operator
        self.fact_type = fact_type
        self.constraints = constraints
        self.deduction_type = deduction_type
        self.prunable = prunable
        self.can_be_leaf = can_be_leaf
        self.frozen = frozen
        self.parent = None
    @property
    def children(self):
        return self._children
    @children.setter
    def children(self, children: List['LogicNode']):
        self._children = children
        for c in self.children:
            c.parent = self
    def __str__(self):
        line = []
        # Constraints are plain strings (see LogicNodeConstraints), so they have no `.value`.
        cnsts = ', '.join([str(x) for x in self.constraints])
        if self.value and self.value != '':
            line.append(self.value)
        if len(self.children) > 0:
            line.append(self.operator)
        else:
            line.append(self.fact_type)
        if self.deduction_type:
            line.append(self.deduction_type)
        if len(self.constraints) > 0:
            line.append(cnsts)
        if len(self.children) > 0:
            line.append(f'children: {len(self.children)}')
        return ' | '.join(line)
    def __repr__(self):
        return str(self)
    def to_json(self):
        return {
            'value': self.value,
            'children': [x.to_json() for x in self.children],
            'fact_type': self.fact_type,
            'operator': self.operator,
            'constraints': self.constraints,
            'deduction_type': self.deduction_type,
            'prunable': self.prunable,
            'can_be_leaf': self.can_be_leaf
        }
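    @classmethod
    def from_json(cls, js):
        # Work on a shallow copy so the caller's dict is not mutated and the
        # same JSON can be parsed more than once.
        js = dict(js)
        js['children'] = [LogicNode.from_json(x) for x in js['children']]
        return cls(**js)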
 class LogicTree:
    """Main datastructure used when creating a MuSR example.
    It's basically a standard tree with some parameters controlling the shape.
    """
    nodes: List[LogicNode]
    chance_of_or: float
    chance_of_cs_fact: float
    depth: int
    chance_to_prune: float
    chance_to_prune_all: float
    bf_factor: Dict[int, float]
    deduction_type_sample_rate: Dict[LogicNodeDeductionType, float]
    root_structure: List[List[LogicNode]] = ()
    def __init__(self,
                 chance_of_or: float = 0.3,
                 chance_of_cs_fact: float = 0.1,
                 depth: int = 2,
                 chance_to_prune: float = 0.6,
                 chance_to_prune_all: float = 0.2,
                 bf_factor: Dict[int, float] = None,
                 deduction_type_sample_rate: Dict[LogicNodeDeductionType,
                                                  float] = None,
                 enforce_cs_fact_per_level: bool = False,
                 root_structure: List[Any] = (),
                 nodes: List[LogicNode] = (),
                 populate: bool = True,
                 prune: bool = True):
        """
        :param chance_of_or: (not used) how often should a node with children be an OR
        :param chance_of_cs_fact: (not used) how often should there be a commonsense node
        :param depth: How deep should a tree go
        :param chance_to_prune: Percentage chance of pruning a node
        :param chance_to_prune_all: Percentage chance of pruning all children from a node.
        :param bf_factor: Branching factor (dictionary of percentages, for example {1: 0.33, 2: 0.33, 3: 0.33}).
        :param deduction_type_sample_rate: (not used, see bf_factor and LogicNodeDeductionType)
        :param enforce_cs_fact_per_level: Enforce 1 commonsense fact per level in the tree (we use this instead of chance_of_cs_fact)
        :param root_structure: List of LogicNodes to build off of.
        :param nodes: List of LogicNodes to define the LogicTree on (we will not populate/prune the tree if this is filled)
        :param populate: Should we populate children for the tree according to the other parameters?
        :param prune: Should we prune the children for the tree according to the other parameters?
        """
        self.chance_of_or = chance_of_or
        self.chance_of_cs_fact = chance_of_cs_fact
        self.depth = depth
        self.chance_to_prune = chance_to_prune
        self.chance_to_prune_all = chance_to_prune_all
        self.bf_factor = bf_factor
        self.enforce_cs_fact_per_level = enforce_cs_fact_per_level
        if not bf_factor:
            self.bf_factor = {2: 0.8, 3: 0.2}
        if not deduction_type_sample_rate:
            deduction_type_sample_rate = {
                LogicNodeDeductionType.SYLLOGISM: 1.0
            }
        self.deduction_type_sample_rate = deduction_type_sample_rate
        self.root_structure = root_structure
        if len(nodes) > 0:
            self.nodes = nodes
        else:
            if root_structure is not None and len(root_structure) > 0:
                self.nodes = root_structure
            else:
                self.nodes = [
                    LogicNode('root', operator=LogicNodeOperatorType.AND)
                ]
            if populate:
                for x in self.nodes:
                    self.populate(x, 1)
            if prune:
                for x in self.nodes:
                    self.prune(x, 1)
    def __str__(self):
        return self.print_tree()
    def get_facts(self,
                  include_cs: bool = False,
                  include_deductions_from_level: int = -1,
                  no_facts_after_depth: int = -1):
        """Get a list of LogicNodes from the tree. By default, you will get the
        explicit leaf nodes.
        :param include_cs: Include the commonsense nodes from all levels.
        :param include_deductions_from_level: Include any intermediate deduction nodes from the specified level and deeper.
        :param no_facts_after_depth: Essentially treat the deductions at the specified depth as leaf nodes.
        """
        def recurse_facts(_node: LogicNode,
                          depth: int = 0) -> List[LogicNode]:
            node = deepcopy(_node)
            if depth >= no_facts_after_depth and no_facts_after_depth > -1:
                node.children = []
            facts = []
            if node.fact_type == LogicNodeFactType.EXPLICIT and len(
                    node.children) == 0:
                facts.append(node)
            if node.fact_type == LogicNodeFactType.COMMONSENSE and include_cs and len(
                    node.children) == 0:
                facts.append(node)
            if len(
                    node.children
            ) > 0 and include_deductions_from_level <= depth and include_deductions_from_level > -1:
                facts.append(node)
            for child in node.children:
                facts.extend(recurse_facts(child, depth + 1))
            return list(set(facts))
        facts = []
        for n in self.nodes:
            facts.extend(recurse_facts(n))
        return facts
    def print_tree(self, node=None, level=0):
        """Deprecated (not used)"""
        if node is None:
            node = self.nodes[0]
        line = '-' * level * 4 + str(node) + (' | ' + str(node.operator) if
                                              len(node.children) > 0 else '')
        for child in node.children:
            line += '\n' + self.print_tree(child, level + 1)
        return line
    def print_for_gpt(self,
                      node=None,
                      level=0,
                      pad_char=' ',
                      pad_space=4,
                      print_forward=True,
                      print_conjection_types: bool = False,
                      print_reasoning_types: bool = False,
                      ignore_value_after_depth: int = -1,
                      print_only_nodes_with_value: bool = False):
        """Complex print function.  We often use it as
        print_for_gpt(pad_space=1, pad_char='> ')
        However, more complex arguments can be used to control what is printed.
        This returns a string that must be printed (don't be confused by the method name.)
        :param node: Start at a specific node.
        :param level: Controls how much tabbing is done when printing the current node.
        :param pad_char: Char to use that specifies depth ('> ' at depth 3 will look like '> > > ' if you have pad_space equal to 1 for example)
        :param pad_space: How many spaces to include between pad_chars
        :param print_forward: Print the tree with parent nodes first.
        :param print_conjection_types: Print the Ands and Ors per deduction (not used)
        :param print_reasoning_types: Print the deduction types (not used)
        :param ignore_value_after_depth: Ignore content of the nodes once a depth is met
        :param print_only_nodes_with_value: Ignore nodes without content.
        """
        line = ''
        if node is None:
            node = self.nodes[0]
        if not print_forward:
            for child in node.children:
                v = self.print_for_gpt(
                    child,
                    level + 1,
                    pad_char=pad_char,
                    pad_space=pad_space,
                    print_forward=print_forward,
                    ignore_value_after_depth=ignore_value_after_depth,
                    print_only_nodes_with_value=print_only_nodes_with_value)
                if v != '':
                    line += v + '\n'
        ignore_val = ignore_value_after_depth > -1 and ignore_value_after_depth < level
        ignore_line = print_only_nodes_with_value and node.value == ''
        if ignore_line:
            line_val = ''
        else:
            line_val = (node.value + ' | ' if node.value != '' and not ignore_val else '') + (
                ('Fact From Story' if node.fact_type == LogicNodeFactType.EXPLICIT else 'Commonsense Knowledge') \
                    if len(node.children) == 0 else 'Deduced Fact')
            if level == 0:
                line_val = (node.value + ' | ' if node.value != '' else
                            '') + 'Deduced Root Conclusion'
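            if len(node.children) > 0 and (print_conjection_types
                                           or print_reasoning_types):
                # Collect the requested annotations so the parentheses stay
                # well-formed when only one of the two flags is set.
                annotations = []
                if print_conjection_types:
                    annotations.append(f'{node.operator}')
                if node.deduction_type and print_reasoning_types:
                    annotations.append(f'{node.deduction_type}')
                line_val += ' (' + ' | '.join(annotations) + ')'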
            if len(node.constraints) > 0:
                cnsts = ', '.join([str(x) for x in node.constraints])
                line_val += f' constraints: [{cnsts}]'
            line += pad_char * level * pad_space + line_val
        if print_forward:
            for child in node.children:
                v = self.print_for_gpt(
                    child,
                    level + 1,
                    pad_char=pad_char,
                    pad_space=pad_space,
                    print_forward=print_forward,
                    ignore_value_after_depth=ignore_value_after_depth,
                    print_only_nodes_with_value=print_only_nodes_with_value)
                if v != '':
                    line += '\n' + v
        return line
    def populate(self, node: LogicNode, current_depth: int = 1):
        if node.operator == LogicNodeOperatorType.CHOOSE:
            node.operator = LogicNodeOperatorType.OR \
                if random.random() < self.chance_of_or else LogicNodeOperatorType.AND
        if node.deduction_type == LogicNodeDeductionType.CHOOSE:
            if node.operator != LogicNodeOperatorType.AND:
                node.deduction_type = None
            else:
                node.deduction_type = random.choices(
                    list(self.deduction_type_sample_rate.keys()),
                    list(self.deduction_type_sample_rate.values()),
                    k=1)[0]
        if not node.frozen:
            bf = max(
                0,
                random.choices(list(self.bf_factor.keys()),
                               list(self.bf_factor.values()),
                               k=1)[0] - len(node.children))
            if bf > 0:
                new_nodes = []
                one_fact_is_cs = False
                for idx in range(bf):
                    roll_for_or = random.random()
                    fact_type = LogicNodeFactType.COMMONSENSE \
                        if random.random() < self.chance_of_cs_fact and not one_fact_is_cs else \
                        LogicNodeFactType.EXPLICIT
                    if roll_for_or > self.chance_of_or and\
                            current_depth < self.depth and\
                            not fact_type == LogicNodeFactType.COMMONSENSE:
                        new_nodes.append(
                            LogicNode(
                                '',
                                operator=LogicNodeOperatorType.AND,
                                fact_type=fact_type,
                                deduction_type=random.choices(
                                    list(self.deduction_type_sample_rate.keys(
                                    )),
                                    list(self.deduction_type_sample_rate.
                                         values()),
                                    k=1)[0],
                                prunable=True,
                                can_be_leaf=True,
                            ))
                    else:
                        new_nodes.append(
                            LogicNode('',
                                      operator=LogicNodeOperatorType.OR,
                                      fact_type=fact_type,
                                      prunable=True,
                                      can_be_leaf=True))
                    if fact_type == LogicNodeFactType.COMMONSENSE:
                        node.operator = LogicNodeOperatorType.AND
                        if not node.deduction_type:
                            node.deduction_type = random.choices(
                                list(self.deduction_type_sample_rate.keys()),
                                list(self.deduction_type_sample_rate.values()),
                                k=1)[0]
                        one_fact_is_cs = True
                if not one_fact_is_cs and self.enforce_cs_fact_per_level:
                    new_nodes.append(
                        LogicNode('',
                                  operator=LogicNodeOperatorType.OR,
                                  fact_type=LogicNodeFactType.COMMONSENSE,
                                  prunable=False,
                                  can_be_leaf=True))
                node.children.extend(new_nodes)
        if current_depth < self.depth:
            # Use a separate loop variable so the `node` argument is not shadowed.
            for child in node.children:
                if child.fact_type == LogicNodeFactType.COMMONSENSE:
                    continue
                self.populate(child, current_depth + 1)
    def prune(self, node: LogicNode, current_depth: int = 1):
        to_prune = []
        if current_depth > 1 and node.can_be_leaf:
            if random.random() < self.chance_to_prune_all:
                node.children = []
                return
        prunable = [x for x in node.children if x.prunable]
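        # OR nodes must keep at least one prunable child, AND nodes at least two.
        can_prune_or = (len(prunable) > 1
                        and node.operator == LogicNodeOperatorType.OR)
        can_prune_and = (len(prunable) > 2
                         and node.operator == LogicNodeOperatorType.AND)
        if (can_prune_or or can_prune_and) and current_depth <= self.depth: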
            if node.prunable:
                for n in random.sample(
                        prunable,
                        len(prunable) -
                    (1 if node.operator == LogicNodeOperatorType.OR else 2)):
                    roll_to_prune = random.random()
                    if roll_to_prune < self.chance_to_prune:
                        to_prune.append(n)
        node.children = [x for x in node.children if x not in to_prune]
        for n in node.children:
            self.prune(n, current_depth + 1)
    def to_json(self):
        args = {
            'chance_of_or': self.chance_of_or,
            'depth': self.depth,
            'chance_to_prune': self.chance_to_prune,
            'chance_to_prune_all': self.chance_to_prune_all,
            'bf_factor': self.bf_factor,
            'deduction_type_sample_rate': self.deduction_type_sample_rate,
            'root_structure': [x.to_json() for x in self.root_structure],
            'nodes': [x.to_json() for x in self.nodes]
        }
        return args
    @classmethod
    def from_json(cls, _js):
        js = deepcopy(_js)
        js['nodes'] = [LogicNode.from_json(x) for x in js['nodes']]
        js['root_structure'] = [
            LogicNode.from_json(x) for x in js['root_structure']
        ]
        return cls(**js)
if __name__ == '__main__':
    """EXAMPLE USES."""
    def tv_scene_ex():
        root_structure = [
            LogicNode('A good drama tv scene',
                      operator=LogicNodeOperatorType.OR,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True)
        ]
        root_structure[0].children = [
            LogicNode('Bob is sad.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=True,
                      can_be_leaf=False),
            LogicNode('John now hates Bob.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=True,
                      can_be_leaf=False),
            LogicNode('Bob bought a car.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=True,
                      can_be_leaf=False),
            LogicNode('Bob wanted to be happy.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=True,
                      can_be_leaf=False),
        ]
        tree = LogicTree(depth=4,
                         root_structure=root_structure,
                         bf_factor={
                             1: 0.5,
                             2: 0.5
                         },
                         chance_of_or=0.0,
                         chance_of_cs_fact=0.0,
                         chance_to_prune_all=0.5,
                         chance_to_prune=0.5,
                         enforce_cs_fact_per_level=True)
        rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
        print(rep)
    def eb_ex():
        root_structure = [
            LogicNode('',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=False)
        ]
        n = LogicNode('Eruptions block sunlight.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True)
        n.children = [
            LogicNode('Eruptions produce ash clouds.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=True,
                      frozen=True),
            LogicNode('Ash blocks sunlight.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=True,
                      frozen=True),
        ]
        g = LogicNode('Eruptions can cause plants to die.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=True,
                      can_be_leaf=False,
                      frozen=True)
        g.children = [
            n,
            LogicNode('Producers will die without sunlight.',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=True,
                      frozen=True)
        ]
        l = LogicNode('',
                      operator=LogicNodeOperatorType.AND,
                      prunable=False,
                      can_be_leaf=False)
        l.children = [g]
        root_structure[0].children = [l]
        tree = LogicTree(depth=5,
                         root_structure=root_structure,
                         bf_factor={
                             1: 0.3,
                             2: 0.7
                         },
                         chance_of_or=0.0,
                         chance_of_cs_fact=0.0,
                         chance_to_prune_all=0.0,
                         chance_to_prune=0.0,
                         enforce_cs_fact_per_level=True)
        rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
        print(rep)
    def murder_mystery_ex():
        root_structure = [
            LogicNode('Killer',
                      operator=LogicNodeOperatorType.OR,
                      constraints=[LogicNodeConstraints.ONLY_ONE_CAN_BE_TRUE],
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True)
        ]
        suspect_nodes = [
            LogicNode(f'Murderer Suspect {idx + 1}',
                      operator=LogicNodeOperatorType.AND,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True) for idx in range(1)
        ]
        for s in suspect_nodes:
            s.children = [
                LogicNode('Suspect has means',
                          operator=LogicNodeOperatorType.CHOOSE,
                          prunable=True,
                          can_be_leaf=False),
                LogicNode('Suspect has motive',
                          operator=LogicNodeOperatorType.CHOOSE,
                          prunable=True,
                          can_be_leaf=False),
                LogicNode('Suspect has opportunity',
                          operator=LogicNodeOperatorType.CHOOSE,
                          prunable=True,
                          can_be_leaf=False)
            ]
        root_structure[0].children = suspect_nodes
        tree = LogicTree(depth=4,
                         root_structure=root_structure,
                         bf_factor={
                             1: 0.5,
                             2: 0.5
                         },
                         chance_of_or=0.0,
                         chance_of_cs_fact=0.0,
                         chance_to_prune_all=0.5,
                         chance_to_prune=0.5,
                         enforce_cs_fact_per_level=True)
        rep = tree.print_for_gpt(pad_space=1, pad_char='> ')
        print(rep)
    def action_ex():
        root_structure = [
            LogicNode('Take an action',
                      operator=LogicNodeOperatorType.OR,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True)
        ]
        root_structure[0].children = [
            LogicNode('Run away',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True),
            LogicNode('Fight back',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True),
            LogicNode('Hide',
                      operator=LogicNodeOperatorType.CHOOSE,
                      prunable=False,
                      can_be_leaf=False,
                      frozen=True),
        ]
        for cidx, c in enumerate(root_structure[0].children):
            nfacts = random.randint(2, 4)
            for _ in range(nfacts):
                fact = LogicNode('',
                                 operator=LogicNodeOperatorType.CHOOSE,
                                 prunable=False,
                                 can_be_leaf=False,
                                 frozen=True)
                fact.children = [
                    LogicNode('Pro (supporting the parent action)',
                              operator=LogicNodeOperatorType.CHOOSE,
                              prunable=True,
                              can_be_leaf=False,
                              frozen=False),
                    LogicNode('Con (counters the sibling Pro only)',
                              operator=LogicNodeOperatorType.CHOOSE,
                              prunable=True,
                              can_be_leaf=False,
                              frozen=False)
                ]
                root_structure[0].children[cidx].children.append(fact)
        tree = LogicTree(depth=4,
                         root_structure=root_structure,
                         bf_factor={
                             1: 0.25,
                             2: 0.5,
                             3: 0.25
                         },
                         chance_of_or=0.0,
                         chance_of_cs_fact=0.0,
                         chance_to_prune_all=0.5,
                         chance_to_prune=0.75,
                         enforce_cs_fact_per_level=True)
        rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
        print(rep)
    tv_scene_ex()
    eb_ex()
    action_ex()
--- a/opencompass/utils/datasets_info.py
+++ b/opencompass/utils/datasets_info.py
@ -327,6 +327,11 @@ DATASETS_MAPPING = {
        "hf_id": "",
        "local": "./data/mmmlu_lite",
    },
    "opencompass/musr": {
        "ms_id": "",
        "hf_id": "",
        "local": "./data/musr",
    },
    "opencompass/babilong": {
        "ms_id": "",
        "hf_id": "",
@ -335,6 +340,10 @@ DATASETS_MAPPING = {
 }
 DATASETS_URL = {
    "/musr": {
        "url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
        "md5": "7447d2a5bec4586035196102135e2af9",
    },
    "/mmlu/": {
        "url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
        "md5": "761310671509a239e41c4b717f7fab9c",