[Feature] MuSR Dataset Evaluation (#1689)

* MuSR Dataset Evaluation

Add an assertion and a Readme.md
abrohamLee 2024-11-14 20:42:12 +08:00 committed by GitHub
parent d415439f9b
commit e9e4b69ddb
13 changed files with 1539 additions and 0 deletions

@@ -57,6 +57,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](configs/eval_musr.py) and give it a spin! 🔥🔥🔥
- **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](configs/eval_babilong.py) and give it a try! 🔥🔥🔥
- **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
- **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) with multiple backend(huggingface/vllm/lmdeploy). Feel free to give them a try! 🔥🔥🔥

configs/eval_musr.py

@@ -0,0 +1,44 @@
from mmengine.config import read_base
import os.path as osp

with read_base():
    from opencompass.configs.datasets.musr.musr_gen import musr_datasets
    # from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import (
        models as lmdeploy_internlm2_5_7b_chat_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
        models as lmdeploy_qwen2_5_7b_instruct_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
        models as lmdeploy_qwen2_5_14b_instruct_model,
    )
    from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import (
        models as lmdeploy_yi_1_5_9b_chat_model,
    )
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import (
        models as lmdeploy_qwen2_5_32b_instruct_model,
    )
    from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import (
        models as lmdeploy_glm4_9b_chat_model,
    )
    from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import (
        models as lmdeploy_llama3_1_8b_instruct_model,
    )
    from opencompass.configs.models.mistral.lmdeploy_ministral_8b_instruct_2410 import (
        models as lmdeploy_ministral_8b_instruct_2410_model,
    )
    from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import (
        models as lmdeploy_gemma_9b_it_model,
    )
    from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import (
        models as lmdeploy_gemma_27b_it_model,
    )
    from opencompass.configs.summarizers.groups.musr_average import summarizer

datasets = [*musr_datasets]
# Collect every imported `*_model` list into one flat list of model configs.
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])

base_exp_dir = 'outputs/musr/'
work_dir = osp.join(base_exp_dir, 'musr_eval')
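
The `models = sum(..., [])` line in the config above relies on `sum` with a list start value to concatenate lists. The following is a minimal, self-contained sketch of that idiom. The names `collect_models`, `namespace`, and the `model-a`/`model-b` entries are hypothetical stand-ins for the imported `models` lists, and an explicit dict replaces `locals()` so the snippet behaves the same wherever it is run.

```python
def collect_models(namespace):
    """Concatenate every list stored under a key ending in '_model'.

    `sum(list_of_lists, [])` starts from an empty list and adds each
    sub-list in turn, which flattens one level of nesting.
    """
    return sum([v for k, v in namespace.items() if k.endswith('_model')], [])


# Hypothetical stand-ins for the `models` lists imported in the config.
namespace = {
    'lmdeploy_a_model': [dict(abbr='model-a')],
    'lmdeploy_b_model': [dict(abbr='model-b'), dict(abbr='model-b-variant')],
    'summarizer': dict(dataset_abbrs=[]),  # ignored: key lacks the suffix
}

models = collect_models(namespace)
print([m['abbr'] for m in models])
```

Because selection is purely name-based, an import aliased without the `_model` suffix would be skipped silently.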

@@ -0,0 +1,75 @@
# MuSR: Multistep Soft Reasoning Dataset
MuSR (Multistep Soft Reasoning) is a dataset designed to evaluate large language models (LLMs) on complex reasoning tasks embedded in natural language narratives. Created to challenge state-of-the-art models such as GPT-4, MuSR emphasizes nuanced reasoning across several domains, including social and physical reasoning, commonsense reasoning, and planning. Its tasks are framed within realistic scenarios: murder mysteries, object placements, and team allocations.
## Overview
### Purpose
Current large language models can perform complex tasks through prompting techniques like chain-of-thought reasoning. However, robust multistep reasoning remains challenging. MuSR probes this gap by evaluating LLM performance on multistep reasoning tasks in three domains:
- **Murder Mysteries**: Requires social and physical deductive reasoning.
- **Object Placements**: Tests observational and theory-of-mind reasoning.
- **Team Allocations**: Focuses on social reasoning and constraint satisfaction.
### Dataset Construction
MuSR instances are generated using a neurosymbolic synthetic-to-natural narrative generation algorithm. This approach allows for the creation of complex reasoning instances that combine structured reasoning trees with natural language narratives, challenging both direct and nuanced inference capabilities in LLMs.
MuSR's dataset consists of:
- **Murder Mysteries**: Scenarios with suspects, motives, and opportunities requiring deductive inference.
- **Object Placements**: Scenarios where individuals' observations inform reasoning about object locations.
- **Team Allocations**: Scenarios that simulate social relationships and teamwork for optimal task assignments.
### Dataset Access
The MuSR dataset is publicly available, with instructions provided in the [GitHub project](https://github.com/Zayne-Sprague/MuSR). You can download the dataset and use pre-defined prompts or create your own configurations.
### Evaluation
1. Install dependencies and configure the environment.
2. Run evaluations using `opencompass configs/eval_musr.py` to assess LLM performance.
3. Analyze results against human performance benchmarks.
### Example Command
```bash
opencompass configs/eval_musr.py
```
## Baselines and Results
MuSR includes baseline results for multiple LLMs evaluated with chain-of-thought and advanced reasoning strategies. These benchmarks assess model accuracy on reasoning tasks across the three domains.
| Domain | Baseline Accuracy (GPT-4) | Human Performance |
|------------------|---------------------------|--------------------|
| Murder Mystery | 80.4% | 94.1% |
| Object Placement | 60.9% | 95.0% |
| Team Allocation | 68.4% | 100% |

Results obtained with OpenCompass (`configs/eval_musr.py`):

| dataset | version | metric | mode | internlm2_5-7b-chat-turbomind | qwen2.5-7b-instruct-turbomind | qwen2.5-14b-instruct-turbomind | yi-1.5-9b-chat-turbomind | qwen2.5-32b-instruct-turbomind | glm-4-9b-chat-turbomind | llama-3_1-8b-instruct-turbomind | ministral-8B-instruct-2410-turbomind | gemma-2-9b-it-turbomind | gemma-2-27b-it-turbomind |
|----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----|
| musr_murder_mysteries | a5ce30 | accuracy | gen | 59.20 | 63.20 | 76.00 | 68.80 | 78.80 | 71.20 | 73.60 | 73.60 | 74.80 | 77.20 |
| musr_object_placements | a5ce30 | accuracy | gen | 54.69 | 56.25 | 57.42 | 52.73 | 66.02 | 49.22 | 57.42 | 60.94 | 60.94 | 62.11 |
| musr_team_allocation | a5ce30 | accuracy | gen | 39.20 | 32.40 | 55.60 | 40.00 | 67.60 | 50.40 | 46.00 | 36.40 | 40.80 | 41.20 |
| musr_average | - | naive_average | gen | 51.03 | 50.62 | 63.01 | 53.84 | 70.81 | 56.94 | 59.01 | 56.98 | 58.85 | 60.17 |
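
The `musr_average` row appears to be the unweighted mean of the three subset accuracies (`naive_average`). As a quick check, the sketch below recomputes it for two columns of the table. The `naive_average` helper is written here for illustration and is not OpenCompass code.

```python
# Per-subset accuracies copied from the table above, in the order
# murder_mysteries, object_placements, team_allocation.
scores = {
    'internlm2_5-7b-chat-turbomind': [59.20, 54.69, 39.20],
    'qwen2.5-32b-instruct-turbomind': [78.80, 66.02, 67.60],
}


def naive_average(values):
    """Unweighted mean, rounded to two decimals like the reported table."""
    return round(sum(values) / len(values), 2)


averages = {model: naive_average(vals) for model, vals in scores.items()}
for model, avg in averages.items():
    print(f'{model}: {avg:.2f}')
```

Each subset carries equal weight regardless of how many questions it contains.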
## Citation
If you use MuSR in your research, please cite:
```bibtex
@misc{sprague2024musrtestinglimitschainofthought,
title={MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning},
author={Zayne Sprague and Xi Ye and Kaj Bostrom and Swarat Chaudhuri and Greg Durrett},
year={2024},
eprint={2310.16049},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2310.16049},
}
```
## Details
For further details, please refer to the MuSR paper [here](https://arxiv.org/abs/2310.16049).

@@ -0,0 +1,135 @@
from opencompass.datasets import MusrDataset, MusrEvaluator
from opencompass.openicl import PromptTemplate, ZeroRetriever, GenInferencer


DATASET_CONFIGS = {
    'murder_mysteries': {
        'abbr': 'musr_murder_mysteries',
        'name': 'murder_mysteries',
        'path': 'opencompass/musr',
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
    'object_placements': {
        'abbr': 'musr_object_placements',
        'name': 'object_placements',
        'path': 'opencompass/musr',
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
    'team_allocation': {
        'abbr': 'musr_team_allocation',
        'name': 'team_allocation',
        'path': 'opencompass/musr',
        'reader_cfg': dict(
            input_columns=['context', 'question_text', 'question', 'answer', 'choices', 'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt', 'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n', 'ablation_name'],
            output_column='gold_answer',
        ),
        'infer_cfg': dict(
            prompt_template=dict(
                type=PromptTemplate,
                template=dict(
                    begin=[
                        dict(
                            role='SYSTEM',
                            fallback_role='HUMAN',
                            prompt='{system_prompt}'
                        )
                    ],
                    round=[
                        dict(
                            role='HUMAN',
                            prompt='{prompt}'
                        ),
                    ]
                ),
            ),
            retriever=dict(type=ZeroRetriever),
            inferencer=dict(type=GenInferencer, max_out_len=512),
        ),
        'eval_cfg': dict(
            evaluator=dict(
                type=MusrEvaluator,
                answer_index_modifier=1,
                self_consistency_n=1
            ),
        ),
    },
}


musr_datasets = []

for config in DATASET_CONFIGS.values():
    dataset = dict(
        abbr=config['abbr'],
        type=MusrDataset,
        path=config['path'],
        name=config['name'],
        reader_cfg=config['reader_cfg'],
        infer_cfg=config['infer_cfg'],
        eval_cfg=config['eval_cfg'],
    )
    musr_datasets.append(dataset)
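
The three entries in `DATASET_CONFIGS` above differ only in their subset name, so the same list could be produced by a small factory. This is a sketch under stated assumptions, not the committed config. `make_musr_config` is a hypothetical helper, and the `type` values are plain strings so the snippet runs without OpenCompass installed, whereas the real config passes the imported classes.

```python
INPUT_COLUMNS = [
    'context', 'question_text', 'question', 'answer', 'choices',
    'choices_str', 'intermediate_trees', 'intermediate_data', 'prompt',
    'system_prompt', 'gold_answer', 'scidx', 'self_consistency_n',
    'ablation_name',
]


def make_musr_config(name, max_out_len=512):
    """Build one dataset config dict for the given MuSR subset name."""
    return dict(
        abbr=f'musr_{name}',
        type='MusrDataset',  # placeholder string, see lead-in
        path='opencompass/musr',
        name=name,
        reader_cfg=dict(
            input_columns=list(INPUT_COLUMNS),  # copy to avoid shared state
            output_column='gold_answer',
        ),
        infer_cfg=dict(
            prompt_template=dict(
                type='PromptTemplate',
                template=dict(
                    begin=[dict(role='SYSTEM', fallback_role='HUMAN',
                                prompt='{system_prompt}')],
                    round=[dict(role='HUMAN', prompt='{prompt}')],
                ),
            ),
            retriever=dict(type='ZeroRetriever'),
            inferencer=dict(type='GenInferencer', max_out_len=max_out_len),
        ),
        eval_cfg=dict(
            evaluator=dict(type='MusrEvaluator',
                           answer_index_modifier=1,
                           self_consistency_n=1),
        ),
    )


SUBSETS = ('murder_mysteries', 'object_placements', 'team_allocation')
musr_datasets = [make_musr_config(name) for name in SUBSETS]
print([d['abbr'] for d in musr_datasets])
```

The explicit per-subset dictionaries in the commit are more verbose but let each subset diverge later (for example, a different `max_out_len`) without touching shared code.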

@@ -0,0 +1,19 @@
summarizer = dict(
    dataset_abbrs=[
        'musr_murder_mysteries',
        'musr_object_placements',
        'musr_team_allocation',
        'musr_average'
    ],
    summary_groups=[
        {
            'name': 'musr_average',
            'subsets': [
                'musr_murder_mysteries',
                'musr_object_placements',
                'musr_team_allocation',
            ],
        }
    ],
)

@@ -87,6 +87,7 @@ from .mmlu_pro import * # noqa: F401, F403
from .MMLUArabic import * # noqa: F401, F403
from .mmmlu import * # noqa: F401, F403
from .multirc import * # noqa: F401, F403
from .musr import * # noqa: F401, F403
from .narrativeqa import * # noqa: F401, F403
from .natural_question import * # noqa: F401, F403
from .natural_question_cn import * # noqa: F401, F403

@@ -0,0 +1 @@
from .musr import * # noqa: F401, F403

@@ -0,0 +1,81 @@
# flake8: noqa: E501
story = """
In the smoke-filled haze of a thriving jazz club, Alice met her explosive end, leaving Detective Winston to sift through the suspects: Eugene, the shy pianist, and Gabrielle, the sassy club singer.
While seated at his desk at the precinct, Winston received a phone call from a certain observant local bartender, tipping off the police about a harsh row peaking in a nearby jazz club. He signaled to his partner as they promptly dispatched to the scene, already ringing with sirens and a restless crowd.
With the police line restraining the horde, the jazz club was undergoing a full round-up as Winston approached the informative bartender. The bartender was engrossed in his account to the officers about a raucous, punch throwing fight Eugene was part of, to his best recollection. Winston remembered Eugene, a jazz fanatic lurking around the jazz corners more often than anyone else could recount.
In the heart of the upheaval, lay a woman sprawled on the floor, later identified as Alice, a frequent face at the jazz scene and a financial analyst deeply engrossed in financial transactions. In public, Alice had made her concerns known about her discovery of fraudulent transactions at the bank, promising to report the same to the authorities. Eugene, remembered conspicuously for being a bank teller at the same bank Alice worked at, suddenly seemed closely linked.
Eugene's arrest was far from hushed, with the local news broadcasting the progressing drama live, catching sight of Eugene curtailed in handcuffs. Concurrently, it was ascertained that Eugene was a member of the jazz club. This evidence was backed by a jazz club membership card retrieved from his wallet during the arrest.
Just a few steps away, he noticed a man in a suit, the bouncer, a calm figure amid the bedlam. In their conversation, the bouncer corroborated that he had indeed seen Eugene involved in a heated scuffle, landing a few punches. The whisperings were starting to gain momentum, since Eugene was believed to be on the losing end of a lawsuit, a battle courtesy of Alice charging Eugene with the financial fraud she had publicly vowed to expose.
Eugene was known for his frequent presence at the jazz club and on top of that, was an actual member. Therefore, it was hardly a leap to presume Alice meeting her untimely end at the club was no mere happenstance. The jazz club, despite its dim lights and pulsating music, was a public place easily accessible to all, including potential suspects like Eugene and, sadly, the ill-starred Alice.
Det. Winston knew he was now tasked with a cryptic puzzle. A bank teller, embroiled in suspected fraud and a lawsuit, a jazz club murder scene and a local financial analyst, all woven into a ghastly murder mystery. He sighed in distaste as Eugene was escorted away, a man still oblivious to the chain of events waiting for him. But Winston knew, the night had only just begun for him.
Winston stared down at the crumpled microphone on the floor. He picked it up gingerly, turning it in his hand. The club was in disarray, debris scattered like confetti. The lab boys were still picking pieces of the grenade apart.
"Gabrielle's microphone," the coroner confirmed, barely looking up from his task.
"Give him the once-over for evidence," Winston said, handing the microphone to a nearby officer.
Leaving the club behind him, Winston sighed heavily. The world of jazz had taken a dark turn that night. Alice, the acclaimed critic with her sarcastic wit and keen critical eye, had been last seen alive here. Her purse lay in the club untouched, a testament to the abruptness of the event.
Gabrielle had been working as a war correspondent. Winston had read her articles. They were richly detailed, passionate, and highlighted the harsh reality of war zones. Gabrielle hadn't been shy about sharing her experiences or publicly criticizing the military in her pieces. She boldly interviewed military personnel and spent extended periods in conflict zones.
Alice, though, never missed a chance to pick apart Gabrielle's articles. The vitriolic snippets in Alice's column were regular features and Gabrielle's staunch defense of her articles, her work in the jazz scene, did little against Alice's respected reputation.
The tension between them was palpable. Alice had been awarded a major journalist award that Gabrielle had desired. This only deepened their rivalry, with Gabrielle feeling overlooked for this recognition in the Jazz scene.
Winston cast his gaze over the club once more, a hub of pulsating rhythms now eerily silent.
A significant part of the evening was Gabrielle's recorded interview with Alice. It played on the local radio, their professional rivalry subtly echoing under their professional demeanor.
With a deep breath, Winston knew he had a tall task ahead. The jazz club, where Alice was last seen alive was now shrouded in an eerie silence, the vibrant rhythms of what used to be a lively night echoing in the abandoned stage. It was up to him to piece together the missing notes and bring the symphony of this unsolved case to a satisfying finale.
Who is the most likely murderer?
Pick one of the following choices:
1 - Eugene
2 - Gabrielle
You must pick one option. Before selecting a choice, explain your reasoning step by step. The murderer needs to have a means (access to weapon), motive (reason to kill the victim), and opportunity (access to crime scene) in order to have killed the victim. Innocent suspects may have two of these proven, but not all three. An innocent suspect may be suspicious for some other reason, but they will not have all of motive, means, and opportunity established.
If you believe that both suspects have motive, means, and opportunity, you should make an educated guess and pick the one for whom these are best established. If you believe that neither suspect has all three established, then choose the suspect where these are most clearly established. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
""".strip()
reasoning = """
Let's break this down step-by-step by first deducing which of the two suspects has a means, motive, and opportunity.
We will start with Eugene.
Eugene was being sued by Alice for fraudulent transactions. The charge was also very public. Both of these facts point to Eugene having a strong motive.
Because Eugene has a jazz club membership, and we can deduce that the jazz club membership belongs to the same club Alice was murdered in, we can assume Eugene has an opportunity to commit the crime.
Although we know Eugene is aggressive because he was throwing punches in the story, we do not know if he has access to the murder weapon. Because he does not have access to a grenade, he does not have a means.
Let's review Gabrielle next.
Gabrielle's purse was found at the scene of the crime, and we can then assume she had the opportunity to kill Alice.
Because Gabrielle has been in conflict zones with military personnel, it's possible that she has access to a grenade. We can say that Gabrielle has a potential means to kill the victim.
Finally, it appears that Gabrielle and Alice had a rivalry over journalism, which could have boiled up into physical action. Because of this, we can say that Gabrielle has a potential motive to kill the victim.
Now, reviewing the evidence, we see that:
Eugene has a motive and opportunity but no means.
Gabrielle has a motive, means, and opportunity.
Therefore, Gabrielle is the most likely murderer.
ANSWER: 2
""".strip()
murder_mystery_solved_ex = f'{story}\n\n{reasoning}'

@@ -0,0 +1,309 @@
# flake8: noqa: E501
import json
import os.path as osp

from datasets import Dataset

from opencompass.datasets.base import BaseDataset
from opencompass.openicl import BaseEvaluator
from opencompass.registry import ICL_EVALUATORS, LOAD_DATASET
from opencompass.utils import get_data_path

from .murder_mystery_solved_ex import murder_mystery_solved_ex
from .object_placements_solved_ex import object_placements_solved_ex
from .team_allocation_solved_ex import team_allocation_solved_ex
from .tree import LogicTree

DATASET_CONFIGS = {
    'murder_mysteries': {
        'file_name': 'murder_mysteries.json',
        'ex': murder_mystery_solved_ex,  # write user example here
        'system_prompt': 'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('Before selecting a choice, explain your reasoning step by step. '
         'The murderer needs to have a means (access to weapon), motive (reason to kill the victim), '
         'and opportunity (access to crime scene) in order to have killed the victim. '
         'Innocent suspects may have two of these proven, but not all three. '
         'An innocent suspect may be suspicious for some other reason, but they will not have all of motive, '
         'means, and opportunity established.\n\n'
         'If you believe that both suspects have motive, means, and opportunity, you should make an educated guess '
         'and pick the one for whom these are best established. If you believe that neither suspect has all '
         'three established, then choose the suspect where these are most clearly established.'
         ),
        'hint_before_question': False,
        'answer_index_modifier': 1
    },
    'object_placements': {
        'file_name': 'object_placements.json',
        'ex': object_placements_solved_ex,
        'skip_ablated': True,
        'ablation_depth_modifier': 2,
        'system_prompt': 'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('Based on this story, we want to identify where someone believes that a certain object is at the end of '
         'the story. In order to do that, you need to read the story and keep track of where they think the object '
         'is at each point. When an object is moved, the person may observe its new location if they saw it move.\n\n'
         'To see where an object ends up, they must be able to see the location that it moves to and not be too '
         'distracted by what they are doing. If they do not observe the object moving, then they will still believe '
         'it to be in the last location where they observed it.'),
        'hint_before_question': True,
        'answer_index_modifier': 1
    },
    'team_allocation': {
        'file_name': 'team_allocation.json',
        'ex': team_allocation_solved_ex,
        'system_prompt': 'You are a helpful assistant that will answer the questions given by the user.',
        'hint':
        ('The story should allow you to determine how good each person is at a skill. Roughly, each person is '
         'either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks '
         'that uses their skills as well as possible. In addition, one task will have to have two people assigned '
         'to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the '
         'overall quality of the assignment.\n\n'
         'When two people need to work on a task and one is bad at it, they don\'t necessarily benefit from the '
         'other person being good, unless they work well together.\n\n'
         'With different strengths, weaknesses, and interpersonal dynamics at play, you should allocate your team '
         'to find the single assignment to ensure that the tasks overall are completed as effectively as possible.'
         ),
        'hint_before_question': False,
        'answer_index_modifier': 1
    }
}
@LOAD_DATASET.register_module()
class MusrDataset(BaseDataset):
    """MuSR.

    Args:
        path (str): path to dataset
        name (str): name of dataset
        self_consistency_n (int)
        exclude_contrastive_examples (bool): Whether to exclude contrastive examples
        reverse_contrastive_sample (bool): Whether to reverse the selection of contrastive samples
        skip_ablated (bool): Whether to skip ablated samples
        offset (int): Starting offset for the dataset
        sample_size (int): Sample size, None indicates using the entire dataset.
    """

    @staticmethod
    def load(path,
             name,
             self_consistency_n=1,
             exclude_contrastive_examples=False,
             reverse_contrastive_sample=False,
             skip_ablated=False,
             randomize=False,
             offset=0,
             sample_size=None,
             **kwargs):
        """Load the dataset and flatten fields while constructing prompts,
        taking self_consistency_n and ablations into account."""

        if name not in DATASET_CONFIGS:
            raise ValueError(
                f'Dataset name {name} not supported. Must be one of {list(DATASET_CONFIGS.keys())}'
            )

        config = DATASET_CONFIGS[name]
        path = get_data_path(path)
        file_path = osp.join(path, config['file_name'])

        with open(file_path, 'r', encoding='utf-8') as f:
            dataset = json.load(f)

        filtered_dataset = []
        hashes_done = []

        for example in dataset:
            if exclude_contrastive_examples and example['questions'][0].get('intermediate_data') and \
                    len(example['questions'][0].get('intermediate_data')) > 0 and \
                    example['questions'][0]['intermediate_data'][0].get('story_hash_id'):
                story_hash = example['questions'][0]['intermediate_data'][0][
                    'story_hash_id']
                if story_hash in hashes_done:
                    if reverse_contrastive_sample:
                        filtered_dataset.append(example)
                    else:
                        continue
                elif not reverse_contrastive_sample:
                    filtered_dataset.append(example)
                hashes_done.append(story_hash)
            else:
                filtered_dataset.append(example)

        filtered_dataset = filtered_dataset[
            offset:offset +
            min(len(filtered_dataset), sample_size) if sample_size else None]

        ablations = [
            # {'prompt': 'regular', 'name': 'regular'},
            # {'prompt': 'cot', 'name': 'cot'},
            {
                'prompt': 'cot+',
                'name': 'cot+'
            },
        ]

        # create prompts
        flattened_data = []
        for example in filtered_dataset:
            context = example['context']
            questions = example['questions']

            for question in questions:
                choices_list = question['choices']
                choices_str = '\n'.join([
                    f'{idx + 1} - {choice}'
                    for idx, choice in enumerate(choices_list)
                ])
                gold_answer = question['answer'] + config.get(
                    'answer_index_modifier', 1)

                for ablation in ablations:
                    prompt_style = ablation.get('prompt', 'cot+')
                    ablation_name = ablation.get('name', 'cot+')

                    for scidx in range(self_consistency_n):
                        ex_str = ''
                        if ablation.get('use_example') and config.get('ex'):
                            ex_str = (
                                'Here is an example of solving the task:\n\n' +
                                config.get('ex') +
                                '\n\nThis is the end of the example. The real task is below.\n\n---\n\n'
                            )

                        if prompt_style == 'regular':
                            prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                     f'Pick one of the following choices:\n{choices_str}\n\n' \
                                     'You must pick one option. Finally, the last thing you generate should be "ANSWER: (your answer here, include the choice number)"'
                        elif prompt_style == 'cot':
                            prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                     f'Pick one of the following choices:\n{choices_str}\n\n' \
                                     'You must pick one option. Explain your reasoning step by step before you answer. ' \
                                     'Finally, the last thing you generate should be "ANSWER: (your answer here, include the choice number)"'
                        elif prompt_style == 'cot+':
                            if config.get('hint_before_question'):
                                prompt = f'{ex_str}{context}\n\n{config["hint"]}\n\n{question["question"]}\n\n' \
                                         f'Pick one of the following choices:\n{choices_str}\n\n' \
                                         'You must pick one option. Explain your reasoning step by step before you answer. ' \
                                         'Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"'
                            else:
                                prompt = f'{ex_str}{context}\n\n{question["question"]}\n\n' \
                                         f'Pick one of the following choices:\n{choices_str}\n\n' \
                                         f'You must pick one option. {config["hint"]} Explain your reasoning step by step before you answer. ' \
                                         'Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"'
                        else:
                            if len(question['intermediate_trees']
                                   ) == 0 or config.get('skip_ablated', False):
                                continue

                            prompt = f'{ex_str}Answer the following questions given the list of facts per answer choice.\n\n'
                            for c, t in zip(choices_str.split('\n'),
                                            question['intermediate_trees']):
                                # extract facts from intermediate_trees
                                facts = list(
                                    set([
                                        x.value for x in
                                        LogicTree.from_json(t).get_facts(
                                            include_cs=ablation.get(
                                                'include_cs', False),
                                            include_deductions_from_level=-1,
                                            no_facts_after_depth=ablation.get(
                                                'no_facts_after_depth', 3) +
                                            config.get(
                                                'ablation_depth_modifier', 0))
                                    ]))
                                if config.get('allow_sorted_facts', True):
                                    facts = sorted(facts)
                                facts_str = '\n'.join(
                                    [f'- {fact}' for fact in facts])
                                prompt += f'Facts for Choice {c}:\n{facts_str}\n\n'
                            prompt += f'Given the list of facts per answer choice, answer the following question\n\n' \
                                      f'{question["question"]}\n\n' \
                                      f'Pick one of the following choices:\n{choices_str}\n\n' \
                                      'You must pick one option. After you have found the answer, say it in this format "ANSWER: (your answer here, include the choice number)"'

                        flattened_example = {
                            'context': context,
                            'question_text': question['question'],
                            'question': question,
                            'answer': question['answer'],
                            'choices': choices_list,
                            'choices_str': choices_str,
                            'intermediate_trees': question.get('intermediate_trees', []),
                            'intermediate_data': question.get('intermediate_data', []),
                            'prompt': prompt,
                            'system_prompt': config.get('system_prompt', ''),
                            'gold_answer': gold_answer,
                            'scidx': scidx,  # self-consistency index
                            'self_consistency_n': self_consistency_n,
                            'ablation_name': ablation_name,
                        }
                        flattened_data.append(flattened_example)

        dataset = Dataset.from_list(flattened_data)

        return dataset
@ICL_EVALUATORS.register_module()
class MusrEvaluator(BaseEvaluator):

    def __init__(self, answer_index_modifier=1, self_consistency_n=1):
        self.answer_index_modifier = answer_index_modifier
        self.self_consistency_n = self_consistency_n

    def score(self, predictions, references):
        correct = 0
        assert len(predictions) == len(
            references
        ), 'Predictions and references must have the same length!'

        total = len(predictions)

        for pred, ref in zip(predictions, references):
            if 'ANSWER:' in pred:
                answer_line = [
                    line for line in pred.split('\n') if 'ANSWER:' in line
                ]
                if answer_line:
                    answer = answer_line[0].split('ANSWER:')[-1].strip()
                    import re
                    match = re.search(r'\d+', answer)
                    if match:
                        pred_answer = int(match.group())
                        if pred_answer == ref:
                            correct += 1
        accuracy = 100 * correct / total if total > 0 else 0
        return {'accuracy': accuracy}
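
The loader numbers choices from 1 and stores `gold_answer` as the zero-based index plus `answer_index_modifier`, and the evaluator then reads the first integer after `ANSWER:` on the first matching line. The standalone sketch below mirrors that round trip without importing OpenCompass. The helper names `format_choices`, `extract_answer`, and `accuracy` are illustrative and do not exist in the commit.

```python
import re


def format_choices(choices):
    """Render choices as '1 - A', '2 - B', ... (1-based numbering)."""
    return '\n'.join(f'{i + 1} - {c}' for i, c in enumerate(choices))


def extract_answer(prediction):
    """Return the first integer after 'ANSWER:' on the first line that
    contains it, or None when no such line or integer exists."""
    for line in prediction.split('\n'):
        if 'ANSWER:' in line:
            match = re.search(r'\d+', line.split('ANSWER:')[-1])
            return int(match.group()) if match else None
    return None


def accuracy(predictions, references):
    """Percentage of predictions whose extracted answer equals the reference."""
    assert len(predictions) == len(references)
    if not predictions:
        return 0
    correct = sum(extract_answer(p) == r
                  for p, r in zip(predictions, references))
    return 100 * correct / len(predictions)


choices = ['Eugene', 'Gabrielle']
zero_based_answer = 1          # index of 'Gabrielle' in the raw data
gold_answer = zero_based_answer + 1  # answer_index_modifier = 1

print(format_choices(choices))
print(extract_answer('Gabrielle has means, motive, opportunity.\nANSWER: 2'))
print(accuracy(['ANSWER: 2', 'I am not sure.'], [gold_answer, gold_answer]))
```

A response that never emits `ANSWER:` counts as wrong rather than raising an error, so formatting failures lower the reported accuracy.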

@@ -0,0 +1,53 @@
# flake8: noqa: E501
story = '''
Petra, the dedicated housewife, felt a thrill at the thought of her surprise anniversary dinner for her husband, Daniel. She had been skillfully maneuvering around Daniel's eagerness to pitch in without disappointing him or giving up her surprise.
Daniel, ever-the-observant-husband, noted Petra's unusual enthusiasm about the day's menu. Despite not knowing the details, he appreciated her effort and wanted to help; silently, he decided to deploy his best skill: patiently awaiting his moment to help, maybe when Petra asked for something from the pantry. Amidst the excitement, there was Clara, their maid, ever diligent and efficient, trying to keep the surroundings perfect for this special evening.
Tucked away, under the counter, was Petra's secret recipe book, her culinary treasure. Her solace in confusing times, her secret weapon during their flavorful adventures. While toward the back of the pantry, was the glass jar of Petra's favorite spice blends, something that Daniel was well aware of, in case an opportunity arose for him to assist or distract when Petra might need it.
All three residents of the home were aware of each item's location. The secret recipe book under the counter, the glass jar in the pantry, and the anxious excitement that filled the air—a fragrance even more intoxicating than the delicious smells that would soon fill the kitchen.
With tact and secrecy, Petra relocated her cherished recipe book from its hidden spot under the counter to its temporary home on the kitchen table. The pages were swiftly opened to reveal her secret recipes which she was eager to start preparing for the long-awaited anniversary surprise. While Petra was engrossed in her preparations, Clara continued her sweeping routine in the kitchen. Clara's steady broom strokes on the wooden floor echoed a little in the otherwise busy and humming kitchen. In the background, beyond the kitchen door, Daniel could be seen in the dining room, meticulously setting the table for the anticipated special dinner.
The placement of the rooms allowed Clara to easily notice Petra's movements in her peripheral vision while she was executing her chores. Every move Petra made was observed in Clara's line of sight. Simultaneously, separated by the walls, Daniel was diligently arranging the tableware in the dining room which was separate from Petra's bustling kitchen.
Hoping to spruce up the setting, Daniel delicately relocated a glass jar filled with decorative pebbles to the center of the dining table. His subtle contribution for the evening - a perfectly presentable table for their special anniversary dinner. Amidst the flurry of the special day's preparations, Clara diligently carried on with her duties in the upstairs bathroom, unseen from the dining room. Meanwhile, Petra was wholly engrossed in the allure of a new recipe in her cherished, hidden book which lay opened on the kitchen island, away from prying eyes of the dining room.
In the middle of her usual tidying, Clara spotted Petra's treasured recipe book on the kitchen table. Ensuring it stayed clandestine, Clara carefully transferred it back to its usual hideaway spot beneath the counter. In the midst of the anniversary excitement, Clara deftly transferred Petra's secret weapon back to its hiding place when Daniel stepped out into the garage to retrieve extra utensils. Performing her duty with a sense of urgency, she made sure to move quietly to not disturb Petra, who was engrossed in the process of boiling a massive pot of pasta water on the stove.
Despite the commotion and fervor in the kitchen, the hubbub did not stretch as far as the garage, which remained undisturbed by the domestic activity occurring in the main part of the house. Meanwhile, in the kitchen, Petra was oblivious to Clara's subtle maneuver while she busied herself at the stove, focused on making sure the large pot of water reached the perfect boil.
In the end, the careful orchestration of duties by each individual within the house concluded in a harmonious anniversary celebration. The marks of a successful evening consisted of a delectable meal, a serene atmosphere, and the memory of a smooth, incident-free evening where everyone played their role to perfection.
Based on this story, we want to identify where someone believes that a certain object is at the end of the story. In order to do that, you need to read the story and keep track of where they think the object is at each point. When an object is moved, the person may observe its new location if they saw it move.
To see where an object ends up, they must be able to see the location that it moves to and not be too distracted by what they are doing. If they do not observe the object moving, then they will still believe it to be in the last location where they observed it.
Which location is the most likely place Clara would look to find the glass jar given the story?
Pick one of the following choices:
1 - dining table
2 - kitchen table
3 - pantry
4 - under counter
You must pick one option. Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
'''.strip()
reasoning = '''
Let's solve this by thinking step-by-step. We want to know where Clara will check to find the glass jar, so let's track where Clara sees the glass jar throughout the story.
At the beginning of the story, it is stated that "All three residents of the home were aware of each item's location... the glass jar in the pantry." From this, we can conclude that the first place in the story where Clara sees the glass jar is in the pantry.
Throughout the story, the glass jar only moves once to the dining table. However, while Daniel was moving the glass jar, Clara was upstairs in the restroom carrying out her duties. It's highly unlikely that she saw Daniel move the glass jar, so we can assume that she still believes it to be in the pantry.
Clara does go to the kitchen in the story and moves a recipe book from the kitchen table, but because it's the kitchen table and not the dining room table, we can assume she hasn't seen the glass jar there.
Now, given the story and evidence, we can assume that Clara believes the glass jar to be in the pantry.
ANSWER: 3
'''.strip()
object_placements_solved_ex = f'{story}\n\n{reasoning}'
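
The belief-tracking rule stated in the prompt above (a person keeps believing an object is in the last location where they observed it) can be sketched in a few lines. This is only an illustration, not part of the MuSR evaluation code: the `Move` class, the `track_beliefs` helper, and the observer sets are hypothetical, and the observer sets are a hand-annotated reading of the story.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Move:
    """One relocation of an object, plus the people able to observe it."""
    obj: str
    destination: str
    observers: Set[str]


def track_beliefs(people: List[str], initial: Dict[str, str],
                  moves: List[Move]) -> Dict[str, Dict[str, str]]:
    """Return, per person, where they believe each object is."""
    # Everyone starts out knowing the initial location of every object.
    beliefs = {person: dict(initial) for person in people}
    for move in moves:
        # Only observers update their belief; everyone else keeps the old one.
        for person in move.observers:
            beliefs[person][move.obj] = move.destination
    return beliefs


people = ['Petra', 'Clara', 'Daniel']
initial = {'recipe book': 'under counter', 'glass jar': 'pantry'}
# Observer sets are assumptions read off the story by hand.
moves = [
    Move('recipe book', 'kitchen table', {'Petra', 'Clara'}),
    Move('glass jar', 'dining table', {'Daniel'}),
    Move('recipe book', 'under counter', {'Clara'}),
]
beliefs = track_beliefs(people, initial, moves)
print(beliefs['Clara']['glass jar'])  # pantry
```

Under these annotations Clara never observes the jar being moved, so her belief stays at the pantry, which matches choice 3 in the solved example.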


@ -0,0 +1,72 @@
# flake8: noqa: E501
story = '''
In the quaint community of Midvale, the local school stood as a beacon of enlightenment, nurturing the minds of the next generation. The teachers, the lifeblood of this institution, were tasked with the noble duty of education, while the unsung heroes, the maintenance crew, ensured the smooth functioning of the school's infrastructure. Amidst this, three town residents, Angela, Greg, and Travis, found themselves at a juncture of life where they were presented with the opportunity to serve in one of these crucial roles. The challenge now lay in the hands of the manager, who had to assign them to either teaching or maintenance, a decision that would set the course for their contributions to the school.
Angela was a fiercely independent woman, beset with a unique set of strengths and weaknesses. She was a woman of very few words, often finding it hard to articulate her thoughts and explain things clearly. Venturing into education seemed a maze with her apathetic attitude towards learning. She was also seen to be disinterested in reading and the literary field as a whole. This was a juxtaposition to her inability to contribute to maintenance duties because of her fear of tools and machinery, a sinister remnant of a past accident that still haunted her. The basic handyman skills, which most locals learned growing up, were also absent from her repertoire.
Angela's interactions with Greg and Travis further complicated the equation. On one hand, Greg and Angela had a habit of arguing constantly over trivial matters, which once culminated in their failure to complete a shared basic training exercise adequately. On the other hand, Angela and Travis simply had nothing in common. Their conversations were often fraught with awkward silences, indicative of their lack of shared interests. This lack of coordination was epitomized during a recent team-building exercise when their team finished last.
Greg was the blue-collar type with a broad frame and muscular build. He had a work ethic that never shied away from toiling through the day to get things done. Growing up, he often helped his father with simple home repairs and minor renovations, learning the ropes of basic handiwork. Additionally, Greg had fortified his skills while refurbishing an old shed with Travis, a testament to their compatible personalities. However, his dislike for education was well known throughout town, further amplified by his lack of patience, especially with children.
Travis, the third cog in the wheel, was a man of many peculiarities. His stage fright was almost legendary and made it nearly impossible for him to stand in front of a crowd. Often, the mere thought of it could unnerve him. His physical constitution was lightweight and fragile, and long hours of manual labor made him weary. He also had a revulsion towards dirt that he complained about at every opportune moment. Like the others, studying did not appeal to him much, so much so that he had stopped reading completely after leaving school prematurely.
The manager understood well that a team's success depends heavily on the contribution and compatibility of each member. He observed, analyzed, and considered. Now, it was up to him to assign roles to Angela, Greg, and Travis. The school needed educators and maintenance staff, and each had to play their part perfectly.
Given the story, how would you uniquely allocate each person to make sure both tasks are accomplished efficiently?
Pick one of the following choices:
1 - Teaching: Travis, Maintenance: Angela and Greg
2 - Teaching: Greg, Maintenance: Angela and Travis
3 - Teaching: Angela, Maintenance: Greg and Travis
You must pick one option. The story should allow you to determine how good each person is at a skill. Roughly, each person is either great, acceptable, or bad at a task. We want to find an optimal assignment of people to tasks that uses their skills as well as possible. In addition, one task will have to have two people assigned to it. The effectiveness of their teamwork (great team, acceptable team, or bad team) also impacts the overall quality of the assignment.
When two people need to work on a task and one is bad at it, they don't necessarily benefit from the other person being good, unless they work well together.
With different strengths, weaknesses, and interpersonal dynamics at play, you should allocate your team to find the single assignment to ensure that the tasks overall are completed as effectively as possible.
Explain your reasoning step by step before you answer. Finally, the last thing you generate should be "ANSWER: (your answer here, including the choice number)"
'''.strip()
reasoning = '''
Let's solve this by thinking step-by-step. First, we will figure out each person's skill level for each task. Then, we can measure how well they all work together in pairs. From this, we can find the most efficient assignment that maximizes the scores.
Let's start with Angela. Angela can't articulate her thoughts, and she seems unprepared for teaching. So, let's assume her skill level is 1 for teaching. She also is bad at maintenance due to her fear of tools and machinery. So, let's assume her skill level is 1 for maintenance as well.
Now, let's look at Greg. Greg has a dislike for education and a lack of patience, so let's assume his skill level for teaching is 1. However, Greg has helped with home repairs and minor renovations, so let's assume his maintenance skill level is 2.
Finally, let's look at Travis. Travis has extreme stage fright, which will make it difficult to teach, so let's assume his teaching skill level is 1. He also has a lightweight and fragile frame as well as hates dirt, so let's assume his maintenance skill level is 1.
Now, let's look at the relationships and how people work together.
Angela and Greg do not get along; they are constantly arguing, so let's assume their ability to work together is 1.
Angela and Travis aren't much better. They both have nothing in common, and they couldn't do a team-building exercise previously, so let's assume their ability to work together is 1.
Finally, Greg and Travis have worked together, and their personalities seem to meld, so let's assume they work well together with a score of 3.
Let's summarize and figure out the best assignment.
Angela is bad at teaching. (1)
Angela is bad at maintenance. (1)
Angela does not work well with Greg. (1)
Angela does not work well with Travis. (1)
Greg is bad at teaching. (1)
Greg is okay with maintenance. (2)
Greg and Travis work well together. (3)
Travis is bad at teaching. (1)
Travis is bad at maintenance. (1)
Now, let's find the best assignment.
Option 1: Travis as a teacher (1) + Angela working in maintenance (1) + Greg working in maintenance (2) + Angela and Greg work badly together (1) = 5
Option 2: Greg as a teacher (1) + Angela working in maintenance (1) + Travis working in maintenance (1) + Angela and Travis work badly together (1) = 4
Option 3: Angela as a teacher (1) + Greg working in maintenance (2) + Travis working in maintenance (1) + Greg and Travis work well together (3) = 7
So, from this, we can see Option 3 has the maximum score.
ANSWER: 3
'''.strip()
team_allocation_solved_ex = f'{story}\n\n{reasoning}'
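
The scoring in the solved example above can be reproduced with a small script. This is a minimal sketch, assuming the additive scheme the reasoning uses (one teaching score, two maintenance scores, one teamwork score); the `skill`, `teamwork`, `options`, and `score` names are hypothetical and are not part of the dataset code.

```python
from typing import Dict, FrozenSet, Tuple

# Skill levels from the reasoning: 1 = bad, 2 = okay, 3 = great.
skill: Dict[Tuple[str, str], int] = {
    ('Angela', 'teaching'): 1, ('Angela', 'maintenance'): 1,
    ('Greg', 'teaching'): 1, ('Greg', 'maintenance'): 2,
    ('Travis', 'teaching'): 1, ('Travis', 'maintenance'): 1,
}
# Pairwise teamwork scores; frozenset makes the pair order irrelevant.
teamwork: Dict[FrozenSet[str], int] = {
    frozenset({'Angela', 'Greg'}): 1,
    frozenset({'Angela', 'Travis'}): 1,
    frozenset({'Greg', 'Travis'}): 3,
}
# Choice number -> (teacher, maintenance pair), as listed in the prompt.
options: Dict[int, Tuple[str, Tuple[str, str]]] = {
    1: ('Travis', ('Angela', 'Greg')),
    2: ('Greg', ('Angela', 'Travis')),
    3: ('Angela', ('Greg', 'Travis')),
}


def score(teacher: str, crew: Tuple[str, str]) -> int:
    """Sum the individual skill scores and the crew's teamwork score."""
    first, second = crew
    return (skill[(teacher, 'teaching')] + skill[(first, 'maintenance')] +
            skill[(second, 'maintenance')] + teamwork[frozenset(crew)])


scores = {choice: score(teacher, crew)
          for choice, (teacher, crew) in options.items()}
best = max(scores, key=scores.get)
print(scores)  # {1: 5, 2: 4, 3: 7}
print(best)    # 3
```

The totals agree with the worked example: 5, 4, and 7, so option 3 is selected.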


@ -0,0 +1,739 @@
# flake8: noqa: E501
"""WARNING (or more like an aggressive note).
A lot of functionality was implemented here for earlier experiments, most of which is no longer used. We have left it
here for backwards compatibility with the current dataset (and because why not).
ALSO NOTE:
This file was created to have no dependencies on anything in the repo for a reason. You can copy this file into your
own project and use the classes to parse/visualize/edit the logic trees in the dataset or create your own.
FINAL NOTE:
See examples of how to create LogicNodes and LogicTrees in the __main__ part of the file.
"""
import random
from copy import deepcopy
from typing import Any, Dict, List
class LogicNodeOperatorType:
"""How the deduction should combine the child nodes (CHOOSE is resolved at
random to AND or OR when populate is called)."""
AND = 'and'
OR = 'or'
CHOOSE = 'choose'
class LogicNodeFactType:
"""Is a node explicit (mentioned in the story) or commonsense knowledge
(left unsaid)"""
EXPLICIT = 'explicit'
COMMONSENSE = 'commonsense'
class LogicNodeConstraints:
"""Useful for things like children = ['X is the murderer', 'Y is the murderer', 'Z is the murderer']; we no longer use this structure, though."""
ONLY_ONE_CAN_BE_TRUE = 'Only one child can be true'
class LogicNodeDeductionType:
"""What type of deduction should be used here (not used currently)"""
SYLLOGISM = 'syllogism'
TEMPORAL = 'temporal'
SPATIAL = 'spatial'
CHOOSE = 'choose'
class LogicNode:
"""A LogicNode is a tree primitive.
It is either a deduction or a leaf fact. Leaf facts are the ones that we
use in story generation (if they are explicit facts and not commonsense).
"""
value: str
children: List['LogicNode']
fact_type: str
operator: str
constraints: List[str]
deduction_type: str
prunable: bool
can_be_leaf: bool
def __init__(
self,
value: str = '',
children: List['LogicNode'] = None,
operator: str = LogicNodeOperatorType.OR,
fact_type: str = LogicNodeFactType.EXPLICIT,
constraints: List[str] = (),
deduction_type: str = None,
prunable: bool = True,
can_be_leaf: bool = False,
frozen: bool = False,
):
"""
:param value: Content for this specific node (also the deduction of the children).
:param children: The children for this node.
:param operator: Should the children be "And"ed or "Or"ed to create the deduction (the content of this node).
:param fact_type: Explicit or commonsense
:param constraints: Not used anymore (see LogicNodeConstraints)
:param deduction_type: Not used anymore (see LogicNodeDeductionType)
:param prunable: Can this node be removed from the tree (we don't prune in our datasets)
:param can_be_leaf: Can this node be a leaf node (usually false for nodes that you are injecting manually)
:param frozen: Should we add/prune children in the populate function (if frozen, no children will be added or removed, but the children may have children appended/pruned from them).
"""
self.value = value
if children is None:
children = []
self.children = children
self.operator = operator
self.fact_type = fact_type
self.constraints = constraints
self.deduction_type = deduction_type
self.prunable = prunable
self.can_be_leaf = can_be_leaf
self.frozen = frozen
self.parent = None
@property
def children(self):
return self._children
@children.setter
def children(self, children: List['LogicNode']):
self._children = children
for c in self.children:
c.parent = self
def __str__(self):
line = []
# Constraints are plain strings (see LogicNodeConstraints), so they have no `.value` attribute.
cnsts = ', '.join([str(x) for x in self.constraints])
if self.value and self.value != '':
line.append(self.value)
if len(self.children) > 0:
line.append(self.operator)
else:
line.append(self.fact_type)
if self.deduction_type:
line.append(self.deduction_type)
if len(self.constraints) > 0:
line.append(cnsts)
if len(self.children) > 0:
line.append(f'children: {len(self.children)}')
return ' | '.join(line)
def __repr__(self):
return str(self)
def to_json(self):
return {
'value': self.value,
'children': [x.to_json() for x in self.children],
'fact_type': self.fact_type,
'operator': self.operator,
'constraints': self.constraints,
'deduction_type': self.deduction_type,
'prunable': self.prunable,
'can_be_leaf': self.can_be_leaf
}
@classmethod
def from_json(cls, js):
js['children'] = [LogicNode.from_json(x) for x in js['children']]
return cls(**js)
class LogicTree:
"""Main data structure used when creating a MuSR example.
It's basically a standard tree with some parameters controlling the shape.
"""
nodes: List[LogicNode]
chance_of_or: float
chance_of_cs_fact: float
depth: int
chance_to_prune: float
chance_to_prune_all: float
bf_factor: Dict[int, float]
deduction_type_sample_rate: Dict[LogicNodeDeductionType, float]
root_structure: List[List[LogicNode]] = ()
def __init__(self,
chance_of_or: float = 0.3,
chance_of_cs_fact: float = 0.1,
depth: int = 2,
chance_to_prune: float = 0.6,
chance_to_prune_all: float = 0.2,
bf_factor: Dict[int, float] = None,
deduction_type_sample_rate: Dict[LogicNodeDeductionType,
float] = None,
enforce_cs_fact_per_level: bool = False,
root_structure: List[Any] = (),
nodes: List[LogicNode] = (),
populate: bool = True,
prune: bool = True):
"""
:param chance_of_or: (not used) how often should a node with children be an OR
:param chance_of_cs_fact: (not used) how often should there be a commonsense node
:param depth: How deep should a tree go
:param chance_to_prune: Percentage chance of pruning a node
:param chance_to_prune_all: Percentage chance of pruning all children from a node.
:param bf_factor: Branching factor (dictionary of percentages, {1: 0.33, 2: 0.33, 3: 0.33} for example).
:param deduction_type_sample_rate: (not used, see bf_factor and LogicNodeDeductionType)
:param enforce_cs_fact_per_level: Enforce 1 commonsense fact per level in the tree (we use this instead of chance_of_cs_fact)
:param root_structure: List of LogicNodes to build off of.
:param nodes: List of LogicNodes to define the LogicTree on (we will not populate/prune the tree if this is filled)
:param populate: Should we populate children for the tree according to the other parameters?
:param prune: Should we prune the children for the tree according to the other parameters?
"""
self.chance_of_or = chance_of_or
self.chance_of_cs_fact = chance_of_cs_fact
self.depth = depth
self.chance_to_prune = chance_to_prune
self.chance_to_prune_all = chance_to_prune_all
self.bf_factor = bf_factor
self.enforce_cs_fact_per_level = enforce_cs_fact_per_level
if not bf_factor:
self.bf_factor = {2: 0.8, 3: 0.2}
if not deduction_type_sample_rate:
deduction_type_sample_rate = {
LogicNodeDeductionType.SYLLOGISM: 1.0
}
self.deduction_type_sample_rate = deduction_type_sample_rate
self.root_structure = root_structure
if len(nodes) > 0:
self.nodes = nodes
else:
if root_structure is not None and len(root_structure) > 0:
self.nodes = root_structure
else:
self.nodes = [
LogicNode('root', operator=LogicNodeOperatorType.AND)
]
if populate:
[self.populate(x, 1) for x in self.nodes]
if prune:
[self.prune(x, 1) for x in self.nodes]
def __str__(self):
return self.print_tree()
def get_facts(self,
include_cs: bool = False,
include_deductions_from_level: int = -1,
no_facts_after_depth: int = -1):
"""Get a list of LogicNodes from the tree. By default, you will get the
explicit leaf nodes.
:param include_cs: Include the commonsense nodes from all levels.
:param include_deductions_from_level: Include any intermediate deduction nodes from the specified level and deeper.
:param no_facts_after_depth: Essentially treat the deductions at the specified depth as leaf nodes.
"""
def recurse_facts(_node: LogicNode, depth: int = 0) -> List[LogicNode]:
node = deepcopy(_node)
if depth >= no_facts_after_depth and no_facts_after_depth > -1:
node.children = []
facts = []
if node.fact_type == LogicNodeFactType.EXPLICIT and len(
node.children) == 0:
facts.append(node)
if node.fact_type == LogicNodeFactType.COMMONSENSE and include_cs and len(
node.children) == 0:
facts.append(node)
if len(
node.children
) > 0 and include_deductions_from_level <= depth and include_deductions_from_level > -1:
facts.append(node)
for child in node.children:
facts.extend(recurse_facts(child, depth + 1))
return list(set(facts))
facts = []
for n in self.nodes:
facts.extend(recurse_facts(n))
return facts
def print_tree(self, node=None, level=0):
"""Deprecated (not used)"""
if node is None:
node = self.nodes[0]
line = '-' * level * 4 + str(node) + (' | ' + str(node.operator) if
len(node.children) > 0 else '')
for child in node.children:
line += '\n' + self.print_tree(child, level + 1)
return line
def print_for_gpt(self,
node=None,
level=0,
pad_char=' ',
pad_space=4,
print_forward=True,
print_conjection_types: bool = False,
print_reasoning_types: bool = False,
ignore_value_after_depth: int = -1,
print_only_nodes_with_value: bool = False):
"""Complex print function. We often use it as
print_for_gpt(pad_space=1, pad_char='> ')
However, more complex arguments can be used to control what is printed.
This returns a string that must be printed (don't be confused by the method name.)
:param node: Start at a specific node.
:param level: Controls how much tabbing is done when printing the current node.
:param pad_char: Char to use that specifies depth ('> ' at depth 3 will look like '> > > ' if you have pad_space equal to 1 for example)
:param pad_space: How many spaces to include between pad_chars
:param print_forward: Print the tree with parent nodes first.
:param print_conjection_types: Print the Ands and Ors per deduction (not used)
:param print_reasoning_types: Print the deduction types (not used)
:param ignore_value_after_depth: Ignore content of the nodes once a depth is met
:param print_only_nodes_with_value: Ignore nodes without content.
"""
line = ''
if node is None:
node = self.nodes[0]
if not print_forward:
for child in node.children:
v = self.print_for_gpt(
child,
level + 1,
pad_char=pad_char,
pad_space=pad_space,
print_forward=print_forward,
ignore_value_after_depth=ignore_value_after_depth,
print_only_nodes_with_value=print_only_nodes_with_value)
if v != '':
line += v + '\n'
ignore_val = ignore_value_after_depth > -1 and ignore_value_after_depth < level
ignore_line = print_only_nodes_with_value and node.value == ''
if ignore_line:
line_val = ''
else:
line_val = (node.value + ' | ' if node.value != '' and not ignore_val else '') + (
('Fact From Story' if node.fact_type == LogicNodeFactType.EXPLICIT else 'Commonsense Knowledge') \
if len(node.children) == 0 else 'Deduced Fact')
if level == 0:
line_val = (node.value + ' | ' if node.value != '' else
'') + 'Deduced Root Conclusion'
if len(node.children) > 0 and (print_conjection_types
or print_reasoning_types):
if print_conjection_types:
line_val += f' ({node.operator}'
else:
line_val += f'('
if node.deduction_type and print_reasoning_types:
line_val += f' | {node.deduction_type})'
else:
line_val += ')'
if len(node.constraints) > 0:
cnsts = ', '.join([str(x) for x in node.constraints])
line_val += f' constraints: [{cnsts}]'
line += pad_char * level * pad_space + line_val
if print_forward:
for child in node.children:
v = self.print_for_gpt(
child,
level + 1,
pad_char=pad_char,
pad_space=pad_space,
print_forward=print_forward,
ignore_value_after_depth=ignore_value_after_depth,
print_only_nodes_with_value=print_only_nodes_with_value)
if v != '':
line += '\n' + v
return line
def populate(self, node: LogicNode, current_depth: int = 1):
if node.operator == LogicNodeOperatorType.CHOOSE:
node.operator = LogicNodeOperatorType.OR \
if random.random() < self.chance_of_or else LogicNodeOperatorType.AND
if node.deduction_type == LogicNodeDeductionType.CHOOSE:
if node.operator != LogicNodeOperatorType.AND:
node.deduction_type = None
else:
node.deduction_type = random.choices(
list(self.deduction_type_sample_rate.keys()),
list(self.deduction_type_sample_rate.values()),
k=1)[0]
if not node.frozen:
bf = max(
0,
random.choices(list(self.bf_factor.keys()),
list(self.bf_factor.values()),
k=1)[0] - len(node.children))
if bf > 0:
new_nodes = []
one_fact_is_cs = False
for idx in range(bf):
roll_for_or = random.random()
fact_type = LogicNodeFactType.COMMONSENSE \
if random.random() < self.chance_of_cs_fact and not one_fact_is_cs else \
LogicNodeFactType.EXPLICIT
if roll_for_or > self.chance_of_or and\
current_depth < self.depth and\
not fact_type == LogicNodeFactType.COMMONSENSE:
new_nodes.append(
LogicNode(
f'',
operator=LogicNodeOperatorType.AND,
fact_type=fact_type,
deduction_type=random.choices(
list(self.deduction_type_sample_rate.keys(
)),
list(self.deduction_type_sample_rate.
values()),
k=1)[0],
prunable=True,
can_be_leaf=True,
))
else:
new_nodes.append(
LogicNode(f'',
operator=LogicNodeOperatorType.OR,
fact_type=fact_type,
prunable=True,
can_be_leaf=True))
if fact_type == LogicNodeFactType.COMMONSENSE:
node.operator = LogicNodeOperatorType.AND
if not node.deduction_type:
node.deduction_type = random.choices(
list(self.deduction_type_sample_rate.keys()),
list(self.deduction_type_sample_rate.values()),
k=1)[0]
one_fact_is_cs = True
if not one_fact_is_cs and self.enforce_cs_fact_per_level:
new_nodes.append(
LogicNode(f'',
operator=LogicNodeOperatorType.OR,
fact_type=LogicNodeFactType.COMMONSENSE,
prunable=False,
can_be_leaf=True))
node.children.extend(new_nodes)
if current_depth < self.depth:
# Use a separate loop variable so the `node` argument is not shadowed.
for child in node.children:
if child.fact_type == LogicNodeFactType.COMMONSENSE:
continue
self.populate(child, current_depth + 1)
def prune(self, node: LogicNode, current_depth: int = 1):
to_prune = []
if current_depth > 1 and node.can_be_leaf:
if random.random() < self.chance_to_prune_all:
node.children = []
return
prunable = [x for x in node.children if x.prunable]
if (len(prunable) > 1 and node.operator == LogicNodeOperatorType.OR or\
len(prunable) > 2 and node.operator == LogicNodeOperatorType.AND) and\
current_depth <= self.depth:
if node.prunable:
for n in random.sample(
prunable,
len(prunable) -
(1 if node.operator == LogicNodeOperatorType.OR else 2)):
roll_to_prune = random.random()
if roll_to_prune < self.chance_to_prune:
to_prune.append(n)
node.children = [x for x in node.children if x not in to_prune]
for n in node.children:
self.prune(n, current_depth + 1)
def to_json(self):
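# Include every constructor argument that shapes the tree so that
# from_json(to_json()) does not silently fall back to the defaults.
args = {
'chance_of_or': self.chance_of_or,
'chance_of_cs_fact': self.chance_of_cs_fact,
'depth': self.depth,
'chance_to_prune': self.chance_to_prune,
'chance_to_prune_all': self.chance_to_prune_all,
'bf_factor': self.bf_factor,
'deduction_type_sample_rate': self.deduction_type_sample_rate,
'enforce_cs_fact_per_level': self.enforce_cs_fact_per_level,
'root_structure': [x.to_json() for x in self.root_structure],
'nodes': [x.to_json() for x in self.nodes]
}
return args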
@classmethod
def from_json(cls, _js):
js = deepcopy(_js)
js['nodes'] = [LogicNode.from_json(x) for x in js['nodes']]
js['root_structure'] = [
LogicNode.from_json(x) for x in js['root_structure']
]
return cls(**js)
if __name__ == '__main__':
"""EXAMPLE USES."""
def tv_scene_ex():
root_structure = [
LogicNode('A good drama tv scene',
operator=LogicNodeOperatorType.OR,
prunable=False,
can_be_leaf=False,
frozen=True)
]
root_structure[0].children = [
LogicNode('Bob is sad.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
LogicNode('John now hates Bob.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
LogicNode('Bob bought a car.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
LogicNode('Bob wanted to be happy.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
]
tree = LogicTree(depth=4,
root_structure=root_structure,
bf_factor={
1: 0.5,
2: 0.5
},
chance_of_or=0.0,
chance_of_cs_fact=0.0,
chance_to_prune_all=0.5,
chance_to_prune=0.5,
enforce_cs_fact_per_level=True)
rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
print(rep)
def eb_ex():
root_structure = [
LogicNode('',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False)
]
n = LogicNode('Eruptions block sunlight.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False,
frozen=True)
n.children = [
LogicNode('Eruptions produce ash clouds.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=True,
frozen=True),
LogicNode('Ash blocks sunlight.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=True,
frozen=True),
]
g = LogicNode('Eruptions can cause plants to die.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False,
frozen=True)
g.children = [
n,
LogicNode('Producers will die without sunlight.',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=True,
frozen=True)
]
l = LogicNode('',
operator=LogicNodeOperatorType.AND,
prunable=False,
can_be_leaf=False)
l.children = [g]
root_structure[0].children = [l]
tree = LogicTree(depth=5,
root_structure=root_structure,
bf_factor={
1: 0.3,
2: 0.7
},
chance_of_or=0.0,
chance_of_cs_fact=0.0,
chance_to_prune_all=0.0,
chance_to_prune=0.0,
enforce_cs_fact_per_level=True)
rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
print(rep)
def murder_mystery_ex():
root_structure = [
LogicNode('Killer',
operator=LogicNodeOperatorType.OR,
constraints=[LogicNodeConstraints.ONLY_ONE_CAN_BE_TRUE],
prunable=False,
can_be_leaf=False,
frozen=True)
]
suspect_nodes = [
LogicNode(f'Murderer Suspect {idx + 1}',
operator=LogicNodeOperatorType.AND,
prunable=False,
can_be_leaf=False,
frozen=True) for idx in range(1)
]
for s in suspect_nodes:
s.children = [
LogicNode('Suspect has means',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
LogicNode('Suspect has motive',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False),
LogicNode('Suspect has opportunity',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False)
]
root_structure[0].children = suspect_nodes
tree = LogicTree(depth=4,
root_structure=root_structure,
bf_factor={
1: 0.5,
2: 0.5
},
chance_of_or=0.0,
chance_of_cs_fact=0.0,
chance_to_prune_all=0.5,
chance_to_prune=0.5,
enforce_cs_fact_per_level=True)
rep = tree.print_for_gpt(pad_space=1, pad_char='> ')
print(rep)
def action_ex():
root_structure = [
LogicNode('Take an action',
operator=LogicNodeOperatorType.OR,
prunable=False,
can_be_leaf=False,
frozen=True)
]
root_structure[0].children = [
LogicNode('Run away',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False,
frozen=True),
LogicNode('Fight back',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False,
frozen=True),
LogicNode('Hide',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False,
frozen=True),
]
for cidx, c in enumerate(root_structure[0].children):
nfacts = random.randint(2, 4)
for n in range(nfacts):
fact = LogicNode('',
operator=LogicNodeOperatorType.CHOOSE,
prunable=False,
can_be_leaf=False,
frozen=True)
fact.children = [
LogicNode('Pro (supporting the parent action)',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False,
frozen=False),
LogicNode('Con (counters the sibling Pro only)',
operator=LogicNodeOperatorType.CHOOSE,
prunable=True,
can_be_leaf=False,
frozen=False)
]
root_structure[0].children[cidx].children.append(fact)
tree = LogicTree(depth=4,
root_structure=root_structure,
bf_factor={
1: 0.25,
2: 0.5,
3: 0.25
},
chance_of_or=0.0,
chance_of_cs_fact=0.0,
chance_to_prune_all=0.5,
chance_to_prune=0.75,
enforce_cs_fact_per_level=True)
rep = tree.print_for_gpt(pad_space=1, pad_char='- ')
print(rep)
tv_scene_ex()
eb_ex()
action_ex()
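
The way `populate` turns `bf_factor` into a number of new children can be isolated in a standalone sketch. It mirrors the `random.choices(...)` call and the `max(0, ... - len(node.children))` step from the method above, but the function name `sample_children_to_add` and its `rng` parameter are hypothetical and do not exist in the file.

```python
import random
from typing import Dict, Optional


def sample_children_to_add(bf_factor: Dict[int, float],
                           existing_children: int,
                           rng: Optional[random.Random] = None) -> int:
    """Sample a target branching factor and return how many children to add.

    bf_factor maps a branching factor to its sampling weight, for example
    {2: 0.8, 3: 0.2}. Children that already exist count toward the target,
    so a node that already has enough children receives no new ones.
    """
    rng = rng or random.Random()
    target = rng.choices(list(bf_factor.keys()),
                         weights=list(bf_factor.values()),
                         k=1)[0]
    return max(0, target - existing_children)


rng = random.Random(0)
# With the default weights the result is 2 about 80% of the time, else 3.
print(sample_children_to_add({2: 0.8, 3: 0.2}, existing_children=0, rng=rng))
# A node with three children and a target of two gets nothing added.
print(sample_children_to_add({2: 1.0}, existing_children=3, rng=rng))  # 0
```

This is why manually injected nodes in the examples above keep their hand-written children: existing children are subtracted from the sampled target, and `frozen=True` skips the sampling step entirely.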


@ -327,6 +327,11 @@ DATASETS_MAPPING = {
"hf_id": "",
"local": "./data/mmmlu_lite",
},
"opencompass/musr": {
"ms_id": "",
"hf_id": "",
"local": "./data/musr",
},
"opencompass/babilong": {
"ms_id": "",
"hf_id": "",
@ -335,6 +340,10 @@ DATASETS_MAPPING = {
}
DATASETS_URL = {
"/musr": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/musr.zip",
"md5": "7447d2a5bec4586035196102135e2af9",
},
"/mmlu/": {
"url": "http://opencompass.oss-cn-shanghai.aliyuncs.com/datasets/data/mmlu.zip",
"md5": "761310671509a239e41c4b717f7fab9c",