
KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

KOR-Bench is a dataset designed to evaluate large language models (LLMs) on tasks that require reasoning independent of prior knowledge. Created to assess reasoning and planning abilities, KOR-Bench introduces rule-based tasks that minimize the influence of pretrained knowledge, enabling a focused evaluation of intrinsic model capabilities.

Overview

Purpose

Large language models, such as GPT-4 and Claude, excel in knowledge-based tasks but face challenges in applying reasoning skills to unfamiliar scenarios. KOR-Bench is built to evaluate such reasoning capabilities across five categories:

  • Operation: Arithmetic and logical operations.
  • Logic: Complex deductive and inductive reasoning.
  • Cipher: Code-breaking and pattern discovery.
  • Puzzle: Problem-solving with creative and logical reasoning.
  • Counterfactual: Hypothetical reasoning in alternate scenarios.

Dataset Construction

KOR-Bench tasks are designed with novel rules and configurations, ensuring no reliance on pretrained knowledge. Each task includes:

  • Rules: Custom rule sets to guide reasoning.
  • Questions: Carefully crafted problems that require the application of rules.
  • Evaluation Scenarios: Zero-shot, three-shot, and subquestion-specific configurations.

The dataset is structured to assess multistep reasoning, pattern recognition, and adaptability to new rules.
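To make the task format concrete, the sketch below shows how a single rule-based item and a zero-shot prompt might be assembled. The field names (`rule`, `question`, `answer`) and the sample operation are illustrative assumptions, not the dataset's exact schema; consult the dataset files for the real structure.

```python
# Illustrative sketch of a KOR-Bench-style rule-based item (hypothetical field names).
from dataclasses import dataclass


@dataclass
class RuleTask:
    rule: str       # newly defined rule the model must apply
    question: str   # problem that is only solvable by applying the rule
    answer: str     # reference answer used for accuracy scoring


example = RuleTask(
    rule="Define a # b as a * b + a.",   # a custom operation unseen during pretraining
    question="Compute 3 # 4.",
    answer="15",
)


def build_zero_shot_prompt(task: RuleTask) -> str:
    """Zero-shot prompt: state the rule, then the question, with no solved examples."""
    return f"Rule:\n{task.rule}\n\nQuestion:\n{task.question}\nAnswer:"


print(build_zero_shot_prompt(example))
```

A three-shot configuration follows the same pattern but prepends three solved rule/question/answer examples before the target question.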

Dataset Access

KOR-Bench is publicly available, with detailed usage instructions, in the GitHub Repository. Download the dataset and use the predefined evaluation scripts, or adapt them to your own setup.

Evaluation

  1. Install dependencies and configure your environment.
  2. Run evaluations using opencompass configs/eval_korbench.py to assess LLM performance.
  3. Analyze model performance across various reasoning tasks.

Example Command

opencompass configs/eval_korbench.py
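The entry point above is a standard OpenCompass config file. A minimal sketch is shown below; the imported config modules and the exported `*_datasets` / `models` variable names are assumptions, so check the config files shipped in this directory before use.

```python
# configs/eval_korbench.py -- minimal sketch of an OpenCompass evaluation config.
# The imported config modules and variable names below are assumptions; verify them
# against the actual files under opencompass/configs before running.
from mmengine.config import read_base

with read_base():
    # Zero-shot KOR-Bench tasks (assumed export name).
    from opencompass.configs.datasets.korbench.korbench_single_0_shot_gen import \
        korbench_0shot_single_datasets
    # An example chat model config shipped with OpenCompass (assumed path).
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
        models as internlm2_5_7b_chat_model

datasets = [*korbench_0shot_single_datasets]   # tasks the runner evaluates
models = [*internlm2_5_7b_chat_model]          # models to evaluate
```

Running `opencompass configs/eval_korbench.py` then schedules inference and scoring for every model/dataset pair and writes per-task accuracy to the output directory.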

Baselines and Results

KOR-Bench includes baseline results for leading LLMs evaluated under both zero-shot (mode gen) and few-shot (mode fewshot) configurations. A summary of the results is given below.

| dataset | version | metric | mode | internlm2_5-7b-chat-turbomind | internlm2_5-1_8b-chat-turbomind | llama-3_1-8b-instruct-turbomind | glm-4-9b-chat-turbomind | gemma-2-9b-it-turbomind |
|---|---|---|---|---|---|---|---|---|
| korbench_mixed_Multi-Q | 21f998 | accuracy | gen | 0.60 | 0.20 | 9.60 | 8.70 | 7.80 |
| korbench_mixed_Multi-R | 21f998 | accuracy | gen | 1.70 | 0.10 | 8.80 | 12.10 | 9.80 |
| korbench_mixed_Multi-RQ | 21f998 | accuracy | gen | 1.50 | 0.10 | 6.40 | 8.60 | 6.00 |
| korbench_cipher | 21f998 | accuracy | gen | 8.80 | 0.80 | 14.00 | 6.80 | 6.40 |
| korbench_counterfactual | 21f998 | accuracy | gen | 83.60 | 17.20 | 88.80 | 90.40 | 87.60 |
| korbench_logic | 21f998 | accuracy | gen | 8.40 | 3.60 | 37.60 | 38.80 | 40.80 |
| korbench_operation | 21f998 | accuracy | gen | 56.00 | 25.20 | 68.40 | 63.60 | 67.60 |
| korbench_puzzle | 21f998 | accuracy | gen | 3.60 | 0.00 | 3.20 | 3.20 | 5.60 |
| korbench_cipher | 21f998 | accuracy | fewshot | 8.40 | 3.20 | 9.60 | 9.20 | 9.60 |
| korbench_counterfactual | 21f998 | accuracy | fewshot | 87.60 | 58.00 | 23.60 | 89.60 | 84.40 |
| korbench_logic | 21f998 | accuracy | fewshot | 45.20 | 19.60 | 24.40 | 38.40 | 54.00 |
| korbench_operation | 21f998 | accuracy | fewshot | 24.80 | 11.20 | 73.20 | 67.20 | 23.20 |
| korbench_puzzle | 21f998 | accuracy | fewshot | 4.80 | 2.40 | 1.60 | 3.60 | 6.80 |

Citation

BibTeX:

@misc{ma2024korbenchbenchmarkinglanguagemodels,
title={KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks}, 
author={Kaijing Ma and Xinrun Du and Yunran Wang and Haoran Zhang and Zhoufutu Wen and Xingwei Qu and Jian Yang and Jiaheng Liu and Minghao Liu and Xiang Yue and Wenhao Huang and Ge Zhang},
year={2024},
eprint={2410.06526},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2410.06526}, 
}