# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL 2024]
## Dataset Description
CHARM is the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.
### Comparison of commonsense reasoning benchmarks
| Benchmarks                              | CN-Lang | CSR | CN-specific | Dual-Domain | Rea-Mem |
|-----------------------------------------|---------|-----|-------------|-------------|---------|
| Most benchmarks in davis2023benchmarks  | ✘       | ✔   | ✘           | ✘           | ✘       |
| XNLI, XCOPA, XStoryCloze                | ✔       | ✔   | ✘           | ✘           | ✘       |
| LogiQA, CLUE, CMMLU                     | ✔       | ✘   | ✔           | ✘           | ✘       |
| CORECODE                                | ✔       | ✔   | ✘           | ✘           | ✘       |
| CHARM (ours)                            | ✔       | ✔   | ✔           | ✔           | ✔       |
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.
## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.
### 1. Download CHARM

```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. Run Inference and Evaluation

```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# Infer and evaluate CHARM with the hf_llama3_8b_instruct model
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
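Alternatively, an OpenCompass run can be driven by a Python config file rather than command-line flags. The sketch below illustrates that pattern; the file name `eval_charm.py`, the imported variable names, and the model config path are assumptions based on common OpenCompass conventions, not part of the official CHARM recipe, so check the dataset configs shipped with your OpenCompass checkout (e.g. `charm_reason_gen.py`) for the actual names.

```python
# eval_charm.py -- a minimal OpenCompass config sketch.
# NOTE: the import paths and variable names below are assumptions; verify them
# against the configs in your OpenCompass installation before running.
from mmengine.config import read_base

with read_base():
    # CHARM dataset config bundled with OpenCompass (variable name assumed)
    from .datasets.CHARM.charm_reason_gen import charm_reason_datasets
    # Any model config can be swapped in here (path and variable name assumed)
    from .models.hf_llama.hf_llama3_8b_instruct import models as llama3_models

datasets = charm_reason_datasets
models = llama3_models
```

With such a config saved under `configs/`, the run can then be launched with `python run.py configs/eval_charm.py`.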
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```