# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL 2024]
## Dataset Description
CHARM is the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.
### Comparison of commonsense reasoning benchmarks
| Benchmarks                              | CN-Lang | CSR | CN-specific | Dual-Domain | Rea-Mem |
|-----------------------------------------|---------|-----|-------------|-------------|---------|
| Most benchmarks in davis2023benchmarks  | ✘       | ✔   | ✘           | ✘           | ✘       |
| XNLI, XCOPA, XStoryCloze                | ✔       | ✔   | ✘           | ✘           | ✘       |
| LogiQA, CLUE, CMMLU                     | ✔       | ✘   | ✔           | ✘           | ✘       |
| CORECODE                                | ✔       | ✔   | ✘           | ✘           | ✘       |
| CHARM (ours)                            | ✔       | ✔   | ✔           | ✔           | ✔       |
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.
## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.
### 1. Download CHARM

```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. Run Inference and Evaluation

```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# Infer and evaluate CHARM with the hf_llama3_8b_instruct model
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
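Alternatively, an OpenCompass run can be driven by a Python config file rather than command-line flags. The sketch below illustrates that pattern; the file name `eval_charm.py`, the imported variable names, and the model config path are assumptions based on common OpenCompass conventions, not part of the official CHARM recipe, so check the dataset configs shipped with your OpenCompass checkout (e.g. `charm_reason_gen.py`) for the actual names.

```python
# eval_charm.py -- a minimal OpenCompass config sketch.
# NOTE: the import paths and variable names below are assumptions; verify them
# against the configs in your OpenCompass installation before running.
from mmengine.config import read_base

with read_base():
    # CHARM dataset config bundled with OpenCompass (variable name assumed)
    from .datasets.CHARM.charm_reason_gen import charm_reason_datasets
    # Any model config can be swapped in here (path and variable name assumed)
    from .models.hf_llama.hf_llama3_8b_instruct import models as llama3_models

datasets = charm_reason_datasets
models = llama3_models
```

With such a config saved under `configs/`, the run can then be launched with `python run.py configs/eval_charm.py`.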
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```