# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024] [![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112) [![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
πŸ“ƒ[Paper](https://arxiv.org/abs/2403.14112) 🏰[Project Page](https://opendatalab.github.io/CHARM/) πŸ†[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html) ✨[Findings](https://opendatalab.github.io/CHARM/findings.html)
πŸ“– δΈ­ζ–‡ | English
## Dataset Description

**CHARM** is the first benchmark for comprehensive, in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate the LLMs' memorization-independent reasoning abilities and analyze their typical errors.
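To get a feel for the data itself, the short sketch below loads one task file from the repository and prints a single example. The `data/CHARM` path matches the directory that is symlinked in the evaluation steps below, but the JSON layout and the `examples`/`input`/`target` field names are assumptions for illustration and may differ from the actual files in your checkout.

```python
import json
from pathlib import Path

# NOTE: the field names below ("examples", "input", "target") are assumptions
# for illustration; adjust them to match the files in your CHARM checkout.
charm_root = Path("data/CHARM")

# Grab the first JSON task file found anywhere under the data directory.
task_file = next(charm_root.rglob("*.json"), None)
if task_file is None:
    raise SystemExit(f"No JSON task files found under {charm_root}")

with task_file.open(encoding="utf-8") as f:
    task = json.load(f)

print(f"Loaded task file: {task_file}")
if isinstance(task, dict):
    examples = task.get("examples", [])
    if examples:
        print("input: ", examples[0].get("input"))
        print("target:", examples[0].get("target"))
```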
## Comparison of commonsense reasoning benchmarks

| Benchmarks | CN-Lang | CSR | CN-specifics | Dual-Domain | Rea-Mem |
| :--------- | :-----: | :-: | :----------: | :---------: | :-----: |
| Most benchmarks in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| **CHARM (ours)** | ✔ | ✔ | ✔ | ✔ | ✔ |
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks. ## πŸ› οΈ How to Use Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation. ### 1. Download CHARM ```bash git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo} ``` ### 2. Run Inference and Evaluation ```bash cd ${path_to_opencompass} mkdir -p data ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM # Infering and evaluating CHARM with hf_llama3_8b_instruct model python run.py --models hf_llama3_8b_instruct --datasets charm_gen ``` ## πŸ–ŠοΈ Citation ```bibtex @misc{sun2024benchmarking, title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He}, year={2024}, eprint={2403.14112}, archivePrefix={arXiv}, primaryClass={cs.CL} } ```