# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024] [![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112) [![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
πŸ“ƒ[Paper](https://arxiv.org/abs/2403.14112) 🏰[Project Page](https://opendatalab.github.io/CHARM/) πŸ†[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html) ✨[Findings](https://opendatalab.github.io/CHARM/findings.html)
πŸ“– δΈ­ζ–‡ | English
## Dataset Description

**CHARM** is the first benchmark for comprehensive and in-depth evaluation of the commonsense reasoning ability of large language models (LLMs) in Chinese, covering both globally known and Chinese-specific commonsense. In addition, CHARM can evaluate LLMs' memorization-independent reasoning abilities and analyze their typical errors.
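To get a quick feel for the data after cloning the repo (Step 1 below), a minimal sketch like the following can list the task files and count their entries. The layout of `data/CHARM` and the `examples` field name are assumptions for illustration, not the documented format.

```python
# Minimal sketch for browsing the CHARM data files after cloning the repo.
# Assumptions (not confirmed by this README): tasks are stored as JSON files
# under data/CHARM, and dict-style files keep their items under an "examples" key.
import json
from pathlib import Path

data_root = Path("data/CHARM")  # path inside the cloned CHARM repo

for task_file in sorted(data_root.rglob("*.json")):
    with open(task_file, encoding="utf-8") as f:
        task = json.load(f)
    # handle both dict- and list-shaped task files
    items = task.get("examples", []) if isinstance(task, dict) else task
    print(f"{task_file.relative_to(data_root)}: {len(items)} entries")
```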
## Comparison of commonsense reasoning benchmarks

| Benchmarks                         | CN-Lang | CSR | CN-specifics | Dual-Domain | Rea-Mem |
| :--------------------------------- | :-----: | :-: | :----------: | :---------: | :-----: |
| Most benchmarks in Davis (2023)    |    ✘    |  ✔  |      ✘       |      ✘      |    ✘    |
| XNLI, XCOPA, XStoryCloze           |    ✔    |  ✔  |      ✘       |      ✘      |    ✘    |
| LogiQA, CLUE, CMMLU                |    ✔    |  ✘  |      ✔       |      ✘      |    ✘    |
| CORECODE                           |    ✔    |  ✔  |      ✘       |      ✘      |    ✘    |
| **CHARM (ours)**                   |    ✔    |  ✔  |      ✔       |      ✔      |    ✔    |
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks. ## πŸ› οΈ How to Use Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation. ### 1. Download CHARM ```bash git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo} cd ${path_to_opencompass} mkdir data ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM ``` ### 2. Run Inference and Evaluation ```bash cd ${path_to_opencompass} # modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate python run.py configs/eval_charm_rea.py -r --dump-eval-details # modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate python run.py configs/eval_charm_mem.py -r --dump-eval-details ``` The inference and evaluation results would be in `${path_to_opencompass}/outputs`, like this: ```bash outputs β”œβ”€β”€ CHARM_mem β”‚ └── chat β”‚ └── 20240605_151442 β”‚ β”œβ”€β”€ predictions β”‚ β”‚ β”œβ”€β”€ internlm2-chat-1.8b-turbomind β”‚ β”‚ β”œβ”€β”€ llama-3-8b-instruct-lmdeploy β”‚ β”‚ └── qwen1.5-1.8b-chat-hf β”‚ β”œβ”€β”€ results β”‚ β”‚ β”œβ”€β”€ internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125 β”‚ β”‚ β”œβ”€β”€ llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125 β”‚ β”‚ └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125 β”‚Β Β  └── summary β”‚Β Β  └── 20240605_205020 # MEMORY_SUMMARY_DIR β”‚Β Β  β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment β”‚Β Β  β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation β”‚Β Β  β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding β”‚Β Β  β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding β”‚Β Β  └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV └── CHARM_rea └── chat └── 20240605_152359 β”œβ”€β”€ predictions β”‚ β”œβ”€β”€ internlm2-chat-1.8b-turbomind β”‚ β”œβ”€β”€ llama-3-8b-instruct-lmdeploy β”‚ └── qwen1.5-1.8b-chat-hf β”œβ”€β”€ results # REASON_RESULTS_DIR β”‚ β”œβ”€β”€ internlm2-chat-1.8b-turbomind β”‚ β”œβ”€β”€ llama-3-8b-instruct-lmdeploy β”‚ └── qwen1.5-1.8b-chat-hf └── summary β”œβ”€β”€ summary_20240605_205328.csv # REASON_SUMMARY_CSV └── summary_20240605_205328.txt ``` ### 3. Generate Analysis Results ```bash cd ${path_to_CHARM_repo} # generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112 PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV} # generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112 PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV} # generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112 PYTHONPATH=. 
The inference and evaluation results will be placed in `${path_to_opencompass}/outputs`, like this:

```bash
outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020  # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv  # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results  # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv  # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
```

### 3. Generate Analysis Results

```bash
cd ${path_to_CHARM_repo}

# generate Table 5, Table 6, Table 9 and Table 10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}

# generate Figure 3 and Figure 9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}

# generate Table 7, Table 12, Table 13 and Figure 11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```

## 🖊️ Citation

```bibtex
@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```