# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
## Dataset Introduction

CHARM is the first benchmark that comprehensively and deeply evaluates the commonsense reasoning ability of large language models (LLMs) in Chinese. It covers both globally recognized commonsense and commonsense specific to Chinese culture. In addition, CHARM can evaluate LLMs' reasoning ability independently of memorization and analyze their typical errors.
## Comparison with Other Commonsense Reasoning Benchmarks

| Benchmark | Chinese Language | Commonsense Reasoning | China-specific Knowledge | Both Chinese and World Knowledge Domains | Reasoning-Memorization Relationship |
|---|---|---|---|---|---|
| Benchmarks surveyed in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |
## 🛠️ How to Use

The following steps quickly download CHARM and evaluate it with OpenCompass.
### 1. Download CHARM

```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. Inference and Evaluation

```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# Run inference and evaluation of hf_llama3_8b_instruct on CHARM
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
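If you prefer a reusable configuration over CLI flags, OpenCompass can also be driven by a Python config file. Below is a minimal sketch; the import paths and symbol names (`charm_gen`, `charm_datasets`, and the model config module) are assumptions mirroring the CLI flags above and the usual OpenCompass config layout, so verify them against the files under `configs/` in your checkout.

```python
# eval_charm.py -- a minimal sketch of an OpenCompass Python config.
# NOTE: the module and symbol names below are assumptions; check the
# actual files under configs/datasets/CHARM/ and configs/models/.
from mmengine.config import read_base

with read_base():
    # hypothetical dataset config mirroring `--datasets charm_gen`
    from .datasets.CHARM.charm_gen import charm_datasets
    # hypothetical model config mirroring `--models hf_llama3_8b_instruct`
    from .models.hf_llama.hf_llama3_8b_instruct import models

# OpenCompass picks up the `datasets` and `models` variables from this file.
datasets = charm_datasets
```

You would then run it with `python run.py configs/eval_charm.py`, which is equivalent to the flag-based command above.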
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
    title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
    author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
    year={2024},
    eprint={2403.14112},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```