# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
## Dataset Introduction

CHARM is the first benchmark that comprehensively and deeply evaluates the commonsense reasoning ability of large language models (LLMs) in Chinese. It covers both globally recognized commonsense and commonsense specific to Chinese culture. In addition, CHARM can evaluate LLMs' reasoning ability independently of memorization and analyze their typical errors.
## Comparison with Other Commonsense Reasoning Benchmarks

| Benchmark | Chinese Language | Commonsense Reasoning | China-specific Knowledge | Both Chinese and World Knowledge Domains | Reasoning-Memorization Relationship |
|---|---|---|---|---|---|
| Benchmarks surveyed in davis2023benchmarks | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |
## 🛠️ How to Use

The following steps quickly download CHARM and evaluate it with OpenCompass.
### 1. Download CHARM

```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```
### 2. Inference and Evaluation

```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM

# Run inference and evaluation of hf_llama3_8b_instruct on CHARM
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
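If you prefer a reusable configuration over CLI flags, OpenCompass can also be driven by a Python config file. Below is a minimal sketch; the import paths and symbol names (`charm_gen`, `charm_datasets`, and the model config module) are assumptions mirroring the CLI flags above and the usual OpenCompass config layout, so verify them against the files under `configs/` in your checkout.

```python
# eval_charm.py -- a minimal sketch of an OpenCompass Python config.
# NOTE: the module and symbol names below are assumptions; check the
# actual files under configs/datasets/CHARM/ and configs/models/.
from mmengine.config import read_base

with read_base():
    # hypothetical dataset config mirroring `--datasets charm_gen`
    from .datasets.CHARM.charm_gen import charm_datasets
    # hypothetical model config mirroring `--models hf_llama3_8b_instruct`
    from .models.hf_llama.hf_llama3_8b_instruct import models

# OpenCompass picks up the `datasets` and `models` variables from this file.
datasets = charm_datasets
```

You would then run it with `python run.py configs/eval_charm.py`, which is equivalent to the flag-based command above.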
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
    title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
    author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
    year={2024},
    eprint={2403.14112},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```