# CHARM: Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]


## Dataset Description

CHARM is the first benchmark to comprehensively and deeply evaluate the commonsense reasoning abilities of large language models (LLMs) in Chinese. It covers both commonsense recognized worldwide and commonsense unique to Chinese culture. In addition, CHARM can evaluate LLMs' reasoning abilities independently of memorization and analyze their typical errors.

## Comparison with Other Commonsense Reasoning Benchmarks

| Benchmark | Chinese language | Commonsense reasoning | Chinese-specific knowledge | Both Chinese and world knowledge domains | Reasoning-memorization relationship |
| :-- | :--: | :--: | :--: | :--: | :--: |
| Benchmarks mentioned in davis2023benchmarks | ❌ | ✔️ | ❌ | ❌ | ❌ |
| XNLI, XCOPA, XStoryCloze | ✔️ | ✔️ | ❌ | ❌ | ❌ |
| LogiQA, CLUE, CMMLU | ✔️ | ❌ | ✔️ | ❌ | ❌ |
| CORECODE | ✔️ | ✔️ | ❌ | ❌ | ❌ |
| CHARM (ours) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |

## 🛠️ How to Use

Below are the steps to quickly download CHARM and evaluate it with OpenCompass.

1. Download CHARM

```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
```

2. Run inference and evaluation

```bash
cd ${path_to_opencompass}
mkdir -p data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
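
The dataset configs expect the CHARM data to be visible at `./data/CHARM` relative to the OpenCompass root, which is what the symlink above provides. A quick sanity check (ours, not part of OpenCompass):

```python
# Sanity check (ours): confirm the symlink points at the CHARM data,
# at the location the dataset configs expect (./data/CHARM).
from pathlib import Path

charm_root = Path("data/CHARM")
assert charm_root.is_dir(), "Missing data/CHARM -- re-run the ln -snf step above."
print(sorted(p.name for p in charm_root.iterdir()))  # list the CHARM data entries
```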

```bash
# Run inference and evaluation of the hf_llama3_8b_instruct model on CHARM
python run.py --models hf_llama3_8b_instruct --datasets charm_gen
```
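
Instead of CLI flags, you can also point `run.py` at a Python config file. The sketch below is ours; the imported module paths follow OpenCompass naming conventions but are assumptions, so check the actual file names under `configs/datasets/CHARM` and `configs/models` in your checkout.

```python
# configs/eval_charm.py -- a minimal sketch (hypothetical file; the import
# paths below are assumptions and may differ from the actual config names).
from mmengine.config import read_base

with read_base():
    # CHARM dataset config (name assumed; see configs/datasets/CHARM/)
    from .datasets.CHARM.charm_reason_gen import charm_reason_datasets
    # Llama-3-8B-Instruct model config (path assumed; see configs/models/)
    from .models.hf_llama.hf_llama3_8b_instruct import models

datasets = charm_reason_datasets
```

Then `python run.py configs/eval_charm.py` launches the same inference and evaluation as the CLI-flag invocation above.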

## 🖊️ Citation

```bibtex
@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```