Subjective evaluation aims to assess a model's performance on tasks that align with human preferences. The gold standard for such evaluation is human preference, but collecting human annotations is costly.
To explore the model's subjective capabilities, we employ JudgeLLM as a substitute for human assessors ([LLM-as-a-Judge](https://arxiv.org/abs/2306.05685)).
`config/eval_subjective_score.py` is largely the same as `config/eval_subjective_compare.py`; the only change needed is setting the eval mode to `singlescore`, as sketched below.
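For illustration, here is a minimal sketch of how the `singlescore` mode is typically wired up in such a config. The class names and fields used here (`SubjectiveNaivePartitioner`, `SubjectiveEvalTask`, the `mode` and `judge_cfg` arguments, and the placeholder `models` / `judge_model` variables) are assumptions based on one OpenCompass release and may differ in yours; treat the `config/eval_subjective_score.py` shipped with the repo as the authoritative reference.

```python
# Sketch only: names follow one OpenCompass release and may differ in yours.
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks.subjective_eval import SubjectiveEvalTask

models = [...]        # models under test, defined earlier in the config (elided here)
judge_model = dict()  # the JudgeLLM config, e.g. a GPT-4 API model (elided here)

eval = dict(
    partitioner=dict(
        type=SubjectiveNaivePartitioner,
        mode='singlescore',  # score each model on its own; compare configs use a pairwise mode instead
        models=models,
    ),
    runner=dict(
        type=LocalRunner,
        max_num_workers=2,
        task=dict(type=SubjectiveEvalTask, judge_cfg=judge_model),
    ),
)
```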
3. PandaLM, refer to `configs/models/judge_llm/pandalm`
Consider citing the following paper if you find it helpful:
```bibtex
@article{wang2023pandalm,
title={PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization},
author={Wang, Yidong and Yu, Zhuohao and Zeng, Zhengran and Yang, Linyi and Wang, Cunxiang and Chen, Hao and Jiang, Chaoya and Xie, Rui and Wang, Jindong and Xie, Xing and others},
journal={arXiv preprint arXiv:2306.05087},
year={2023}
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```
## Multi-round Subjective Evaluation in OpenCompass
In OpenCompass, we also support subjective multi-turn dialogue evaluation; for example, the MT-Bench evaluation can be found in `configs/eval_subjective_mtbench.py`.
For multi-turn dialogue evaluation, you need to organize the data into the following dialogue structure:
```
"dialogue": [
    {
        "role": "user",
        "content": "Imagine you are participating in a race with a group of people. If you have just overtaken the second person, what's your current position? Where is the person you just overtook?"
    },
    {
        "role": "assistant",
        "content": ""
    },
    {
        "role": "user",
        "content": "If the \"second person\" is changed to \"last person\" in the above question, what would the answer be?"
    },
    {
        "role": "assistant",
        "content": ""
    }
],
```
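For reference, the snippet below shows one way to pack raw user questions into this structure in Python. The `build_dialogue` helper and the surrounding script are purely illustrative (not part of OpenCompass); only the `role`/`content` layout comes from the format above.

```python
import json


def build_dialogue(user_turns):
    """Interleave user turns with empty assistant slots; the assistant
    `content` is left empty and is filled in turn by turn at inference time."""
    dialogue = []
    for turn in user_turns:
        dialogue.append({"role": "user", "content": turn})
        dialogue.append({"role": "assistant", "content": ""})
    return dialogue


sample = {
    "dialogue": build_dialogue([
        "Imagine you are participating in a race with a group of people. "
        "If you have just overtaken the second person, what's your current position? "
        "Where is the person you just overtook?",
        'If the "second person" is changed to "last person" in the above question, '
        "what would the answer be?",
    ])
}
print(json.dumps(sample, indent=2, ensure_ascii=False))
```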
It's important to note that, because the different question types in MT-Bench use different temperature settings, we need to split the original data file into three subsets according to temperature and run inference on each subset separately, so that each subset can be assigned its own temperature. For the specific settings, please refer to `configs/datasets/subjective/multiround/mtbench_single_judge_diff_temp.py`.
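To make the split concrete, here is a hypothetical helper (not part of OpenCompass) that groups raw MT-Bench samples into temperature-based subsets. The category-to-temperature mapping follows the original MT-Bench setup (0.7 for writing/roleplay, 0.1 for STEM/humanities, 0.0 for the rest); verify it against the config file referenced above.

```python
from collections import defaultdict

# Category -> sampling temperature, following the original MT-Bench setup;
# double-check against mtbench_single_judge_diff_temp.py for your version.
TEMPERATURE_BY_CATEGORY = {
    "writing": 0.7, "roleplay": 0.7,
    "stem": 0.1, "humanities": 0.1,
    "reasoning": 0.0, "math": 0.0, "coding": 0.0, "extraction": 0.0,
}


def split_by_temperature(samples):
    """Group MT-Bench samples (each with a 'category' field) into subsets
    that share a single sampling temperature."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[TEMPERATURE_BY_CATEGORY[sample["category"]]].append(sample)
    return subsets  # e.g. {0.7: [...], 0.1: [...], 0.0: [...]}
```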
Consider citing the following paper if you find it helpful:
```bibtex
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@misc{2023opencompass,
title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
```