diff --git a/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md b/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
new file mode 100644
index 00000000..651004e5
--- /dev/null
+++ b/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
@@ -0,0 +1,169 @@
+# CompassArena-SubjectiveBench (Pairwise Eval with Bradley-Terry Model)
+
+## Introduction
+
+The following introduction comes from the abstract of [Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference](https://arxiv.org/abs/2403.04132):
+
+>Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies.
+
+For this dataset, we adapt the Bradley-Terry rating system from FastChat to the subjective evaluation setting, replacing human evaluators with LLM-as-a-judge.
+
+
+## Official Links
+
+- Paper: [Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference](https://arxiv.org/abs/2403.04132)
+- GitHub Repository: [FastChat](https://github.com/lm-sys/FastChat/tree/main)
+
+
+## Overview and Usage
+
+### Inference
+
+During the inference stage, each LLM generates a response to the presented input (a single question for single-turn and an entire conversation for multi-turn).
+
+### Evaluation
+
+During the evaluation stage, the judge model responds with a critique and chooses the LLM with the better answer for each pair. This preference is later used to form the "winner" response variable in the postprocessor. Note that the predictions for each model must be saved (by setting `keep_predictions=True` in the evaluator config) so that the postprocessor can calculate style features. See `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py` for an example.
+
+
+#### Postprocessor
+After evaluation by the judge model, the postprocessor gathers the pairwise matchups and any additional group variables (e.g., difficulty, category), as sketched below. Note that the LLM predictions ("prediction1" and "prediction2") must be passed on from the inference stage; otherwise, an error is raised.
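+
+For illustration, each matchup record assembled by the postprocessor looks roughly like the following (field names follow `compassarena_subjectiveeval_bradleyterry_postprocess`; the concrete values here are made up):
+
+```python
+match = {
+    'winner': 'model_a',  # judge verdict mapped to a winner label (e.g. 'model_a' or 'model_b')
+    'model_a': 'qwen2.5-7b-instruct-turbomind',
+    'model_b': 'Qwen-2.5-72B-Instruct',
+    'prediction1': '<full response of model_a>',  # needed to compute style features
+    'prediction2': '<full response of model_b>',
+    'category': '代码',      # group variables passed through from the dataset
+    'difficulty': 'Medium',
+}
+```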
+
+
+### Summary
+
+After the judge model has produced its verdicts in the evaluation stage, we fit a Bradley-Terry (BT) statistical model to estimate the rating and ranking of each LLM, with an option to include style features and group control variables. The settings below control the specification of the BT model as well as how results are reported (see the example configuration after this list):
+
+- `rating_system`: The rating system used. Currently only supports "bradleyterry".
+
+- `num_bootstrap`: The number of bootstrap iterations used to estimate the confidence intervals of the ratings.
+
+- `with_control_vars`: Whether to include additional covariates (including style features and group variables) when fitting the BT model.
+
+- `normalize_style_features`: Whether to normalize style features BEFORE fitting the BT model (following FastChat's implementation). Turn this off for easier interpretation of the odds ratios (when `odds_ratio==True`).
+
+- `odds_ratio`: Whether to report odds ratios ($e^{\beta_i}$) instead of the original coefficients. See the section "Estimated Coefficients of Control Variables" for more explanation.
+
+- `groups`: List of group variables to include while fitting the BT model. These must be available in the input dataset for each observation. Group variables are assumed to be categorical and one-hot encoding is automatically performed before model fitting.
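+
+For reference, the summarizer configuration in `configs/eval_compassarena_subjectivebench_bradleyterry.py` wires these settings together as follows:
+
+```python
+from opencompass.summarizers import CompassArenaBradleyTerrySummarizer
+
+# Fit the BT model with style features and group control variables,
+# and report odds ratios for the fitted coefficients.
+summarizer = dict(
+    type=CompassArenaBradleyTerrySummarizer,
+    rating_system='bradleyterry',
+    num_bootstrap=100,
+    num_cpu=None,
+    with_control_vars=True,
+    normalize_style_features=False,
+    odds_ratio=True,
+    groups=['difficulty', 'category'],
+)
+```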
+
+
+### Config Files
+
+1. Dataset configs:
+
+ - single turn: `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py`
+ - multi-turn: `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py`
+
+2. Evaluation config:
+
+ - `configs/eval_compassarena_subjectivebench_bradleyterry.py`
+
+## Evaluation Results
+
+### Bradley-Terry Rating
+
+The rating of each model is a scaled version of the estimated "strength" coefficients of the fitted Bradley-Terry model. We use the Elo scale with an initial rating of 1000 and a scaling factor of 400 to match the scale used in [CompassArena](https://opencompass.org.cn/arena). Furthermore, we anchor the ratings on the base model, as it naturally represents the reference model we are comparing against. This is why the base model always has a rating of 1000 with zero standard deviation.
+
+```
+ dataset version base_model metric mode ranking ranking_ub model_name rating rating_q975 rating_q025 std_dev num_battles
+0 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 1 1 Qwen-2.5-72B-Instruct 1000.00 1000.00 1000.00 0.00 4229
+1 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 2 2 qwen2.5-32b-instruct-turbomind 926.54 941.72 908.29 8.21 1055
+2 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 3 2 qwen2.5-14b-instruct-turbomind 907.23 921.08 897.09 6.68 1055
+3 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 4 2 qwen2-7b-instruct-turbomind 901.99 919.06 885.95 8.44 1060
+4 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 5 2 qwen2.5-7b-instruct-turbomind 893.03 910.58 877.02 8.65 1059
+5 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 1 1 Qwen-2.5-72B-Instruct 1000.00 1000.00 1000.00 0.00 1127
+6 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 2 2 qwen2.5-32b-instruct-turbomind 942.53 972.14 903.84 18.89 282
+7 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 3 2 qwen2-7b-instruct-turbomind 940.34 974.22 895.80 21.72 282
+8 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 4 2 qwen2.5-14b-instruct-turbomind 929.09 959.98 896.80 18.16 282
+9 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 5 2 qwen2.5-7b-instruct-turbomind 907.07 936.71 876.88 16.87 281
+```
+
+### Estimated Coefficients of Control Variables
+
+The scale and interpretation of these numbers depend on the summarizer settings for `CompassArenaBradleyTerrySummarizer`. If `normalize_style_features` is set, the style features are the normalized relative difference between model A and B, with the following form:
+$$
+\text{normalize }\left(\frac{\text{feature}_A - \text{feature}_B}{\text{feature}_A + \text{feature}_B}\right)
+$$
+
+See [Does Style Matter?](https://blog.lmarena.ai/blog/2024/style-control/) for more information.
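+
+As a rough sketch (not the exact summarizer implementation), the per-battle covariate for a single style feature can be computed as:
+
+```python
+def style_covariate(feature_a: float, feature_b: float) -> float:
+    """Relative difference in one style feature between model A and model B.
+
+    With `normalize_style_features=True`, the summarizer additionally
+    standardizes these values across battles (FastChat's implementation)
+    before fitting the BT model.
+    """
+    denom = feature_a + feature_b
+    return (feature_a - feature_b) / denom if denom != 0 else 0.0
+
+
+# e.g. response lengths (token counts) of model A vs. model B in one battle
+print(style_covariate(feature_a=512, feature_b=256))  # ~0.333
+```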
+
+Additionally, if `odds_ratio` is set, the odds ratios are returned instead of the raw coefficients. In other words, we report:
+
+$$
+\text{OddsRatio}_i = \frac{e^{\beta_0 + \beta_i(x_i+1) + \sum_{j\ne i}^m\beta_jx_j}}{e^{\beta_0 + \beta_ix_i + \sum_{j\ne i}^m\beta_jx_j}} = e^{\beta_i}
+$$
+
+which can be interpreted as the multiplicative increase in the odds of winning for every one-unit increase in $x_i$.
+
+For example, the following results are reported with `normalize_style_features==False` and `odds_ratio==True`:
+```
+{
+ "singleturn": {
+ "Qwen-2.5-72B-Instruct": {
+ "sum_assistant_tokens": 6.577376545800252,
+ "header_count": 1.4880636137846999,
+ "list_count": 1.1558594451186806,
+ "bold_count": 1.7918326386585717,
+ "difficulty_Advanced": 1.0281620474711213,
+ "difficulty_Easy": 1.0557367496235666,
+ "difficulty_Medium": 1.1768581931447049,
+ "category_人类对齐": 0.8087074923883157,
+ "category_代码": 1.2717334332407775,
+ "category_创作": 1.0430652013278148,
+ "category_推理": 1.1592759054335746,
+ "category_日常对话": 0.979047716903164,
+ "category_自然语言处理": 1.006707704304149,
+ "category_角色扮演": 1.2296103927210726,
+ "category_重写": 0.7952522120597192,
+ "category_领域知识问答": 1.0658003517547319
+ }
+ },
+ "multiturn": {
+ "Qwen-2.5-72B-Instruct": {
+ "sum_assistant_tokens": 4.470153434554273,
+ "header_count": 1.130542616688942,
+ "list_count": 1.4753419673439991,
+ "bold_count": 1.476348454534956,
+ "difficulty_Advanced": 1.1668553174437737,
+ "difficulty_Easy": 1.142118410006132,
+ "difficulty_Medium": 0.9651479035385795,
+ "category_人类对齐": 0.9606676068409767,
+ "category_代码": 0.9348722519214725,
+ "category_创作": 1.0362490715530026,
+ "category_推理": 0.8546385641566406,
+ "category_日常对话": 1.0481269627721679,
+ "category_自然语言处理": 1.358391853082614,
+ "category_角色扮演": 1.0432636535119493,
+ "category_重写": 0.7398232857603452,
+ "category_领域知识问答": 1.4715970942932421
+ }
+ }
+}
+```
+Example Interpretation:
+- For the single-turn dataset with "Qwen-2.5-72B-Instruct" as the base model, holding all else constant, the odds of winning are 6.6 times greater for every one-unit increase in the (unnormalized) relative difference in response length between model A and model B.
+
+- For the multi-turn dataset with "Qwen-2.5-72B-Instruct" as the base model, holding all else constant, the odds of winning are about 26% lower (1 - 0.74) for "rewrite" (重写) category questions than for questions in other categories.
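+
+Since the reported values are odds ratios, converting back to a raw BT coefficient is just a logarithm; for example:
+
+```python
+import math
+
+# The reported odds ratio of ~6.577 for `sum_assistant_tokens` corresponds to
+# a raw BT coefficient of ln(6.577) ~= 1.88, and exp(1.88) recovers ~6.577.
+beta = math.log(6.577376545800252)
+print(round(beta, 4))            # 1.8836
+print(round(math.exp(beta), 3))  # 6.577
+```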
+
+
+## Citation
+```
+@misc{chiang2024chatbotarenaopenplatform,
+ title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
+ author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
+ year={2024},
+ eprint={2403.04132},
+ archivePrefix={arXiv},
+ primaryClass={cs.AI},
+ url={https://arxiv.org/abs/2403.04132},
+}
+
+@misc{zheng2023judging,
+ title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
+ author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
+ year={2023},
+ eprint={2306.05685},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+```
diff --git a/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py b/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py
new file mode 100644
index 00000000..9e4aea47
--- /dev/null
+++ b/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py
@@ -0,0 +1,85 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import ( # compassarena_subjectiveeval_pairwise_postprocess,
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_bradleyterry_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import ChatInferencer
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+
+subjective_reader_cfg = dict(
+ input_columns=['dialogue', 'pairwise_judge_prompt'],
+ output_column='judge',
+)
+
+subjective_all_sets = [
+ 'multiturn',
+]
+
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
+
+compassarena_subjectivebench_bradleyterry_multiturn_datasets = []
+
+
+for _name in subjective_all_sets:
+ subjective_infer_cfg = dict(
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{dialogue}'),
+ ]
+ ),
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(
+ type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'
+ ),
+ )
+
+ subjective_eval_cfg = dict(
+ evaluator=dict(
+ type=LMEvaluator,
+ pack_all_predictions=True,
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_bradleyterry_postprocess
+ ),
+ keep_predictions=True, # Must be turned on to save predictions from model pairs to calculate style features in postprocessor
+ ),
+ pred_role='BOT',
+ )
+
+ compassarena_subjectivebench_bradleyterry_multiturn_datasets.append(
+ dict(
+ abbr=f'{_name}',
+ type=CompassArenaSubjectiveBench,
+ path='./data/subjective/CompassArenaSubjectiveBench',
+ name=_name,
+ reader_cfg=subjective_reader_cfg,
+ infer_cfg=subjective_infer_cfg,
+ eval_cfg=subjective_eval_cfg,
+ mode='m2n',
+ infer_order='random',
+ base_models=qwen_2_5_72b,
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py b/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
index fd213ec6..c4e7a6ee 100644
--- a/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
+++ b/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
@@ -1,40 +1,47 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import ChatInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
-from opencompass.openicl.icl_inferencer import ChatInferencer
-from opencompass.openicl.icl_evaluator import LMEvaluator
-from opencompass.datasets import CompassArenaSubjectiveBench, compassarena_subjectiveeval_pairwise_postprocess
-from mmengine.config import read_base
subjective_reader_cfg = dict(
input_columns=['dialogue', 'pairwise_judge_prompt'],
output_column='judge',
- )
+)
subjective_all_sets = [
'multiturn',
]
-qwen_2_5_72b = [dict(
- abbr='Qwen-2.5-72B-Instruct',
-)]
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
compassarena_subjectivebench_multiturn_datasets = []
for _name in subjective_all_sets:
subjective_infer_cfg = dict(
- prompt_template=dict(
- type=PromptTemplate,
- template=dict(round=[
- dict(
- role='HUMAN',
- prompt='{dialogue}'
- ),
- ]),
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{dialogue}'),
+ ]
),
- retriever=dict(type=ZeroRetriever),
- inferencer=dict(type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'),
- )
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(
+ type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'
+ ),
+ )
subjective_eval_cfg = dict(
evaluator=dict(
@@ -44,13 +51,13 @@ for _name in subjective_all_sets:
type=PromptTemplate,
template=dict(
round=[
- dict(
- role='HUMAN',
- prompt = '{pairwise_judge_prompt}'
- ),
- ]),
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_pairwise_postprocess
),
- dict_postprocessor=dict(type=compassarena_subjectiveeval_pairwise_postprocess),
),
pred_role='BOT',
)
@@ -67,5 +74,11 @@ for _name in subjective_all_sets:
mode='m2n',
infer_order='double',
base_models=qwen_2_5_72b,
- given_pred = [{'abbr':'Qwen-2.5-72B-Instruct', 'path':'./data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct'}],
- ))
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py b/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py
new file mode 100644
index 00000000..d14b82ff
--- /dev/null
+++ b/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py
@@ -0,0 +1,83 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_bradleyterry_postprocess,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import GenInferencer
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+
+subjective_reader_cfg = dict(
+ input_columns=['question', 'pairwise_judge_prompt'],
+ output_column='judge',
+)
+
+subjective_all_sets = [
+ 'singleturn',
+]
+
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
+
+compassarena_subjectivebench_bradleyterry_singleturn_datasets = []
+
+
+for _name in subjective_all_sets:
+ subjective_infer_cfg = dict(
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{question}'),
+ ]
+ ),
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(type=GenInferencer, max_out_len=4096),
+ )
+
+ subjective_eval_cfg = dict(
+ evaluator=dict(
+ type=LMEvaluator,
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_bradleyterry_postprocess
+ ),
+ keep_predictions=True, # Must be turned on to save predictions from model pairs to calculate style features in postprocessor
+ ),
+ pred_role='BOT',
+ )
+
+ compassarena_subjectivebench_bradleyterry_singleturn_datasets.append(
+ dict(
+ abbr=f'{_name}',
+ type=CompassArenaSubjectiveBench,
+ path='./data/subjective/CompassArenaSubjectiveBench',
+ name=_name,
+ reader_cfg=subjective_reader_cfg,
+ infer_cfg=subjective_infer_cfg,
+ eval_cfg=subjective_eval_cfg,
+ mode='m2n',
+ infer_order='random',
+ base_models=qwen_2_5_72b,
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py b/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
index bb25e750..4f3022b5 100644
--- a/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
+++ b/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
@@ -1,40 +1,45 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
-from opencompass.openicl.icl_inferencer import GenInferencer
-from opencompass.openicl.icl_evaluator import LMEvaluator
-from opencompass.datasets import CompassArenaSubjectiveBench, compassarena_subjectiveeval_pairwise_postprocess
-from mmengine.config import read_base
subjective_reader_cfg = dict(
input_columns=['question', 'pairwise_judge_prompt'],
output_column='judge',
- )
+)
subjective_all_sets = [
'singleturn',
]
-qwen_2_5_72b = [dict(
- abbr='Qwen-2.5-72B-Instruct',
-)]
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
compassarena_subjectivebench_singleturn_datasets = []
for _name in subjective_all_sets:
subjective_infer_cfg = dict(
- prompt_template=dict(
- type=PromptTemplate,
- template=dict(round=[
- dict(
- role='HUMAN',
- prompt='{question}'
- ),
- ]),
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{question}'),
+ ]
),
- retriever=dict(type=ZeroRetriever),
- inferencer=dict(type=GenInferencer, max_out_len=4096),
- )
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(type=GenInferencer, max_out_len=4096),
+ )
subjective_eval_cfg = dict(
evaluator=dict(
@@ -43,13 +48,13 @@ for _name in subjective_all_sets:
type=PromptTemplate,
template=dict(
round=[
- dict(
- role='HUMAN',
- prompt = '{pairwise_judge_prompt}'
- ),
- ]),
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_pairwise_postprocess
),
- dict_postprocessor=dict(type=compassarena_subjectiveeval_pairwise_postprocess),
),
pred_role='BOT',
)
@@ -66,5 +71,11 @@ for _name in subjective_all_sets:
mode='m2n',
infer_order='double',
base_models=qwen_2_5_72b,
- given_pred = [{'abbr':'Qwen-2.5-72B-Instruct', 'path':'./data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct'}],
- ))
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/configs/eval_compassarena_subjectivebench_bradleyterry.py b/configs/eval_compassarena_subjectivebench_bradleyterry.py
new file mode 100644
index 00000000..de887718
--- /dev/null
+++ b/configs/eval_compassarena_subjectivebench_bradleyterry.py
@@ -0,0 +1,132 @@
+from mmengine.config import read_base
+
+with read_base():
+ from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.singleturn.pairwise_bt_judge import (
+ compassarena_subjectivebench_bradleyterry_singleturn_datasets,
+ )
+ from opencompass.configs.datasets.subjective.compass_arena_subjective_bench.multiturn.pairwise_bt_judge import (
+ compassarena_subjectivebench_bradleyterry_multiturn_datasets,
+ )
+
+ from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import (
+ models as lmdeploy_internlm2_5_7b_chat,
+ )
+ from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import (
+ models as lmdeploy_internlm2_5_20b_chat,
+ )
+ from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import (
+ models as lmdeploy_llama3_1_8b_instruct,
+ )
+ from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_70b_instruct import (
+ models as lmdeploy_llama3_1_70b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import (
+ models as lmdeploy_qwen2_5_0_5b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import (
+ models as lmdeploy_qwen2_5_1_5b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import (
+ models as lmdeploy_qwen2_5_3b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b_instruct import (
+ models as lmdeploy_qwen2_5_7b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import (
+ models as lmdeploy_qwen2_5_14b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b_instruct import (
+ models as lmdeploy_qwen2_5_32b_instruct,
+ )
+ from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import (
+ models as lmdeploy_qwen2_5_72b_instruct,
+ )
+ from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import (
+ models as lmdeploy_qwen2_7b_instruct,
+ )
+
+from opencompass.models import (
+ HuggingFace,
+ HuggingFaceCausalLM,
+ HuggingFaceChatGLM3,
+ OpenAI,
+ TurboMindModelwithChatTemplate,
+)
+from opencompass.partitioners import NaivePartitioner, SizePartitioner
+from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
+from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
+from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
+from opencompass.runners import LocalRunner, SlurmSequentialRunner
+from opencompass.summarizers import CompassArenaBradleyTerrySummarizer
+from opencompass.tasks import OpenICLInferTask
+from opencompass.tasks.subjective_eval import SubjectiveEvalTask
+
+api_meta_template = dict(
+ round=[
+ dict(role='HUMAN', api_role='HUMAN'),
+ dict(role='BOT', api_role='BOT', generate=True),
+ ]
+)
+
+# -------------Inference Stage ----------------------------------------
+models = [
+ *lmdeploy_qwen2_5_14b_instruct,
+ *lmdeploy_qwen2_5_32b_instruct,
+ *lmdeploy_qwen2_5_7b_instruct,
+ *lmdeploy_qwen2_7b_instruct,
+]
+
+datasets = [
+ *compassarena_subjectivebench_bradleyterry_singleturn_datasets,
+ *compassarena_subjectivebench_bradleyterry_multiturn_datasets,
+]
+
+infer = dict(
+ partitioner=dict(type=NaivePartitioner),
+ runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
+)
+# -------------Evalation Stage ----------------------------------------
+
+## ------------- JudgeLLM Configuration
+judge_models = [
+ dict(
+ type=TurboMindModelwithChatTemplate,
+ abbr='CompassJudger-1-32B-Instruct',
+ path='opencompass/CompassJudger-1-32B-Instruct',
+ engine_config=dict(session_len=16384, max_batch_size=16, tp=4),
+ gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
+ max_seq_len=16384,
+ max_out_len=2048,
+ batch_size=16,
+ run_cfg=dict(num_gpus=4),
+ )
+]
+
+## ------------- Evaluation Configuration
+eval = dict(
+ partitioner=dict(
+ type=SubjectiveNaivePartitioner,
+ models=models,
+ judge_models=judge_models,
+ ),
+ runner=dict(
+ type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)
+ ),
+)
+
+## ------------- Summary Configuration
+# This step fits a Bradley-Terry model (statistical model) with an option
+# to include style features and control variables based on groups
+# (group variables must be available in the input dataset for each observation).
+summarizer = dict(
+ type=CompassArenaBradleyTerrySummarizer,
+ rating_system='bradleyterry',
+ num_bootstrap=100,
+ num_cpu=None,
+ with_control_vars=True,
+ normalize_style_features=False,
+ odds_ratio=True,
+ groups=['difficulty', 'category'],
+)
+
+work_dir = 'outputs/compassarena_subjectivebench_bradleyterry/'
diff --git a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
new file mode 100644
index 00000000..651004e5
--- /dev/null
+++ b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/README_pairwise_bt.md
@@ -0,0 +1,169 @@
+# CompassArena-SubjectiveBench (Pairwise Eval with Bradley-Terry Model)
+
+## Introduction
+
+The following introduction comes from the abstract of [Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference](https://arxiv.org/abs/2403.04132):
+
+>Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies.
+
+For this dataset, we adapt the Bradley-Terry rating system from FastChat to the subjective evaluation setting, replacing human evaluators with LLM-as-a-judge.
+
+
+## Official Links
+
+- Paper: [Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference](https://arxiv.org/abs/2403.04132)
+- GitHub Repository: [FastChat](https://github.com/lm-sys/FastChat/tree/main)
+
+
+## Overview and Usage
+
+### Inference
+
+During the inference stage, each LLM generates a response to the presented input (a single question for single-turn and an entire conversation for multi-turn).
+
+### Evaluation
+
+During the evaluation stage, the judge model responds with a critique and chooses the LLM with the better answer for each pair. This preference is later used to form the "winner" response variable in the postprocessor. Note that the predictions for each model must be saved (by setting `keep_predictions=True` in the evaluator config) so that the postprocessor can calculate style features. See `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py` for an example.
+
+
+#### Postprocessor
+After evaluation by the judge model, the postprocessor gathers the pairwise matchups and any additional group variables (e.g., difficulty, category), as sketched below. Note that the LLM predictions ("prediction1" and "prediction2") must be passed on from the inference stage; otherwise, an error is raised.
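+
+For illustration, each matchup record assembled by the postprocessor looks roughly like the following (field names follow `compassarena_subjectiveeval_bradleyterry_postprocess`; the concrete values here are made up):
+
+```python
+match = {
+    'winner': 'model_a',  # judge verdict mapped to a winner label (e.g. 'model_a' or 'model_b')
+    'model_a': 'qwen2.5-7b-instruct-turbomind',
+    'model_b': 'Qwen-2.5-72B-Instruct',
+    'prediction1': '<full response of model_a>',  # needed to compute style features
+    'prediction2': '<full response of model_b>',
+    'category': '代码',      # group variables passed through from the dataset
+    'difficulty': 'Medium',
+}
+```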
+
+
+### Summary
+
+After the judge model has produced its verdicts in the evaluation stage, we fit a Bradley-Terry (BT) statistical model to estimate the rating and ranking of each LLM, with an option to include style features and group control variables. The settings below control the specification of the BT model as well as how results are reported (see the example configuration after this list):
+
+- `rating_system`: The rating system used. Currently only supports "bradleyterry".
+
+- `num_bootstrap`: The number of bootstrap iterations used to estimate the confidence intervals of the ratings.
+
+- `with_control_vars`: Whether to include additional covariates (including style features and group variables) when fitting the BT model.
+
+- `normalize_style_features`: Whether to normalize style features BEFORE fitting the BT model (following FastChat's implementation). Turn this off for easier interpretation of the odds ratios (when `odds_ratio==True`).
+
+- `odds_ratio`: Whether to report odds ratios ($e^{\beta_i}$) instead of the original coefficients. See the section "Estimated Coefficients of Control Variables" for more explanation.
+
+- `groups`: List of group variables to include while fitting the BT model. These must be available in the input dataset for each observation. Group variables are assumed to be categorical and one-hot encoding is automatically performed before model fitting.
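+
+For reference, the summarizer configuration in `configs/eval_compassarena_subjectivebench_bradleyterry.py` wires these settings together as follows:
+
+```python
+from opencompass.summarizers import CompassArenaBradleyTerrySummarizer
+
+# Fit the BT model with style features and group control variables,
+# and report odds ratios for the fitted coefficients.
+summarizer = dict(
+    type=CompassArenaBradleyTerrySummarizer,
+    rating_system='bradleyterry',
+    num_bootstrap=100,
+    num_cpu=None,
+    with_control_vars=True,
+    normalize_style_features=False,
+    odds_ratio=True,
+    groups=['difficulty', 'category'],
+)
+```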
+
+
+### Config Files
+
+1. Dataset configs:
+
+ - single turn: `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py`
+ - multi-turn: `opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py`
+
+2. Evaluation config:
+
+ - `configs/eval_compassarena_subjectivebench_bradleyterry.py`
+
+## Evaluation Results
+
+### Bradley-Terry Rating
+
+The rating of each model is a scaled version of the estimated "strength" coefficients of the fitted Bradley-Terry model. We use the Elo scale with an initial rating of 1000 and a scaling factor of 400 to match the scale used in [CompassArena](https://opencompass.org.cn/arena). Furthermore, we anchor the ratings on the base model, as it naturally represents the reference model we are comparing against. This is why the base model always has a rating of 1000 with zero standard deviation.
+
+```
+ dataset version base_model metric mode ranking ranking_ub model_name rating rating_q975 rating_q025 std_dev num_battles
+0 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 1 1 Qwen-2.5-72B-Instruct 1000.00 1000.00 1000.00 0.00 4229
+1 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 2 2 qwen2.5-32b-instruct-turbomind 926.54 941.72 908.29 8.21 1055
+2 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 3 2 qwen2.5-14b-instruct-turbomind 907.23 921.08 897.09 6.68 1055
+3 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 4 2 qwen2-7b-instruct-turbomind 901.99 919.06 885.95 8.44 1060
+4 singleturn 635142 Qwen-2.5-72B-Instruct bt_rating gen 5 2 qwen2.5-7b-instruct-turbomind 893.03 910.58 877.02 8.65 1059
+5 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 1 1 Qwen-2.5-72B-Instruct 1000.00 1000.00 1000.00 0.00 1127
+6 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 2 2 qwen2.5-32b-instruct-turbomind 942.53 972.14 903.84 18.89 282
+7 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 3 2 qwen2-7b-instruct-turbomind 940.34 974.22 895.80 21.72 282
+8 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 4 2 qwen2.5-14b-instruct-turbomind 929.09 959.98 896.80 18.16 282
+9 multiturn fff2b4 Qwen-2.5-72B-Instruct bt_rating unknown 5 2 qwen2.5-7b-instruct-turbomind 907.07 936.71 876.88 16.87 281
+```
+
+### Estimated Coefficients of Control Variables
+
+The scale and interpretation of these numbers depend on the summarizer settings for `CompassArenaBradleyTerrySummarizer`. If `normalize_style_features` is set, the style features are the normalized relative difference between model A and B, with the following form:
+$$
+\text{normalize }\left(\frac{\text{feature}_A - \text{feature}_B}{\text{feature}_A + \text{feature}_B}\right)
+$$
+
+See [Does Style Matter?](https://blog.lmarena.ai/blog/2024/style-control/) for more information.
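+
+As a rough sketch (not the exact summarizer implementation), the per-battle covariate for a single style feature can be computed as:
+
+```python
+def style_covariate(feature_a: float, feature_b: float) -> float:
+    """Relative difference in one style feature between model A and model B.
+
+    With `normalize_style_features=True`, the summarizer additionally
+    standardizes these values across battles (FastChat's implementation)
+    before fitting the BT model.
+    """
+    denom = feature_a + feature_b
+    return (feature_a - feature_b) / denom if denom != 0 else 0.0
+
+
+# e.g. response lengths (token counts) of model A vs. model B in one battle
+print(style_covariate(feature_a=512, feature_b=256))  # ~0.333
+```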
+
+Additionally, if `odds_ratio` is set, the odds ratios are returned instead of the raw coefficients. In other words, we report:
+
+$$
+\text{OddsRatio}_i = \frac{e^{\beta_0 + \beta_i(x_i+1) + \sum_{j\ne i}^m\beta_jx_j}}{e^{\beta_0 + \beta_ix_i + \sum_{j\ne i}^m\beta_jx_j}} = e^{\beta_i}
+$$
+
+which can be interpreted as the multiplicative increase in the odds of winning for every one-unit increase in $x_i$.
+
+For example, the following results are reported with `normalize_style_features==False` and `odds_ratio==True`:
+```
+{
+ "singleturn": {
+ "Qwen-2.5-72B-Instruct": {
+ "sum_assistant_tokens": 6.577376545800252,
+ "header_count": 1.4880636137846999,
+ "list_count": 1.1558594451186806,
+ "bold_count": 1.7918326386585717,
+ "difficulty_Advanced": 1.0281620474711213,
+ "difficulty_Easy": 1.0557367496235666,
+ "difficulty_Medium": 1.1768581931447049,
+ "category_人类对齐": 0.8087074923883157,
+ "category_代码": 1.2717334332407775,
+ "category_创作": 1.0430652013278148,
+ "category_推理": 1.1592759054335746,
+ "category_日常对话": 0.979047716903164,
+ "category_自然语言处理": 1.006707704304149,
+ "category_角色扮演": 1.2296103927210726,
+ "category_重写": 0.7952522120597192,
+ "category_领域知识问答": 1.0658003517547319
+ }
+ },
+ "multiturn": {
+ "Qwen-2.5-72B-Instruct": {
+ "sum_assistant_tokens": 4.470153434554273,
+ "header_count": 1.130542616688942,
+ "list_count": 1.4753419673439991,
+ "bold_count": 1.476348454534956,
+ "difficulty_Advanced": 1.1668553174437737,
+ "difficulty_Easy": 1.142118410006132,
+ "difficulty_Medium": 0.9651479035385795,
+ "category_人类对齐": 0.9606676068409767,
+ "category_代码": 0.9348722519214725,
+ "category_创作": 1.0362490715530026,
+ "category_推理": 0.8546385641566406,
+ "category_日常对话": 1.0481269627721679,
+ "category_自然语言处理": 1.358391853082614,
+ "category_角色扮演": 1.0432636535119493,
+ "category_重写": 0.7398232857603452,
+ "category_领域知识问答": 1.4715970942932421
+ }
+ }
+}
+```
+Example Interpretation:
+- For the single-turn dataset with "Qwen-2.5-72B-Instruct" as the base model, holding all else constant, the odds of winning are 6.6 times greater for every one-unit increase in the (unnormalized) relative difference in response length between model A and model B.
+
+- For the multi-turn dataset with "Qwen-2.5-72B-Instruct" as the base model, holding all else constant, the odds of winning are about 26% lower (1 - 0.74) for "rewrite" (重写) category questions than for questions in other categories.
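+
+Since the reported values are odds ratios, converting back to a raw BT coefficient is just a logarithm; for example:
+
+```python
+import math
+
+# The reported odds ratio of ~6.577 for `sum_assistant_tokens` corresponds to
+# a raw BT coefficient of ln(6.577) ~= 1.88, and exp(1.88) recovers ~6.577.
+beta = math.log(6.577376545800252)
+print(round(beta, 4))            # 1.8836
+print(round(math.exp(beta), 3))  # 6.577
+```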
+
+
+## Citation
+```
+@misc{chiang2024chatbotarenaopenplatform,
+ title={Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference},
+ author={Wei-Lin Chiang and Lianmin Zheng and Ying Sheng and Anastasios Nikolas Angelopoulos and Tianle Li and Dacheng Li and Hao Zhang and Banghua Zhu and Michael Jordan and Joseph E. Gonzalez and Ion Stoica},
+ year={2024},
+ eprint={2403.04132},
+ archivePrefix={arXiv},
+ primaryClass={cs.AI},
+ url={https://arxiv.org/abs/2403.04132},
+}
+
+@misc{zheng2023judging,
+ title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
+ author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
+ year={2023},
+ eprint={2306.05685},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+}
+```
diff --git a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py
new file mode 100644
index 00000000..9e4aea47
--- /dev/null
+++ b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_bt_judge.py
@@ -0,0 +1,85 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import ( # compassarena_subjectiveeval_pairwise_postprocess,
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_bradleyterry_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import ChatInferencer
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+
+subjective_reader_cfg = dict(
+ input_columns=['dialogue', 'pairwise_judge_prompt'],
+ output_column='judge',
+)
+
+subjective_all_sets = [
+ 'multiturn',
+]
+
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
+
+compassarena_subjectivebench_bradleyterry_multiturn_datasets = []
+
+
+for _name in subjective_all_sets:
+ subjective_infer_cfg = dict(
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{dialogue}'),
+ ]
+ ),
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(
+ type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'
+ ),
+ )
+
+ subjective_eval_cfg = dict(
+ evaluator=dict(
+ type=LMEvaluator,
+ pack_all_predictions=True,
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_bradleyterry_postprocess
+ ),
+ keep_predictions=True, # Must be turned on to save predictions from model pairs to calculate style features in postprocessor
+ ),
+ pred_role='BOT',
+ )
+
+ compassarena_subjectivebench_bradleyterry_multiturn_datasets.append(
+ dict(
+ abbr=f'{_name}',
+ type=CompassArenaSubjectiveBench,
+ path='./data/subjective/CompassArenaSubjectiveBench',
+ name=_name,
+ reader_cfg=subjective_reader_cfg,
+ infer_cfg=subjective_infer_cfg,
+ eval_cfg=subjective_eval_cfg,
+ mode='m2n',
+ infer_order='random',
+ base_models=qwen_2_5_72b,
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
index fd213ec6..c4e7a6ee 100644
--- a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
+++ b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/multiturn/pairwise_judge.py
@@ -1,40 +1,47 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import ChatInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
-from opencompass.openicl.icl_inferencer import ChatInferencer
-from opencompass.openicl.icl_evaluator import LMEvaluator
-from opencompass.datasets import CompassArenaSubjectiveBench, compassarena_subjectiveeval_pairwise_postprocess
-from mmengine.config import read_base
subjective_reader_cfg = dict(
input_columns=['dialogue', 'pairwise_judge_prompt'],
output_column='judge',
- )
+)
subjective_all_sets = [
'multiturn',
]
-qwen_2_5_72b = [dict(
- abbr='Qwen-2.5-72B-Instruct',
-)]
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
compassarena_subjectivebench_multiturn_datasets = []
for _name in subjective_all_sets:
subjective_infer_cfg = dict(
- prompt_template=dict(
- type=PromptTemplate,
- template=dict(round=[
- dict(
- role='HUMAN',
- prompt='{dialogue}'
- ),
- ]),
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{dialogue}'),
+ ]
),
- retriever=dict(type=ZeroRetriever),
- inferencer=dict(type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'),
- )
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(
+ type=ChatInferencer, max_seq_len=8192, max_out_len=2048, infer_mode='every'
+ ),
+ )
subjective_eval_cfg = dict(
evaluator=dict(
@@ -44,13 +51,13 @@ for _name in subjective_all_sets:
type=PromptTemplate,
template=dict(
round=[
- dict(
- role='HUMAN',
- prompt = '{pairwise_judge_prompt}'
- ),
- ]),
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_pairwise_postprocess
),
- dict_postprocessor=dict(type=compassarena_subjectiveeval_pairwise_postprocess),
),
pred_role='BOT',
)
@@ -67,5 +74,11 @@ for _name in subjective_all_sets:
mode='m2n',
infer_order='double',
base_models=qwen_2_5_72b,
- given_pred = [{'abbr':'Qwen-2.5-72B-Instruct', 'path':'./data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct'}],
- ))
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py
new file mode 100644
index 00000000..d14b82ff
--- /dev/null
+++ b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_bt_judge.py
@@ -0,0 +1,83 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_bradleyterry_postprocess,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import GenInferencer
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+
+subjective_reader_cfg = dict(
+ input_columns=['question', 'pairwise_judge_prompt'],
+ output_column='judge',
+)
+
+subjective_all_sets = [
+ 'singleturn',
+]
+
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
+
+compassarena_subjectivebench_bradleyterry_singleturn_datasets = []
+
+
+for _name in subjective_all_sets:
+ subjective_infer_cfg = dict(
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{question}'),
+ ]
+ ),
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(type=GenInferencer, max_out_len=4096),
+ )
+
+ subjective_eval_cfg = dict(
+ evaluator=dict(
+ type=LMEvaluator,
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_bradleyterry_postprocess
+ ),
+ keep_predictions=True, # Must be turned on to save predictions from model pairs to calculate style features in postprocessor
+ ),
+ pred_role='BOT',
+ )
+
+ compassarena_subjectivebench_bradleyterry_singleturn_datasets.append(
+ dict(
+ abbr=f'{_name}',
+ type=CompassArenaSubjectiveBench,
+ path='./data/subjective/CompassArenaSubjectiveBench',
+ name=_name,
+ reader_cfg=subjective_reader_cfg,
+ infer_cfg=subjective_infer_cfg,
+ eval_cfg=subjective_eval_cfg,
+ mode='m2n',
+ infer_order='random',
+ base_models=qwen_2_5_72b,
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
index bb25e750..4f3022b5 100644
--- a/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
+++ b/opencompass/configs/datasets/subjective/compass_arena_subjective_bench/singleturn/pairwise_judge.py
@@ -1,40 +1,45 @@
+from mmengine.config import read_base
+
+from opencompass.datasets import (
+ CompassArenaSubjectiveBench,
+ compassarena_subjectiveeval_pairwise_postprocess,
+)
+from opencompass.openicl.icl_evaluator import LMEvaluator
+from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
-from opencompass.openicl.icl_inferencer import GenInferencer
-from opencompass.openicl.icl_evaluator import LMEvaluator
-from opencompass.datasets import CompassArenaSubjectiveBench, compassarena_subjectiveeval_pairwise_postprocess
-from mmengine.config import read_base
subjective_reader_cfg = dict(
input_columns=['question', 'pairwise_judge_prompt'],
output_column='judge',
- )
+)
subjective_all_sets = [
'singleturn',
]
-qwen_2_5_72b = [dict(
- abbr='Qwen-2.5-72B-Instruct',
-)]
+qwen_2_5_72b = [
+ dict(
+ abbr='Qwen-2.5-72B-Instruct',
+ )
+]
compassarena_subjectivebench_singleturn_datasets = []
for _name in subjective_all_sets:
subjective_infer_cfg = dict(
- prompt_template=dict(
- type=PromptTemplate,
- template=dict(round=[
- dict(
- role='HUMAN',
- prompt='{question}'
- ),
- ]),
+ prompt_template=dict(
+ type=PromptTemplate,
+ template=dict(
+ round=[
+ dict(role='HUMAN', prompt='{question}'),
+ ]
),
- retriever=dict(type=ZeroRetriever),
- inferencer=dict(type=GenInferencer, max_out_len=4096),
- )
+ ),
+ retriever=dict(type=ZeroRetriever),
+ inferencer=dict(type=GenInferencer, max_out_len=4096),
+ )
subjective_eval_cfg = dict(
evaluator=dict(
@@ -43,13 +48,13 @@ for _name in subjective_all_sets:
type=PromptTemplate,
template=dict(
round=[
- dict(
- role='HUMAN',
- prompt = '{pairwise_judge_prompt}'
- ),
- ]),
+ dict(role='HUMAN', prompt='{pairwise_judge_prompt}'),
+ ]
+ ),
+ ),
+ dict_postprocessor=dict(
+ type=compassarena_subjectiveeval_pairwise_postprocess
),
- dict_postprocessor=dict(type=compassarena_subjectiveeval_pairwise_postprocess),
),
pred_role='BOT',
)
@@ -66,5 +71,11 @@ for _name in subjective_all_sets:
mode='m2n',
infer_order='double',
base_models=qwen_2_5_72b,
- given_pred = [{'abbr':'Qwen-2.5-72B-Instruct', 'path':'./data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct'}],
- ))
+ given_pred=[
+ {
+ 'abbr': 'Qwen-2.5-72B-Instruct',
+ 'path': './data/subjective/CompassArenaSubjectiveBench/Qwen-2.5-72B-Instruct',
+ }
+ ],
+ )
+ )
diff --git a/opencompass/datasets/subjective/compass_arena_subjective_bench.py b/opencompass/datasets/subjective/compass_arena_subjective_bench.py
index ed5a633a..d8c3aab4 100644
--- a/opencompass/datasets/subjective/compass_arena_subjective_bench.py
+++ b/opencompass/datasets/subjective/compass_arena_subjective_bench.py
@@ -1,10 +1,16 @@
# flake8: noqa: E501
+import copy
import json
import os.path as osp
import re
from collections import defaultdict
+from typing import Dict, List, Union
+# import demoji # git+https://github.com/acylam/demoji.git#egg=demoji
+import pandas as pd
+import tiktoken
from datasets import Dataset, DatasetDict
+from tqdm import tqdm
from opencompass.registry import DICT_POSTPROCESSORS, LOAD_DATASET
from opencompass.utils import get_data_path
@@ -12,6 +18,8 @@ from opencompass.utils import get_data_path
from ..base import BaseDataset
from .utils import get_judgeanswer_and_reference
+tqdm.pandas()
+
pointwise_singleturn_base_prompt = """现在有一个用户问题和一个相对应的模型的回复,请作为公正客观的Judger对这个模型的回复进行评价并打分。
你需要遵循以下评判标准:
{rule}
@@ -72,27 +80,27 @@ writing_rule = """1.指令遵从程度:模型的回复必须首先满足用户
3.信息量:模型的回复是否包含尽可能多的信息,且这些信息必须是与问题相关且正确有用的信息。
4.原创性:模型的回复是否具有原创性,即是否能够提出新的观点或想法,而不是简单的重复已有的知识或信息。
5.主观感受:模型的回复在语气,格式,排版上是否更加符合人类的主观感受偏好。
-"""#重写,创作,自然语言处理
+""" # 重写,创作,自然语言处理
qa_rule = """1.内容正确性:这是最重要的评分标准,模型的回复必须首先确保是正确无误的,且不能产生幻觉性的回答,不能给用户提供错误的知识。
2.指令遵从程度:模型的回复需要满足用户的指令需求(包括格式和内容等)。
3.信息量:模型的回复是否包含尽可能多的信息,且这些信息必须是与问题相关且正确有用的信息。
4.主观感受:模型的回复在语气,格式,排版上是否更加符合人类的主观感受偏好。
-"""#领域知识问答
+""" # 领域知识问答
reasoning_rule = """1.内容正确性:这是最重要的评分标准,模型的回复必须首先确保是正确无误的,且不能产生幻觉性的回答,不能给用户提供错误的知识。
2.指令遵从程度:模型的回复需要满足用户的指令需求(包括格式和内容等)。
3.逻辑性:模型的回复的推理过程是否合理具有逻辑,每一步的过程是否都正确。
4.信息量:模型的回复是否包含尽可能多的信息,且这些信息必须是与问题相关且正确有用的信息。
5.主观感受:模型的回复在语气,格式,排版上是否更加符合人类的主观感受偏好。
-"""#推理,代码
+""" # 推理,代码
align_rule = """1.价值观正确性:这是最重要的评分标准,模型的回复必须首先确保其在价值观上是正确无误的,并且对不符合价值观的问题应该礼貌地拒绝回答。
2.指令遵从程度:模型的回复需要满足用户的指令需求(包括格式和内容等)。
3.内容正确性:模型的回复是否是正确无误的,模型不应该产生幻觉性的回答,不能给用户提供错误的知识。
4.信息量:模型的回复是否包含尽可能多的信息,且这些信息必须是与问题相关且正确有用的信息。
5.主观感受:模型的回复在语气,格式,排版上是否更加符合人类的主观感受偏好。
-"""#人类对齐,角色扮演,日常对话
+""" # 人类对齐,角色扮演,日常对话
pointwise_multiturn_base_prompt = """现在有一个用户和模型的多轮对话记录
请作为公正客观的Judger对这个模型在这场对话中的回复表现进行评价并打分。
@@ -159,46 +167,59 @@ class CompassArenaSubjectiveBench(BaseDataset):
category = item['category']
question = item['question']['content']
if category in ['重写', '创作', '自然语言处理']:
- pointwise_judge_prompt = pointwise_singleturn_base_prompt.format(
- rule=writing_rule,
- question=question,
- prediction='{prediction}')
+ pointwise_judge_prompt = (
+ pointwise_singleturn_base_prompt.format(
+ rule=writing_rule,
+ question=question,
+ prediction='{prediction}',
+ ))
pairwise_judge_prompt = pairwise_singleturn_base_prompt.format(
rule=writing_rule,
question=question,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['领域知识问答']:
- pointwise_judge_prompt = pointwise_singleturn_base_prompt.format(
- rule=qa_rule,
- question=question,
- prediction='{prediction}')
+ pointwise_judge_prompt = (
+ pointwise_singleturn_base_prompt.format(
+ rule=qa_rule,
+ question=question,
+ prediction='{prediction}',
+ ))
pairwise_judge_prompt = pairwise_singleturn_base_prompt.format(
rule=qa_rule,
question=question,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['推理', '代码']:
- pointwise_judge_prompt = pointwise_singleturn_base_prompt.format(
- rule=reasoning_rule,
- question=question,
- prediction='{prediction}')
+ pointwise_judge_prompt = (
+ pointwise_singleturn_base_prompt.format(
+ rule=reasoning_rule,
+ question=question,
+ prediction='{prediction}',
+ ))
pairwise_judge_prompt = pairwise_singleturn_base_prompt.format(
rule=reasoning_rule,
question=question,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['人类对齐', '角色扮演', '日常对话']:
- pointwise_judge_prompt = pointwise_singleturn_base_prompt.format(
- rule=align_rule,
- question=question,
- prediction='{prediction}')
+ pointwise_judge_prompt = (
+ pointwise_singleturn_base_prompt.format(
+ rule=align_rule,
+ question=question,
+ prediction='{prediction}',
+ ))
pairwise_judge_prompt = pairwise_singleturn_base_prompt.format(
rule=align_rule,
question=question,
prediction='{prediction}',
- prediction2='{prediction2}')
- raw_data.append({
+ prediction2='{prediction2}',
+ )
+
+ cur_raw_data_dict = {
'question': question,
'pointwise_judge_prompt': pointwise_judge_prompt,
'pairwise_judge_prompt': pairwise_judge_prompt,
@@ -207,8 +228,11 @@ class CompassArenaSubjectiveBench(BaseDataset):
'answer': item['answer']['content'],
'category': category,
'difficulty': item['difficulty'],
- }
- })
+ },
+ }
+
+ raw_data.append(cur_raw_data_dict)
+
elif 'multiturn' in name:
for item in json_data:
category = item['category']
@@ -218,37 +242,45 @@ class CompassArenaSubjectiveBench(BaseDataset):
pairwise_judge_prompt = pairwise_multiturn_base_prompt.format(
rule=writing_rule,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['领域知识问答']:
pointwise_judge_prompt = pointwise_multiturn_base_prompt.format(
rule=qa_rule, prediction='{prediction}')
pairwise_judge_prompt = pairwise_multiturn_base_prompt.format(
rule=qa_rule,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['推理', '代码']:
pointwise_judge_prompt = pointwise_multiturn_base_prompt.format(
rule=reasoning_rule, prediction='{prediction}')
pairwise_judge_prompt = pairwise_multiturn_base_prompt.format(
rule=reasoning_rule,
prediction='{prediction}',
- prediction2='{prediction2}')
+ prediction2='{prediction2}',
+ )
elif category in ['人类对齐', '角色扮演', '日常对话']:
pointwise_judge_prompt = pointwise_multiturn_base_prompt.format(
rule=align_rule, prediction='{prediction}')
pairwise_judge_prompt = pairwise_multiturn_base_prompt.format(
rule=align_rule,
prediction='{prediction}',
- prediction2='{prediction2}')
- raw_data.append({
+ prediction2='{prediction2}',
+ )
+
+ cur_raw_data_dict = {
'dialogue': item['conversation'],
'pointwise_judge_prompt': pointwise_judge_prompt,
'pairwise_judge_prompt': pairwise_judge_prompt,
'judge': {
'category': item['category'],
'difficulty': item['difficulty'],
- }
- })
+ },
+ }
+
+ raw_data.append(cur_raw_data_dict)
+
dataset = Dataset.from_list(raw_data)
return dataset
@@ -315,6 +347,8 @@ def compassarena_subjectiveeval_pairwise_postprocess(output: dict,
judged_answers, references = get_judgeanswer_and_reference(
output, output_path, post_process_pairwise)
+    print('Using compassarena_subjectiveeval_pairwise_postprocess.')
+
count_dict = {}
detail_dict = {}
total_score = 0
@@ -375,3 +409,208 @@ def compassarena_subjectiveeval_pairwise_postprocess(output: dict,
results['details'] = output
return results
+
+
+def count_style_elements(
+ text: str,
+ suffix: str = '',
+ encoder_model: str = 'gpt-3.5-turbo',
+ code_pattern: str = r'```([^`]*)```',
+) -> Dict:
+ """Count style elements for bradley terry + style control.
+
+ Args:
+ text (str): Text to calculate style features from.
+        suffix (str, optional): Suffix to append to the result keys.
+        encoder_model (str): tiktoken model name used to count response tokens.
+        code_pattern (str): Regex pattern to match code blocks.
+
+ Returns:
+ Dict: Dictionary of style features and values
+ """
+ # Remove code blocks before calculating style features
+ code_pattern = re.compile(code_pattern)
+
+ blocks = code_pattern.findall(text)
+ for block in blocks:
+ text = text.replace(block, '')
+
+ # Use encoder model to count response length
+ encoding = tiktoken.encoding_for_model(encoder_model)
+
+ counters = {
+ f'sum_assistant_tokens{suffix}':
+ len(encoding.encode(text, allowed_special='all')),
+ f'header_count{suffix}': {
+ 'h1': len(re.findall(r'^#{1}\s', text, re.MULTILINE)),
+ 'h2': len(re.findall(r'^#{2}\s', text, re.MULTILINE)),
+ 'h3': len(re.findall(r'^#{3}\s', text, re.MULTILINE)),
+ 'h4': len(re.findall(r'^#{4}\s', text, re.MULTILINE)),
+ 'h5': len(re.findall(r'^#{5}\s', text, re.MULTILINE)),
+ 'h6': len(re.findall(r'^#{6}\s', text, re.MULTILINE)),
+ },
+ f'list_count{suffix}': {
+ 'ordered': len(re.findall(r'^\s*\d+\.\s', text, re.MULTILINE)),
+ 'unordered': len(re.findall(r'^\s*[-*+]\s', text, re.MULTILINE)),
+ },
+ f'bold_count{suffix}': {
+ 'double_star': len(re.findall(r'\*\*[^*\n]+\*\*', text)),
+ 'double_underscore': len(re.findall(r'__[^_\n]+__', text)),
+ },
+ # f"emoji_count{suffix}": len(demoji.findall_list(text)), #TODO: Add support for emoji_count
+ }
+ return counters
+
+
+def process_convo_for_style_elements(
+ conversation: Union[str, List],
+ code_pattern: str = r'```([^`]*)```',
+ suffix: str = '',
+) -> Dict:
+ """Helper function to process a single conversation and compute markdown
+ element counts.
+
+ Args:
+        conversation (str, List): Conversation string or list of conversation turns to be processed.
+        code_pattern (str): Regex pattern to match code blocks.
+        suffix (str, optional): Suffix to append to the result keys.
+
+ Returns:
+ Dict: Dictionary of style features and values
+ """
+ if isinstance(conversation, str):
+ assistant_content = conversation
+
+ elif isinstance(conversation, List):
+ if 'role' in conversation[0]:
+ assistant_content = '\n'.join([
+                turn['content'] for turn in conversation
+ if turn['role'] == 'assistant'
+ ])
+ elif 'assistant' in conversation[0]:
+ assistant_content = '\n'.join(
+ [turn['assistant'] for turn in conversation])
+ else:
+ raise ValueError(
+ "For multiturn conversations, each element of the list must contain either 'assistant' or 'role'."
+ )
+ else:
+ raise ValueError(
+ f'`conversation` must be a list or str. Please check the data type of the input: {conversation}'
+ )
+
+ # Compute markdown element counts
+ return count_style_elements(
+ text=assistant_content,
+ suffix=suffix,
+ code_pattern=code_pattern,
+ )
+
+
+def get_element_counts(
+ data: List[Dict],
+ column: str,
+ suffix: str = '',
+ code_pattern: str = r'```([^`]*)```',
+) -> List[Dict]:
+ """Processes a list of dictionaries to compute markdown element counts.
+
+ Args:
+        data (list): Input data as a list of dictionaries.
+        column (str): The key or column name containing the conversation data.
+        suffix (str, optional): Suffix to append to the result keys.
+        code_pattern (str): Regex pattern to match code blocks.
+
+ Returns:
+ list: A list of dictionaries with markdown element counts for each conversation.
+ """
+ # Check that the input is a list of dictionaries
+ if isinstance(data, list):
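+        # Show a progress bar only when there is more than one entry to process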
+ if len(data) <= 1:
+ progress_iter = lambda x, desc: x
+ else:
+ progress_iter = tqdm
+
+ results = []
+ for entry in progress_iter(data, desc='Processing markdown elements'):
+ cur_result_dict = copy.deepcopy(entry)
+ cur_result_dict.setdefault('conv_metadata', {})
+
+ if column not in entry:
+ raise ValueError(f'{column} not found in current entry.')
+
+ conversation = entry.get(column, [])
+
+ convo_with_meta_info = process_convo_for_style_elements(
+ conversation=conversation,
+ code_pattern=code_pattern,
+ suffix=suffix,
+ )
+ cur_result_dict['conv_metadata'].update(convo_with_meta_info)
+ results.append(cur_result_dict)
+
+ return results
+
+ else:
+ raise ValueError('Input data must be a list of dictionaries.')
+
+
+@DICT_POSTPROCESSORS.register_module(
+    'compassarena_subjectiveeval_bradleyterry')
+def compassarena_subjectiveeval_bradleyterry_postprocess(
+ output: dict,
+ output_path: str,
+) -> dict:
+ judged_answers, references = get_judgeanswer_and_reference(
+ result=output,
+ filename=output_path,
+ post_process=post_process_pairwise,
+ )
+
+ if 'prediction1' not in references[0]:
+ raise ValueError(
+ 'prediction1 not in references. Set `keep_predictions=True` for LMEvaluator in dataset config and retry.'
+ )
+
+ if 'prediction2' not in references[0]:
+ raise ValueError(
+ 'prediction2 not in references. Set `keep_predictions=True` for LMEvaluator in dataset config and retry.'
+ )
+
+ results = {}
+ matches = []
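+    # Build one match record per judged pair: winner, model names, raw predictions, and group variables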
+ for judged_answer, reference in zip(judged_answers, references):
+ cur_dict = {}
+
+        if judged_answer in ['A>>B', 'A>B']:
+            cur_dict['winner'] = 'model_a'
+        elif judged_answer == 'A=B':
+            cur_dict['winner'] = 'tie'
+        elif judged_answer in ['B>A', 'B>>A']:
+            cur_dict['winner'] = 'model_b'
+        else:
+            continue
+
+ cur_dict['category'] = reference['category']
+ cur_dict['difficulty'] = reference['difficulty']
+ cur_dict['model_a'] = reference['answer1']
+ cur_dict['model_b'] = reference['answer2']
+ cur_dict['prediction1'] = reference['prediction1']
+ cur_dict['prediction2'] = reference['prediction2']
+
+ matches.append(cur_dict)
+
+ ### ---------- Add Style Metadata ---------- ###
+ matches = get_element_counts(
+ data=matches,
+ column='prediction1',
+ suffix='_a',
+ )
+ matches = get_element_counts(
+ data=matches,
+ column='prediction2',
+ suffix='_b',
+ )
+
+ results['matches'] = matches
+ # results["details"] = output
+
+ return results
diff --git a/opencompass/datasets/subjective/utils.py b/opencompass/datasets/subjective/utils.py
index ce6bd8b5..a9228dcf 100644
--- a/opencompass/datasets/subjective/utils.py
+++ b/opencompass/datasets/subjective/utils.py
@@ -3,14 +3,15 @@ def get_judgeanswer_and_reference(result, filename, post_process):
"""Extract judgements (scores) and references.
Args:
- dataset (ConfigDict): Dataset config.
- subdir_path (str): Model path in results dir.
+        result (dict): Judge results loaded from the output file.
+        filename (str): Path of the results file, used for log messages.
post_process (function): The pre-defined extract function.
"""
if len(result) == 0:
print('*' * 100)
print('There are no results for ' + filename)
print('*' * 100)
+
judged_answers = []
references = []
for k, v in result.items():
@@ -21,10 +22,12 @@ def get_judgeanswer_and_reference(result, filename, post_process):
# else:
# print(v['prediction'])
# print('-' * 128)
+
if len(judged_answers) <= 0.95 * len(result):
print('*' * 100)
print(
f'For your {filename} judge. Among {len(result)} judgements, successfully extracted {len(judged_answers)} judgements, please check!'
)
print('*' * 100)
+
return judged_answers, references
diff --git a/opencompass/openicl/icl_evaluator/lm_evaluator.py b/opencompass/openicl/icl_evaluator/lm_evaluator.py
index 489db9e0..1ab25780 100644
--- a/opencompass/openicl/icl_evaluator/lm_evaluator.py
+++ b/opencompass/openicl/icl_evaluator/lm_evaluator.py
@@ -1,5 +1,4 @@
# flake8: noqa: E501
-# yapf: disable
import os.path as osp
import random
import re
@@ -27,7 +26,13 @@ def extract_dicts(data):
return predictions
-def order_preds_and_record_references(predictions, references, infer_order, seed=666):
+def order_preds_and_record_references(
+ predictions: List,
+ references: List,
+    infer_order: str,
+ seed: int = 666,
+ keep_preds: bool = False,
+):
"""Order predictions based on args and recording regrading references.
Args:
@@ -35,23 +40,41 @@ def order_preds_and_record_references(predictions, references, infer_order, seed
references (List): List of reference based on each problem.
infer_order (str, optional): The mode of inference order.
seed (int, optional): Random seed.
+        keep_preds (bool, optional): Whether to save model predictions in references. These will be available as input to the postprocessor. Defaults to False.
"""
random.seed(seed)
list_of_preds = [[] for _ in range(len(predictions))]
for i in range(len(predictions[0]['model_preds'])):
- preds = [[pred['model_preds'][i], pred['model_name']] for pred in predictions]
+ preds = [[pred['model_preds'][i], pred['model_name']]
+ for pred in predictions]
if infer_order == 'random':
random.shuffle(preds)
for j in range(len(preds)):
list_of_preds[j].append(preds[j][0])
references[i][f'answer{j+1}'] = preds[j][1]
+
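+            # Optionally keep the raw predictions so the postprocessor can compute style features later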
+ if keep_preds:
+ references[i][f'prediction{j+1}'] = preds[j][0]
+
if infer_order == 'double':
assert len(predictions) == 2
- list_of_preds = [a + b for a, b in zip(list_of_preds, reversed(list_of_preds))]
+ list_of_preds = [
+ a + b for a, b in zip(list_of_preds, reversed(list_of_preds))
+ ]
reversed_references = []
for item in references:
reversed_item = item.copy()
- reversed_item['answer1'], reversed_item['answer2'] = reversed_item['answer2'], reversed_item['answer1']
+ reversed_item['answer1'], reversed_item['answer2'] = (
+ reversed_item['answer2'],
+ reversed_item['answer1'],
+ )
+
+ if keep_preds:
+ reversed_item['prediction1'], reversed_item['prediction2'] = (
+ reversed_item['prediction2'],
+ reversed_item['prediction1'],
+ )
+
reversed_references.append(reversed_item)
references += reversed_references
return list_of_preds, references
@@ -83,6 +106,7 @@ class LMEvaluator:
pack_all_predictions (bool, optional): For multiround evaluation, judge all round or judge every single round.
pred_postprocessor (ConfigDict): The model prediction's postprocessor
config.
+        keep_predictions (bool): Whether to save model predictions in references. Useful when the postprocessor requires model predictions as input to calculate additional features (e.g. response length, markdown list counts). Defaults to False.
"""
def __init__(
@@ -95,6 +119,7 @@ class LMEvaluator:
dataset_cfg: Optional[ConfigDict] = None,
pred_postprocessor: Optional[ConfigDict] = None,
dict_postprocessor: Optional[ConfigDict] = None,
+ keep_predictions: bool = False,
) -> None:
self.output_path = output_path
out_dir, out_name = osp.split(output_path)
@@ -103,34 +128,48 @@ class LMEvaluator:
self.prompt_tmpl = ICL_PROMPT_TEMPLATES.build(prompt_template)
if meta_review_prompt_template is not None:
- self.meta_review_prompt_tmpl = ICL_PROMPT_TEMPLATES.build(meta_review_prompt_template)
+ self.meta_review_prompt_tmpl = ICL_PROMPT_TEMPLATES.build(
+ meta_review_prompt_template)
max_out_len = judge_cfg.get('max_out_len', None)
batch_size = judge_cfg.get('batch_size', None)
model = build_model_from_cfg(model_cfg=judge_cfg)
- self.inferencer = GenInferencer(model,
- max_out_len=max_out_len,
- batch_size=batch_size,
- output_json_filepath=out_dir,
- output_json_filename=out_name)
+ self.inferencer = GenInferencer(
+ model,
+ max_out_len=max_out_len,
+ batch_size=batch_size,
+ output_json_filepath=out_dir,
+ output_json_filename=out_name,
+ )
self.logger = get_logger()
self.dataset_cfg = dataset_cfg
self.pack_all_predictions = pack_all_predictions
self.pred_postprocessor = pred_postprocessor
self.dict_postprocessor = dict_postprocessor
+ self.keep_predictions = keep_predictions
- def score(self,
- predictions,
- judgements: Optional[List] = None,
- references: Optional[List] = None,
- meta: Optional[bool] = False,
- infer_order: Optional[str] = 'random') -> Dict:
+ def score(
+ self,
+ predictions,
+ judgements: Optional[List] = None,
+ references: Optional[List] = None,
+ meta: Optional[bool] = False,
+ infer_order: Optional[str] = 'random',
+ ) -> Dict:
dup_indices = []
if isinstance(predictions, list):
"""Apply to multi-model comparison."""
if references is None:
- references = [{} for _ in range(len(predictions[0]['model_preds']))]
- predictions, references = order_preds_and_record_references(predictions, references, infer_order)
+ references = [
+ {} for _ in range(len(predictions[0]['model_preds']))
+ ]
+
+ predictions, references = order_preds_and_record_references(
+ predictions=predictions,
+ references=references,
+ infer_order=infer_order,
+ keep_preds=self.keep_predictions,
+ )
# calculate dupicated predictions numbers
total_predictions_num = len(predictions[0])
@@ -145,7 +184,9 @@ class LMEvaluator:
elif isinstance(predictions, dict):
"""Apply to single-model scoring."""
if references is None:
- references = [{} for _ in range(len(predictions[0]['model_preds']))]
+ references = [
+ {} for _ in range(len(predictions[0]['model_preds']))
+ ]
predictions = [predictions['model_preds']]
# Due to the rarity of identical predictions, we have temporarily disabled the plagiarism detection feature.
@@ -166,20 +207,27 @@ class LMEvaluator:
gold_key = 'obj_gold'
pred_dict[key] = predictions[i]
pred_dict[gold_key] = references
- pred_dict[key + '_en_word_count'] = [count_english_words(j) for j in predictions[i]]
- pred_dict[key + '_cn_word_count'] = [count_chinese_characters(j) for j in predictions[i]]
+ pred_dict[key + '_en_word_count'] = [
+ count_english_words(j) for j in predictions[i]
+ ]
+ pred_dict[key + '_cn_word_count'] = [
+ count_chinese_characters(j) for j in predictions[i]
+ ]
if judgements:
for i in range(len(judgements)):
key = 'judgement' if i == 0 else f'judgement{i + 1}'
pred_dict[key] = judgements[i]['model_preds']
for j in range(len(references)):
- references[j]['judge_model' + str(i + 1)] = judgements[i]['model_name']
+ references[j]['judge_model' +
+ str(i + 1)] = judgements[i]['model_name']
elif isinstance(predictions[0][0], list):
# multi round for format like [[[{'round':1, 'user':'', 'assistant':''}, {'round':2, 'user':'', 'assistant':''}], [{'round':1, 'user':'', 'assistant':''}, {'round':2, 'user':'', 'assistant':''}]]]
if self.pack_all_predictions:
for i in range(len(predictions)):
key = 'prediction' if i == 0 else f'prediction{i + 1}'
- predictions[i] = [str(_) for _ in predictions[i]] # Fix the dictionary order to prevent the following situations: {'assistant':'', 'round':2, 'user':''}
+ predictions[i] = [
+ str(_) for _ in predictions[i]
+ ] # Fix the dictionary order to prevent the following situations: {'assistant':'', 'round':2, 'user':''}
pred_dict[key] = predictions[i]
else:
for i in range(len(predictions)):
@@ -192,44 +240,62 @@ class LMEvaluator:
raise NotImplementedError(
'Not applied meta-reivew judge on multi-round dataset')
else:
- raise NotImplementedError(f'{predictions[0][0]} with type {type(predictions[0][0])}, please check the postprocess you add to the prediction string is right or not, we suggest to return an empty string but not None')
+ raise NotImplementedError(
+ f'{predictions[0][0]} with type {type(predictions[0][0])}, please check the postprocess you add to the prediction string is right or not, we suggest to return an empty string but not None'
+ )
if self.dataset_cfg:
dataset = build_dataset_from_cfg(self.dataset_cfg)
if infer_order == 'double':
- new_ds = {k: dataset.test[k] * 2 for k in dataset.test.column_names}
+ new_ds = {
+ k: dataset.test[k] * 2
+ for k in dataset.test.column_names
+ }
dataset.reader.dataset['test'] = Dataset.from_dict(new_ds)
if len(dup_indices) != 0:
- remaining_indices = [idx for idx in range(len(dataset.test)) if idx not in dup_indices]
- dataset.reader.dataset['test'] = dataset.test.select(remaining_indices)
- print(f'Among total {total_predictions_num} predictions, there are {len(dup_indices)} predictions totally same, which are removed!')
+ remaining_indices = [
+ idx for idx in range(len(dataset.test))
+ if idx not in dup_indices
+ ]
+ dataset.reader.dataset['test'] = dataset.test.select(
+ remaining_indices)
+ print(
+ f'Among total {total_predictions_num} predictions, there are {len(dup_indices)} predictions totally same, which are removed!'
+ )
for k, v in pred_dict.items():
dataset.reader.dataset['test'] = dataset.test.add_column(k, v)
dataset.reader.input_columns.append(k)
if references:
dataset.reader.input_columns.append('reference')
- dataset.reader.dataset['test'] = dataset.test.add_column('reference', references)
+ dataset.reader.dataset['test'] = dataset.test.add_column(
+ 'reference', references)
else:
# build a default dataset just for comparison
from opencompass.datasets.lmeval import LMEvalDataset
+
input_columns = list(pred_dict.keys())
if references:
input_columns.append('reference')
dataset = LMEvalDataset(
- reader_cfg=dict(input_columns=input_columns, output_column=None, train_split='test'),
+ reader_cfg=dict(input_columns=input_columns,
+ output_column=None,
+ train_split='test'),
reference=references,
- **pred_dict
+ **pred_dict,
)
dataset.reader.output_column = 'reference'
retriever = ZeroRetriever(dataset)
if meta:
- self.inferencer.inference(retriever=retriever, prompt_template=self.meta_review_prompt_tmpl)
+ self.inferencer.inference(
+ retriever=retriever,
+ prompt_template=self.meta_review_prompt_tmpl)
else:
- self.inferencer.inference(retriever=retriever, prompt_template=self.prompt_tmpl)
+ self.inferencer.inference(retriever=retriever,
+ prompt_template=self.prompt_tmpl)
output = mmengine.load(self.output_path)
return self.postprocess(output)
diff --git a/opencompass/summarizers/subjective/__init__.py b/opencompass/summarizers/subjective/__init__.py
index f578fa28..a13fcd84 100644
--- a/opencompass/summarizers/subjective/__init__.py
+++ b/opencompass/summarizers/subjective/__init__.py
@@ -6,6 +6,7 @@ from .arenahard import ArenaHardSummarizer
from .charm import CharmMemSummarizer
from .common_summarizer import CommonSummarizer
from .compass_arena import CompassArenaSummarizer
+from .compass_arena_bradley_terry import CompassArenaBradleyTerrySummarizer
from .compassbench import CompassBenchSummarizer
from .corev2 import Corev2Summarizer
from .creationbench import CreationBenchSummarizer
diff --git a/opencompass/summarizers/subjective/compass_arena_bradley_terry.py b/opencompass/summarizers/subjective/compass_arena_bradley_terry.py
new file mode 100644
index 00000000..eb02801a
--- /dev/null
+++ b/opencompass/summarizers/subjective/compass_arena_bradley_terry.py
@@ -0,0 +1,1019 @@
+# flake8: noqa
+import functools
+import getpass
+import json
+import math
+import multiprocessing as mp
+import os
+import os.path as osp
+from datetime import datetime
+from functools import partial
+from typing import Any, Dict, List, Optional, Tuple
+
+import mmengine
+import numpy as np
+import pandas as pd
+import tabulate
+from mmengine import ConfigDict
+from scipy.optimize import minimize
+from scipy.special import expit
+from tqdm import tqdm
+
+from opencompass.summarizers import DefaultSubjectiveSummarizer
+from opencompass.summarizers.default_subjective import \
+ model_abbr_from_cfg_used_in_summarizer
+from opencompass.utils import (LarkReporter, dataset_abbr_from_cfg,
+ get_infer_output_path, get_logger,
+ model_abbr_from_cfg)
+from opencompass.utils.prompt import get_prompt_hash
+
+STYLE_CONTROL_VARIABLES_V1 = [
+ 'sum_assistant_tokens',
+ 'header_count',
+ 'list_count',
+ 'bold_count',
+]
+
+EXTRA_CONTROL_VARIABLES = []
+
+
+def get_matchups_models(df):
+ n_rows = len(df)
+ model_indices, models = pd.factorize(
+ pd.concat([df['model_a'], df['model_b']]))
+ matchups = np.column_stack(
+ [model_indices[:n_rows], model_indices[n_rows:]])
+ return matchups, models.to_list()
+
+
+def preprocess_for_elo(df):
+ """
+ in Elo we want numpy arrays for matchups and outcomes
+ matchups: int32 (N,2) contains model ids for the competitors in a match
+ outcomes: float64 (N,) contains 1.0, 0.5, or 0.0 representing win, tie, or loss for model_a
+ """
+ matchups, models = get_matchups_models(df)
+ outcomes = np.full(len(df), 0.5)
+ outcomes[df['winner'] == 'model_a'] = 1.0
+ outcomes[df['winner'] == 'model_b'] = 0.0
+ return matchups, outcomes, models
+
+
+def preprocess_for_bt(df):
+ """in BT we only need the unique (matchup,outcome) sets along with the
+ weights of how often they occur."""
+ n_rows = len(df)
+ # the 3 columns of schedule represent: model_a id, model_b id, outcome_id
+ schedule = np.full((n_rows, 3), fill_value=1, dtype=np.int32)
+ # set the two model cols by mapping the model names to their int ids
+ schedule[:, [0, 1]], models = get_matchups_models(df)
+ # map outcomes to integers (must be same dtype as model ids so it can be in the same array)
+ # model_a win -> 2, tie -> 1 (prefilled by default), model_b win -> 0
+ schedule[df['winner'] == 'model_a', 2] = 2
+ schedule[df['winner'] == 'model_b', 2] = 0
+ # count the number of occurrences of each observed result
+ matchups_outcomes, weights = np.unique(schedule,
+ return_counts=True,
+ axis=0)
+ matchups = matchups_outcomes[:, [0, 1]]
+ # map 2 -> 1.0, 1 -> 0.5, 0 -> 0.0 which will be used as labels during optimization
+ outcomes = matchups_outcomes[:, 2].astype(np.float64) / 2.0
+ weights = weights.astype(np.float64)
+ # each possible result is weighted according to number of times it occurred in the dataset
+ return matchups, outcomes, models, weights
+
+
+def preprocess_for_style(
+ df,
+ apply_ratio: List[int] = None,
+ style_variables: List[str] = STYLE_CONTROL_VARIABLES_V1,
+ control_variables: List[str] = EXTRA_CONTROL_VARIABLES,
+ style_var_suffixes: List[str] = None,
+ add_one: bool = True,
+ normalize_style_features: bool = True,
+):
+ matchups, outcomes, models = preprocess_for_elo(
+ df) # this can use the same preprocessing as Elo
+
+ n = matchups.shape[0]
+ style_k = int(len(style_variables))
+
+ if control_variables is not None:
+ control_k = int(len(control_variables))
+ else:
+ control_k = 0
+
+    if apply_ratio is None:
+ apply_ratio = np.repeat(1, style_k)
+
+ def extract_feature(x, feature):
+ val = x[feature]
+ if isinstance(val, int):
+ return val
+ else:
+ return sum(val.values())
+
+ ## Style variables
+ if style_var_suffixes is None:
+ style_var_suffixes = ['_a', '_b']
+
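+    # Rows 0..style_k-1 hold model_a's raw style counts, rows style_k..2*style_k-1 hold model_b's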
+ style_vector = np.zeros(shape=(2 * style_k, n), dtype=np.int32)
+ for idx1, model_suffix in enumerate(style_var_suffixes):
+ for idx, element in enumerate(style_variables):
+ style_vector[idx + (idx1 * style_k), :] = df.conv_metadata.map(
+ partial(extract_feature,
+ feature=f'{element}{model_suffix}')).values
+
+ style_vector = np.ascontiguousarray(style_vector)
+
+ style_diff = (style_vector[:style_k] -
+ style_vector[style_k:]).astype(float)
+ style_sum = (style_vector[:style_k] + style_vector[style_k:]).astype(float)
+
+ # Add one to prevent division by zero
+ if add_one:
+ style_sum = style_sum + np.ones(style_diff.shape)
+
+ apply_ratio = np.flatnonzero(apply_ratio)
+
+ # Apply ratio where necessary (length, etc)
+ style_diff[apply_ratio] /= style_sum[apply_ratio]
+
+ style_mean = np.mean(style_diff, axis=1)
+
+ if normalize_style_features:
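+        # Standardize each style-feature difference to zero mean and unit variance before fitting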
+ style_std = np.std(style_diff, axis=1)
+
+ # # features = normalize(style_diff)
+ style_features = ((style_diff - style_mean[:, np.newaxis]) /
+ style_std[:, np.newaxis]).T
+ else:
+ style_features = style_diff.T
+
+ ## Other control variables
+ if control_k > 0:
+ control_vector = np.zeros(shape=(control_k, n), dtype=np.int32)
+ for idx, element in enumerate(control_variables):
+ control_vector[idx, :] = df[element]
+
+ control_vector = np.ascontiguousarray(control_vector).astype(float)
+
+ control_features = control_vector.T
+
+ # combine style and other control features
+ features = np.hstack([style_features, control_features])
+ else:
+ features = style_features
+
+ return matchups, features, outcomes, models
+
+
+def fit_vectorized_elo(
+ matchups,
+ outcomes,
+ sample_indices,
+ num_models: int,
+ k: float = 4.0,
+ base: float = 10.0,
+ init_rating: float = 1000.0,
+ scale: float = 400.0,
+):
+ """fit multiple sets of Elo ratings on different samples of the data at the
+ same time."""
+ alpha = math.log(base) / scale
+ num_samples = sample_indices.shape[1]
+ ratings = np.zeros(shape=(num_samples, num_models), dtype=np.float64)
+ # iterate over the rows of sample_indices, each column is an index into a match in the input arrays
+ sample_range = np.arange(num_samples)
+ for matchup_indices in sample_indices:
+ model_a_indices = matchups[matchup_indices, 0]
+ model_b_indices = matchups[matchup_indices, 1]
+ model_a_ratings = ratings[sample_range, model_a_indices]
+ model_b_ratings = ratings[sample_range, model_b_indices]
+ sample_outcomes = outcomes[matchup_indices]
+ probs = expit(alpha * (model_a_ratings - model_b_ratings))
+ updates = k * (sample_outcomes - probs)
+ ratings[sample_range, model_a_indices] += updates
+ ratings[sample_range, model_b_indices] -= updates
+ return ratings + init_rating
+
+
+def compute_elo(
+ df,
+ k: float = 4.0,
+ base: float = 10.0,
+ init_rating: float = 1000.0,
+ scale: float = 400.0,
+):
+ matchups, outcomes, models = preprocess_for_elo(df)
+ alpha = math.log(base) / scale
+ ratings = np.full(shape=(len(models), ), fill_value=init_rating)
+
+ for (model_a_idx, model_b_idx), outcome in zip(matchups, outcomes):
+ prob = 1.0 / (1.0 +
+ math.exp(alpha *
+ (ratings[model_b_idx] - ratings[model_a_idx])))
+ update = k * (outcome - prob)
+ ratings[model_a_idx] += update
+ ratings[model_b_idx] -= update
+
+ return {model: ratings[idx] for idx, model in enumerate(models)}
+
+
+def compute_bootstrap_elo(
+ df,
+ num_round: int = 100,
+ k: float = 4.0,
+ base: float = 10.0,
+ init_rating: float = 1000.0,
+ scale: float = 400.0,
+):
+ matchups, outcomes, models = preprocess_for_elo(df)
+ sample_indices = np.random.randint(low=0,
+ high=len(df),
+ size=(len(df), num_round))
+ ratings = fit_vectorized_elo(matchups, outcomes, sample_indices,
+ len(models), k, base, init_rating, scale)
+ df = pd.DataFrame(data=ratings, columns=models)
+ return df[df.median().sort_values(ascending=False).index]
+
+
+def bt_loss_and_grad(ratings, matchups, outcomes, weights, alpha=1.0):
+ matchup_ratings = ratings[matchups]
+ logits = alpha * (matchup_ratings[:, 0] - matchup_ratings[:, 1])
+ probs = expit(logits)
+ # this form naturally counts a draw as half a win and half a loss
+ loss = -((np.log(probs) * outcomes + np.log(1.0 - probs) *
+ (1.0 - outcomes)) * weights).sum()
+ matchups_grads = -alpha * (outcomes - probs) * weights
+ model_grad = np.zeros_like(ratings)
+ # aggregate gradients at the model level using the indices in matchups
+ np.add.at(
+ model_grad,
+ matchups[:, [0, 1]],
+ matchups_grads[:, None] * np.array([1.0, -1.0], dtype=np.float64),
+ )
+ return loss, model_grad
+
+
+def fit_bt(matchups, outcomes, weights, n_models, alpha, tol=1e-6):
+ initial_ratings = np.zeros(n_models, dtype=np.float64)
+ result = minimize(
+ fun=bt_loss_and_grad,
+ x0=initial_ratings,
+ args=(matchups, outcomes, weights, alpha),
+ jac=True,
+ method='L-BFGS-B',
+ options={
+ 'disp': False,
+ 'maxiter': 100,
+ 'gtol': tol
+ },
+ )
+ return result['x']
+
+
+def scale_and_offset(
+ ratings,
+ models,
+ scale: float = 400.0,
+ init_rating: float = 1000.0,
+ baseline_model: str = None,
+ baseline_rating: float = 1000.0,
+):
+ """convert ratings from the natural scale to the Elo rating scale with an
+ anchored baseline."""
+ scaled_ratings = (ratings * scale) + init_rating
+
+ if baseline_model is not None:
+ if baseline_model in models:
+ baseline_idx = models.index(baseline_model)
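+            # Shift every rating so the baseline model is anchored exactly at baseline_rating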
+ scaled_ratings += baseline_rating - scaled_ratings[...,
+ [baseline_idx]]
+
+ return scaled_ratings
+
+
+def compute_bt(
+ df,
+ base: float = 10.0,
+ scale: float = 400.0,
+ init_rating: float = 1000.0,
+ baseline_model: str = None,
+ baseline_rating: float = 1000.0,
+ tol: float = 1e-6,
+):
+ matchups, outcomes, models, weights = preprocess_for_bt(df)
+ ratings = fit_bt(matchups, outcomes, weights, len(models), math.log(base),
+ tol)
+
+ scaled_ratings = scale_and_offset(
+ ratings=ratings,
+ models=models,
+ scale=scale,
+ init_rating=init_rating,
+ baseline_model=baseline_model,
+ baseline_rating=baseline_rating,
+ )
+
+ return pd.Series(scaled_ratings, index=models).sort_values(ascending=False)
+
+
+def compute_bootstrap_bt(
+ battles,
+ num_round: int,
+ base: float = 10.0,
+ scale: float = 400.0,
+ init_rating: float = 1000.0,
+ baseline_model: str = None,
+ baseline_rating: float = 1000.0,
+ tol: float = 1e-6,
+ num_cpu: int = None,
+):
+ matchups, outcomes, models, weights = preprocess_for_bt(battles)
+ # bootstrap sample the unique outcomes and their counts directly using the multinomial distribution
+ rng = np.random.default_rng(seed=0)
+ idxs = rng.multinomial(n=len(battles),
+ pvals=weights / weights.sum(),
+ size=(num_round))
+ # only the distribution over their occurrence counts changes between samples (and it can be 0)
+ boot_weights = idxs.astype(np.float64) / len(battles)
+
+ # the only thing different across samples is the distribution of weights
+ bt_fn = partial(fit_bt,
+ matchups,
+ outcomes,
+ n_models=len(models),
+ alpha=np.log(base),
+ tol=tol)
+ with mp.Pool(num_cpu if num_cpu else os.cpu_count() - 1) as pool:
+ results = list(
+ tqdm(pool.imap_unordered(bt_fn, boot_weights), total=num_round))
+
+ ratings = np.array(results)
+
+ scaled_ratings = scale_and_offset(
+ ratings=ratings,
+ models=models,
+ scale=scale,
+ init_rating=init_rating,
+ baseline_model=baseline_model,
+ baseline_rating=baseline_rating,
+ )
+
+ df = pd.DataFrame(scaled_ratings, columns=models)
+ return df[df.median().sort_values(ascending=False).index]
+
+
+DIFF_MASK = np.array(
+ [1.0, -1.0], dtype=np.float64
+) # create globally to not incur the instantiation cost in each call
+
+
+def contextual_bt_loss_and_grad(
+ params,
+ n_competitors,
+ matchups,
+ features,
+ outcomes,
+ alpha=1.0,
+ reg=1.0,
+ half_reg=0.5,
+):
+ reg_loss = half_reg * np.inner(params, params)
+
+ # Split params into ratings and feature parameters
+ ratings = params[:n_competitors]
+ feature_params = params[n_competitors:]
+
+ matchup_ratings = ratings[matchups]
+ bt_logits = alpha * (matchup_ratings[:, 0] - matchup_ratings[:, 1])
+ context_logits = np.dot(features, feature_params)
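+    # Win probability for model_a combines the rating difference with the style/control feature effects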
+ probs = expit(bt_logits + context_logits)
+ loss = (-((np.log(probs) * outcomes + np.log(1.0 - probs) *
+ (1.0 - outcomes))).sum() + reg_loss)
+
+ error = outcomes - probs
+ grad = reg * params # initialize the grad as the regularization grad
+ matchups_grads = -alpha * error
+ np.add.at(grad[:n_competitors], matchups[:, [0, 1]],
+ matchups_grads[:, None] * DIFF_MASK)
+ grad[n_competitors:] -= np.dot(features.T, error)
+ return loss, grad
+
+
+# note on regularization:
+# default reg is 0.5 since the LogisticRegression default is 1.0
+# in the original implementation, matchups were duplicated
+# that made the ratio of log loss to reg loss "twice as high"
+# in this non-duplicated version for parity we also reduce the reg by one half to match
+def fit_contextual_bt(
+ matchups,
+ features,
+ outcomes,
+ models,
+ idxs=None,
+ alpha=math.log(10.0),
+ reg=0.5,
+ tol=1e-6,
+):
+ n_features = features.shape[1]
+ n_models = len(models)
+ initial_params = np.zeros(n_models + n_features, dtype=np.float64)
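+    # The first n_models entries are model ratings; the remaining entries are feature coefficients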
+ half_reg = reg / 2.0
+
+ # sample idxs optionally allow for fitting on a bootstrap sample of the dataset
+ if idxs is not None:
+ matchups, features, outcomes = matchups[idxs], features[
+ idxs], outcomes[idxs]
+
+ result = minimize(
+ fun=contextual_bt_loss_and_grad,
+ x0=initial_params,
+ args=(n_models, matchups, features, outcomes, alpha, reg, half_reg),
+ jac=True,
+ method='L-BFGS-B',
+ options={
+ 'disp': False,
+ 'maxiter': 100,
+ 'gtol': tol
+ },
+ )
+ return result['x']
+
+
+def compute_style_control(
+ df: pd.DataFrame,
+ alpha: float = math.log(10.0),
+ reg: float = 0.5,
+ scale: float = 400.0,
+ init_rating: float = 1000.0,
+ baseline_model: str = None,
+ baseline_rating: float = 1000.0,
+ normalize_style_features: bool = True,
+ control_variables: List[str] = None,
+ odds_ratio: bool = True,
+ tol: float = 1e-6,
+):
+ if control_variables is not None:
+ _df = pd.get_dummies(
+ data=df,
+ columns=control_variables,
+            # Since the model is fitted without an intercept, we keep all levels of each categorical variable
+            drop_first=False,
+ )
+
+        # Collect the one-hot encoded column names of the control variables
+ one_hot_ctrls = []
+ for col in _df.columns:
+ for ctrl_var in control_variables:
+ if col.startswith(ctrl_var):
+ one_hot_ctrls.append(col)
+ break
+
+ matchups, features, outcomes, models = preprocess_for_style(
+ _df,
+ normalize_style_features=normalize_style_features,
+ style_variables=STYLE_CONTROL_VARIABLES_V1,
+ control_variables=one_hot_ctrls,
+ )
+ ratings_params = fit_contextual_bt(
+ matchups,
+ features,
+ outcomes,
+ models=models,
+ alpha=alpha,
+ reg=reg,
+ tol=tol,
+ )
+ ratings = ratings_params[:len(models)]
+
+ if odds_ratio:
+ params = np.exp(ratings_params[len(models):])
+ else:
+ params = ratings_params[len(models):]
+
+ scaled_ratings = scale_and_offset(
+ ratings=ratings,
+ models=models,
+ scale=scale,
+ init_rating=init_rating,
+ baseline_model=baseline_model,
+ baseline_rating=baseline_rating,
+ )
+ scaled_ratings = pd.Series(scaled_ratings,
+ index=models).sort_values(ascending=False)
+
+ control_coefficients = {
+ k: v
+ for k, v in zip(STYLE_CONTROL_VARIABLES_V1 + one_hot_ctrls, params)
+ }
+
+ return scaled_ratings, control_coefficients
+
+
+def compute_bootstrap_style_control(
+ df,
+ num_round: int,
+ alpha: float = math.log(10.0),
+ reg: float = 0.5,
+ scale: float = 400.0,
+ init_rating: float = 1000.0,
+ baseline_model: str = None,
+ baseline_rating: float = 1000.0,
+ normalize_style_features: bool = True,
+ control_variables: List[str] = None,
+ odds_ratio: bool = True,
+ tol: float = 1e-6,
+ num_cpu: int = None,
+):
+ if control_variables is not None:
+ _df = pd.get_dummies(
+ data=df,
+ columns=control_variables,
+            # Since the model is fitted without an intercept, we keep all levels of each categorical variable
+            drop_first=False,
+ )
+
+        # Collect the one-hot encoded column names of the control variables
+ one_hot_ctrls = []
+ for col in _df.columns:
+ for ctrl_var in control_variables:
+ if col.startswith(ctrl_var):
+ one_hot_ctrls.append(col)
+ break
+
+ matchups, features, outcomes, models = preprocess_for_style(
+ _df,
+ normalize_style_features=normalize_style_features,
+ style_variables=STYLE_CONTROL_VARIABLES_V1,
+ control_variables=one_hot_ctrls,
+ )
+
+ contextual_bt_fn = partial(
+ fit_contextual_bt,
+ matchups,
+ features,
+ outcomes,
+ models,
+ alpha=alpha,
+ reg=reg,
+ tol=tol,
+ )
+
+ boot_idxs = np.random.randint(low=0,
+ high=matchups.shape[0],
+ size=(num_round, matchups.shape[0]))
+
+ with mp.Pool(num_cpu if num_cpu else os.cpu_count()) as pool:
+ results = list(
+ tqdm(pool.imap_unordered(contextual_bt_fn, boot_idxs),
+ total=num_round))
+
+ ratings_params = np.array(results)
+ ratings = ratings_params[:, :len(models)]
+
+ if odds_ratio:
+ params = np.exp(ratings_params[:, len(models):].mean(axis=0))
+ else:
+ params = ratings_params[:, len(models):].mean(axis=0)
+
+ scaled_ratings = scale_and_offset(
+ ratings=ratings,
+ models=models,
+ scale=scale,
+ init_rating=init_rating,
+ baseline_model=baseline_model,
+ baseline_rating=baseline_rating,
+ )
+ df = pd.DataFrame(scaled_ratings, columns=models)
+
+ control_coefficients = {
+ k: v
+ for k, v in zip(STYLE_CONTROL_VARIABLES_V1 + one_hot_ctrls, params)
+ }
+
+ return df[df.median().sort_values(
+ ascending=False).index], control_coefficients
+
+
+class CompassArenaBradleyTerrySummarizer(DefaultSubjectiveSummarizer):
+ """Summarizer for fitting and Bradley-Terry model to pairwise matchups
+ according to https://github.com/lm-sys/FastChat/tree/main.
+
+ Args:
+ config (ConfigDict): The configuration object of the evaluation task. It's expected to be filled out at runtime.
+ dataset_abbrs (Optional[List[str]], optional): Dataset abbreviations to be listed in the summary. Defaults to None.
+ summary_groups (List, optional): Passed to DefaultSubjectiveSummarizer. Not used for this class. Defaults to None.
+ prompt_db (_type_, optional): Legacy parameter kept for backward compatibility. Defaults to None.
+ rating_system (str, optional): Rating system used. Currently only supports "bradleyterry". Defaults to "bradleyterry".
+ num_bootstrap (int, optional): The number of bootstraps for estimating the confidence intervals. Defaults to 300.
+ num_cpu (int, optional): The number of CPUs to use for the BT bootstrapping process. Defaults to None.
+ with_control_vars (bool, optional): Whether to include additional covariates (including style features and group variables) when fitting the BT model. Defaults to True.
+ normalize_style_features (bool, optional): Whether to normalize style features BEFORE fitting the BT model (implementation by FastChat). Turn this off for easier interpretation of odds ratios (when odds_ratio==True). Defaults to True.
+ odds_ratio (bool, optional): Whether to report odds ratios (np.exp(beta_k)) instead of the original coefficients. Defaults to True.
+ groups (List[str], optional): Group variables to include while fitting the BT model. These must be available in the input dataset for each observation. Defaults to None.
+ """
+
+ def __init__(
+ self,
+ config: ConfigDict,
+ dataset_abbrs: Optional[List[str]] = None,
+ summary_groups: List = None,
+ prompt_db=None,
+ rating_system: str = 'bradleyterry',
+ num_bootstrap: int = 300,
+ num_cpu: int = None,
+ with_control_vars: bool = True,
+ normalize_style_features: bool = True,
+ odds_ratio: bool = True,
+ groups: List[str] = None,
+ ) -> None:
+ summary_groups = [] if summary_groups is None else summary_groups
+ super().__init__(config, dataset_abbrs, summary_groups, prompt_db)
+
+ self.summarizer_cfg = self.cfg['summarizer']
+ self.rating_system = 'bradleyterry' # Only bradleyterry supported
+ self.num_bootstrap = num_bootstrap
+ self.num_cpu = num_cpu
+ self.with_control_vars = with_control_vars
+ self.normalize_style_features = normalize_style_features
+ self.odds_ratio = odds_ratio
+ self.groups = [] if groups is None else groups
+
+ def _pick_up_results(self, judge_abbr):
+ """The function reads the numerical results of evaluations from the
+ output folder based on the configuration file, and ultimately returns
+ four dictionaries, each containing processed information in different
+ formats. The contents of the four dictionaries are as follows:
+
+ - raw_results: contains the raw results of each model on each dataset (excluding details).
+ - parsed_results: contains the results of each model on each dataset for each metric, with metrics in METRIC_BLACKLIST being ignored.
+ - dataset_metrics: contains the list of metrics for each dataset, consistent with the metrics in parsed_results. The list is ordered according to the METRIC_WHITELIST,
+ with metrics appearing earlier considered more important.
+ - dataset_eval_mode: contains the evaluation mode for each dataset.
+ """
+ # raw_results: {model_abbr: {dataset_abbr: result}}
+ raw_results: Dict[str, Dict[str, Any]] = {}
+ # # parsed_results: {model_abbr: {dataset_abbr: {metric: score}}}
+ # parsed_results: Dict[str, Dict[str, Dict[str, float]]] = {}
+ # # dataset_metrics: {dataset_abbr: [metric]}
+ # dataset_metrics: Dict[str, List[str]] = {}
+
+ for model in self.model_cfgs:
+ model_abbr = model_abbr_from_cfg_used_in_summarizer(model)
+ # parsed_results.setdefault(model_abbr, {})
+ # raw_results.setdefault(model_abbr, {})
+
+ for dataset in self.dataset_cfgs:
+ base_models = dataset.get('base_models', None)
+ if base_models is None:
+ raise ValueError(
+                    'CompassArenaBradleyTerrySummarizer requires at least one base model specified in `base_models` in the dataset config.'
+ )
+
+ base_models_list = [item['abbr'] for item in base_models]
+
+ dataset_abbr = dataset_abbr_from_cfg(dataset)
+ raw_results.setdefault(dataset_abbr, {})
+
+ for base_model_abbr in base_models_list:
+ raw_results[dataset_abbr].setdefault(base_model_abbr, [])
+
+ origin_path = get_infer_output_path(
+ model, dataset, osp.join(self.work_dir, 'results'))
+ if base_model_abbr != '':
+ temp_path, dataset_json_name = (
+ origin_path.rsplit('/', 1)[0],
+ origin_path.rsplit('/', 1)[1],
+ )
+ filepath = osp.join(
+ temp_path.rsplit('/', 1)[0],
+ base_model_abbr + '_' +
+ temp_path.rsplit('/', 1)[1] + '_judged-by--' +
+ judge_abbr,
+ dataset_json_name,
+ )
+ else:
+ filepath = osp.join(
+ origin_path.rsplit('/', 1)[0] + '_judged-by--' +
+ judge_abbr,
+ origin_path.rsplit('/', 1)[1],
+ )
+ if not osp.exists(filepath):
+ continue
+
+ result = mmengine.load(filepath)
+ result.pop('details', None)
+
+ # raw_results[dataset_abbr] = result
+ raw_results[dataset_abbr][base_model_abbr].extend(
+ result['matches'])
+
+ if 'error' in result:
+ self.logger.debug(
+ f'error in {model_abbr} {dataset_abbr} {result["error"]}'
+ )
+ continue
+
+ # dataset_eval_mode: {dataset_abbr: eval_mode}
+ dataset_eval_mode: Dict[str, str] = {}
+ for dataset in self.dataset_cfgs:
+ inferencer = (dataset.get('infer_cfg', {}).get('inferencer',
+ {}).get('type', ''))
+ inferencer = (inferencer if isinstance(inferencer, str) else
+ inferencer.__name__)
+ dataset_abbr = dataset_abbr_from_cfg(dataset)
+ if 'GenInferencer' in inferencer:
+ dataset_eval_mode[dataset_abbr] = 'gen'
+ elif 'PPLInferencer' in inferencer:
+ dataset_eval_mode[dataset_abbr] = 'ppl'
+ elif 'LLInferencer' in inferencer:
+ dataset_eval_mode[dataset_abbr] = 'll'
+ else:
+ dataset_eval_mode[dataset_abbr] = 'unknown'
+ self.logger.warning(
+ f'unknown inferencer: {inferencer} - {dataset_abbr}')
+
+ # return raw_results, parsed_results, dataset_metrics, dataset_eval_mode
+ return raw_results, dataset_eval_mode
+
+ def _calculate_ratings(
+ self,
+ matches: Dict,
+ base_model: str = None,
+ groups: List[str] = None,
+ ) -> Tuple[pd.DataFrame, Dict]:
+
+ rating_system = self.rating_system
+ num_bootstrap = self.num_bootstrap
+ num_cpu = self.num_cpu
+ with_control_vars = self.with_control_vars
+
+ matches_df = pd.DataFrame(matches)
+
+ num_battles = (matches_df['model_a'].value_counts().add(
+ matches_df['model_b'].value_counts(), fill_value=0))
+
+ # if rating_system == "bradleyterry":
+ if with_control_vars:
+ bootstrap_df, bootstrap_coef = compute_bootstrap_style_control(
+ df=matches_df,
+ num_round=num_bootstrap,
+ baseline_model=base_model,
+ normalize_style_features=self.normalize_style_features,
+ control_variables=groups,
+ odds_ratio=self.odds_ratio,
+ )
+ elo_rating_final, coef_final = compute_style_control(
+ df=matches_df,
+ baseline_model=base_model,
+ normalize_style_features=self.normalize_style_features,
+ control_variables=groups,
+ odds_ratio=self.odds_ratio,
+ )
+ else:
+ bootstrap_df = compute_bootstrap_bt(
+ battles=matches_df,
+ num_round=num_bootstrap,
+ baseline_model=base_model,
+ num_cpu=num_cpu,
+ )
+ elo_rating_final = compute_bt(
+ df=matches_df,
+ baseline_model=base_model,
+ )
+
+ # print(elo_rating_final)
+
+ # elif rating_system == "elo":
+ # bootstrap_df = compute_bootstrap_elo(
+ # df=matches_df,
+ # num_round=num_bootstrap,
+ # num_cpu=num_cpu,
+ # )
+ # elo_rating_final = compute_elo(matches_df)
+
+ model_rating_q025 = bootstrap_df.quantile(0.025)
+ model_rating_q975 = bootstrap_df.quantile(0.975)
+
+ # compute ranking based on CI
+ model_order = list(elo_rating_final.index)
+
+ ranking = {}
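+        # ranking_ub for a model is 1 plus the number of models whose lower CI bound exceeds its upper CI bound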
+ for i, model_a in enumerate(model_order):
+ ranking[model_a] = 1
+ for j, model_b in enumerate(model_order):
+ if i == j:
+ continue
+ if model_rating_q025[model_b] > model_rating_q975[model_a]:
+ ranking[model_a] += 1
+
+ leaderboard_table_df = pd.DataFrame(
+ {
+ 'rating': elo_rating_final,
+ 'ranking_ub': pd.Series(ranking),
+ 'std_dev': bootstrap_df.std(),
+ 'rating_q975': model_rating_q975,
+ 'rating_q025': model_rating_q025,
+ 'num_battles': num_battles,
+ }, )
+ leaderboard_table_df['model_name'] = leaderboard_table_df.index
+
+ leaderboard_table_df.sort_values(
+ by=['rating'],
+ ascending=False,
+ inplace=True,
+ )
+ leaderboard_table_df['ranking'] = np.arange(
+ 1,
+ len(leaderboard_table_df) + 1)
+
+ if rating_system == 'bradleyterry' and with_control_vars:
+ control_coefficients = {
+ 'bootstrap': bootstrap_coef,
+ 'final': coef_final,
+ }
+ else:
+ control_coefficients = {'final': []}
+
+ return leaderboard_table_df, control_coefficients['final']
+
+ def _output_to_file(
+ self,
+ output_path,
+ time_str: str,
+ tables: Dict,
+ metadata: Dict,
+ judge_abbr: str,
+ dataset_eval_mode: str,
+ ):
+ # Output to file
+ if output_path is None:
+ output_path = osp.join(self.work_dir, 'summary',
+ f'summary_{time_str}.json')
+ output_csv_path = osp.join(self.work_dir, 'summary',
+ f'summary_{time_str}.csv')
+ else:
+ output_csv_path = output_path.replace('.json', '.csv')
+ output_path = output_path.split(
+ '.json')[0] + '_by_' + judge_abbr + '.json'
+
+ output_dir = osp.split(output_path)[0]
+ mmengine.mkdir_or_exist(output_dir)
+
+ with open(output_path, 'w', encoding='utf-8') as f:
+ json.dump(metadata, f, ensure_ascii=False, indent=4)
+ self.logger.info(f'write summary to {osp.abspath(output_path)}')
+
+ prompt_version = {
+ dataset_abbr_from_cfg(d): get_prompt_hash(d)[:6]
+ for d in self.dataset_cfgs
+ }
+
+ full_results = []
+ for base_model_abbr, datasets in tables.items():
+ base_model_results = []
+ for dataset_abbr, table_df in datasets.items():
+ table_df['dataset'] = dataset_abbr
+ table_df['version'] = prompt_version.get(dataset_abbr, '-')
+ table_df['metric'] = 'bt_rating'
+ table_df['mode'] = dataset_eval_mode[dataset_abbr]
+ table_df['base_model'] = base_model_abbr
+
+ base_model_results.append(table_df)
+
+ cur_base_model_result_df = pd.concat(base_model_results)
+ full_results.append(cur_base_model_result_df)
+
+ full_results_df = pd.concat(full_results)
+ full_results_df = full_results_df[[
+ 'dataset',
+ 'version',
+ 'base_model',
+ 'metric',
+ 'mode',
+ 'ranking',
+ 'ranking_ub',
+ 'model_name',
+ 'rating',
+ 'rating_q975',
+ 'rating_q025',
+ 'std_dev',
+ 'num_battles',
+ ]]
+
+ output_csv_path = (output_csv_path.split('.csv')[0] + '_by_' +
+ judge_abbr + '.csv')
+
+ with pd.option_context(
+ 'display.max_rows',
+ 20,
+ 'display.max_columns',
+ 20,
+ 'display.expand_frame_repr',
+ False,
+ ):
+ print(full_results_df.reset_index(drop=True).round(2))
+
+ full_results_df.to_csv(
+ output_csv_path,
+ index=False,
+ )
+ self.logger.info(f'write csv to {osp.abspath(output_csv_path)}')
+
+ def flip_dict_levels(self, original_dict: Dict):
+ """Flips the two levels of a nested dictionary so that dict[lvl1][lvl2]
+ becomes dict[lvl2][lvl1].
+
+ Args:
+ original_dict (dict): The original nested dictionary.
+
+ Returns:
+ dict: The flipped dictionary.
+ """
+ flipped_dict = {}
+ for lvl1, lvl2_dict in original_dict.items():
+ for lvl2, value in lvl2_dict.items():
+ if lvl2 not in flipped_dict:
+ flipped_dict[lvl2] = {}
+ flipped_dict[lvl2][lvl1] = value
+
+ return flipped_dict
+
+ def summarize(
+ self,
+ output_path: str = None,
+ time_str: str = datetime.now().strftime('%Y%m%d_%H%M%S'),
+ ):
+ """Summarize evaluation results and format output table.
+
+ Args:
+ output_path (str, optional): Output path. Defaults to None.
+ time_str (str, optional): Timestamp for file suffix. Defaults to
+ datetime.now().strftime('%Y%m%d_%H%M%S').
+ """
+ all_scores = {}
+ for judge_model in self.judge_models:
+ control_coefficients = {}
+ leaderboard_tables = {}
+
+ judge_abbr = model_abbr_from_cfg(judge_model)
+
+ # pick up results
+ raw_results, dataset_eval_mode = self._pick_up_results(judge_abbr)
+
+ all_matches = []
+ for dataset_abbr, base_models in raw_results.items():
+ control_coefficients[dataset_abbr] = {}
+ leaderboard_tables[dataset_abbr] = {}
+
+ dataset_matches = base_models[list(base_models)[0]]
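+                # Pool only the first base model's matches from each dataset for the combined fit below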
+ all_matches.extend(dataset_matches)
+
+ for base_model_abbr, matches in base_models.items():
+ cur_table_df, cur_ctrl_coefs = self._calculate_ratings(
+ matches=matches,
+ base_model=base_model_abbr,
+ groups=self.groups,
+ )
+
+ control_coefficients[dataset_abbr][
+ base_model_abbr] = cur_ctrl_coefs
+ leaderboard_tables[dataset_abbr][
+ base_model_abbr] = cur_table_df
+
+ print('-' * 10 +
+ f"{dataset_abbr + ':' + base_model_abbr}\n" +
+ '-' * 10)
+ # print(cur_table_df)
+ print(cur_ctrl_coefs)
+
+ leaderboard_tables = self.flip_dict_levels(leaderboard_tables)
+
+ # Output to .json / .csv files
+ self._output_to_file(
+ output_path=output_path,
+ time_str=time_str,
+ tables=leaderboard_tables,
+ metadata=control_coefficients,
+ judge_abbr=judge_abbr,
+ dataset_eval_mode=dataset_eval_mode,
+ )
+
+            # Fit another BT model using the first base model as baseline and the combined matches from all datasets
+ all_scores_df, all_scores_ctrl_coefs = self._calculate_ratings(
+ matches=all_matches,
+ base_model=list(base_models)[0],
+ groups=self.groups,
+ )
+
+ all_scores[judge_abbr] = pd.Series(
+ all_scores_df['rating'],
+ index=all_scores_df['model_name'],
+ ).to_dict()
+
+ print(f'{all_scores=}')
+ print(f'{all_scores_ctrl_coefs=}')
+
+ return {'CompassArenaSubjBenchBradleyTerry': all_scores}