OpenCompass

mirror of https://github.com/open-compass/opencompass.git synced 2025-05-30 16:03:24 +08:00

Author	SHA1	Message	Date
Hoter Young	362b281e55	[Feature] Support 3 models (#34) opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_llama_70b_instruct.py opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_qwen_14b_instruct.py opencompass/configs/models/hf_llama/llama3_3_70b_api_siliconflow.py	2025-02-14 22:01:16 +08:00
Hoter Young	879b181c1b	add some features (#32) * [Feature] Support answer extraction of QwQ when evaluating HuSimpleQA * [Feature] Support multi-language summarization in HuSimpleQASummarizer * [Feature] Support DeepSeek-R1-Distill-Qwen_32B_turbomind	2025-02-14 20:44:53 +08:00
Hoter Young	0971777348	[Feature] Support DeepSeek-R1-Distill-Qwen_32B (#30)	2025-02-13 21:42:16 +08:00
Hoter Young	f92a1e5050	[Feature] Support DeepSeek-R1 API from SenseTime (#29)	2025-02-13 20:50:57 +08:00
Hoter Young	6f5c16edc5	[Chores] do some minor changes to HuLifeQA (#27) 1. enlarge token size 2. add two r1 distill models	2025-02-12 21:43:11 +08:00
hoteryoung	23210e089a	[Refactor] Change HuSimpleQA to subjective evaluation	2025-02-12 20:25:03 +08:00
wujiang	60ab611ecd	set deepseek r1 batchsize = 1	2025-02-11 21:55:51 +08:00
wujiang	e261a76e07	set reasoning model max_out_len = 8192	2025-02-11 16:51:05 +08:00
weixingjian	cb664d0cea	add hu prompt for HuMatchingFIB task	2025-02-11 12:20:22 +08:00
wujiang	b4ecd718a0	update examples and configs	2025-02-10 23:08:43 +08:00
wujiang	f55810ae48	[Update] OpenHuEval examples	2025-02-10 23:08:43 +08:00
wujiang	1e1acf9236	add HuSimpleQA	2025-02-10 21:22:45 +08:00
wujiang	5741e38310	rename models	2025-02-10 17:24:24 +08:00
hoteryoung	c3b0803013	support deepseek-r1-distill-qwen-7b and -llama-8b	2025-02-10 17:24:24 +08:00
hoteryoung	f2c17190c9	enable tested reasoning model	2025-02-10 16:51:48 +08:00
weixingjian	9ae714a577	update hustandard and eval details using data version 250205	2025-02-07 18:51:14 +08:00
weixingjian	9395dc2b60	update humatching and eval details using data version 250205	2025-02-07 14:52:51 +08:00
wujiang	8ec47e2b93	add openai model	2025-02-07 14:43:53 +08:00
wujiang	08712f49f2	update HuProverb config and eval	2025-02-04 16:10:50 +08:00
wujiang	7586186897	add deepseek api models	2025-02-04 15:07:34 +08:00
gaojunyuan	f152ccf127	add HuProverbRea dataset (20250203)	2025-02-04 11:06:10 +08:00
wujiang	794ab7c372	add & update openai models	2025-02-02 15:53:55 +08:00
wujiang	2abf6ca795	update HuMatchingFIB	2025-02-02 14:48:58 +08:00
wujiang	273e609b53	update hu_matching_fib_250126	2025-02-02 13:48:40 +08:00
Hoter Young	3939915349	[Update] Update HuLifeQA primary tags (#6)	2025-02-01 14:18:05 +08:00
wujiang	d4df622e02	update HuMatchingFIB config and dataset	2025-01-26 13:48:35 +08:00
Hoter Young	116a24632c	[Feature] Add OpenHuEval-HuLifeQA (#4)	2025-01-24 10:32:17 +08:00
WayneWei	5f72e96d5b	add HuStandardFIB under new paradigm (#3) Co-authored-by: weixingjian <weixingjian@pjlab.org.cn>	2025-01-22 19:32:44 +08:00
weixingjian	6527fdf70a	add HuMatchingFIB under new paradigm	2025-01-22 19:32:44 +08:00
Linchen Xiao	a6193b4c02	[Refactor] Code refactorization (#1831) * Update * fix lint * update * fix lint	2025-01-20 19:17:38 +08:00
Linchen Xiao	531643e771	[Feature] Add support for InternLM3 (#1829) * update	2025-01-16 14:28:27 +08:00
Zhao Qihao	e039f3efa0	[Feature] Support MMLU-CF Benchmark (#1775) * [Feature] Support MMLU-CF Benchmark * Update mmlu-cf * [Feature] Support MMLU-CF Benchmark * Remove outside configs --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-09 14:11:20 +08:00
Songyang Zhang	f1e50d4bf0	[Update] Update LiveMathBench (#1809) * Update LiveMathBench * Update New O1 Evaluation * Update O1 evaluation	2025-01-07 19:16:12 +08:00
Songyang Zhang	8fdb72f567	[Update] Update o1 eval prompt (#1806) * Update XML prediction post-process * Update LiveMathBench * Update New O1 Evaluation	2025-01-07 00:14:32 +08:00
Alexander Lam	f871e80887	[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802) * added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer; * added bradleyterry subjective evaluation method to arena_hard dataset	2025-01-03 16:33:43 +08:00
Linchen Xiao	117dc500ad	[Feature] Add Longbenchv2 support (#1801) * Create eval_longbenchv2.py * Create longbenchv2_gen.py * Update __init__.py * Create longbenchv2.py * Update datasets_info.py * update --------- Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>	2025-01-03 12:04:29 +08:00
liushz	9c980cbc62	[Feature] Add LiveStemBench Dataset (#1794) * [Fix] Fix vllm max_seq_len parameter transfer * Add livestembench dataset * Update livestembench_gen_3e3c50.py * Update eval_livestembench.py	2024-12-31 15:17:39 +08:00
Alexander Lam	dc6035cfcb	[Feature] Added Bradley-Terry subjective evaluation	2024-12-31 11:01:23 +08:00
Songyang Zhang	98435dd98e	[Feature] Update o1 evaluation with JudgeLLM (#1795) * Update Generic LLM Evaluator * Update o1 style evaluator	2024-12-30 17:31:00 +08:00
Junnan Liu	8e8d4f1c64	[Feature] Support G-Pass@k and LiveMathBench (#1772) * support G-Pass@k and livemathbench * fix bugs * fix comments of GPassKEvaluator * update saved details of GPassKEvaluator * fix eval api configs & update openai_api for ease of debugging * update huggingface path * fix method name of G-Pass@k * fix default value of eval_model_name * refactor G-Pass@k evaluator * log generation params for each backend * fix evaluation resume * add notimplementerror	2024-12-30 16:59:39 +08:00
Linchen Xiao	42b54d6bb8	[Update] Add 0shot CoT config for TheoremQA (#1783)	2024-12-27 16:17:27 +08:00
Linchen Xiao	ebefffed61	[Update] Update OC academic 202412 (#1771) * [Update] Update academic settings * Update * update	2024-12-19 18:07:34 +08:00
Chang Lan	d70100cdf2	[Update] Customizable tokenizer for RULER (#1731) * Customizable tokenizer for RULER * Relax requirements	2024-12-19 18:02:11 +08:00
Linchen Xiao	eadbdcb4cb	[Update] Update requirement and deepseek configurations (#1764)	2024-12-17 10:16:47 +08:00
Alexander Lam	1bd594fc62	[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751) * fix lint issues * updated gitignore * changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py) * added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model	2024-12-16 13:41:28 +08:00
liushz	c4ce0174fe	[Fix] Fix ChineseSimpleQA max_out_len (#1757) * add chinese simpleqa config * Update CsimpleQA --------- Co-authored-by: 明念 <heyancheng.hyc@taobao.com>	2024-12-11 19:51:27 +08:00
Linchen Xiao	bd7b705be4	[Update] Update dataset configuration with no max_out_len (#1754)	2024-12-11 18:20:29 +08:00
OpenStellarTeam	1a5b3fc11e	Add Chinese SimpleQA config (#1697) * add chinese simpleqa config * Update CsimpleQA --------- Co-authored-by: 明念 <heyancheng.hyc@taobao.com> Co-authored-by: liushz <qq1791167085@163.com>	2024-12-11 18:03:39 +08:00
Linchen Xiao	0d26b348e4	[Feature] Add OC academic 2412 (#1750)	2024-12-10 21:53:06 +08:00
bittersweet1999	54c0fb7a93	[Change] Change Compassarena metric (#1749) * fix pip version * fix summarizer bug * fix compassarena	2024-12-10 14:45:32 +08:00

132 Commits