OpenCompass

mirror of https://github.com/open-compass/opencompass.git synced 2025-05-30 16:03:24 +08:00

Author	SHA1	Message	Date
Alexander Lam	7f2aeeff26	added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings (#1815 )	2025-01-10 18:20:25 +08:00
Alexander Lam	dc6035cfcb	[Feature] Added Bradley-Terry subjective evaluation	2024-12-31 11:01:23 +08:00
Songyang Zhang	98435dd98e	[Feature] Update o1 evaluation with JudgeLLM (#1795 ) * Update Generic LLM Evaluator * Update o1 style evaluator	2024-12-30 17:31:00 +08:00
bittersweet1999	357ce8c7a4	[Fix] Fix model summarizer abbr (#1789 ) * fix pip version * fix pip version * fix model summarizer abbr --------- Co-authored-by: root <bittersweet1999>	2024-12-27 14:45:08 +08:00
bittersweet1999	38dba9919b	[Fix] Fix Subjective summarizer order error (#1767 ) * fix pip version * fix pip version * fix order error	2024-12-18 13:21:31 +08:00
Alexander Lam	1bd594fc62	[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751 ) * fix lint issues * updated gitignore * changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py * added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model	2024-12-16 13:41:28 +08:00
zhulinJulia24	aeded4c4db	add new dataset summerizer (#1758 ) add new dataset summerizer	2024-12-13 09:50:43 +08:00
zhulinJulia24	a1c00cc8b7	[ci] add common_summarizer return (#1724 ) * Update common_summarizer.py * Update common_summarizer.py	2024-12-11 20:38:32 +08:00
bittersweet1999	08d63b5bf3	[Fix] Fix error in subjective default summarizer (#1740 ) * fix pip version * fix pip version * fix summarizer bug	2024-12-06 11:03:53 +08:00
Chang Cheng	fd7aa83c01	[Update] Update DLC Runner(#1662 ) * push interntrain hard code * push interntrain hard code * remove redundant post process --------- Co-authored-by: changcheng <changcheng@pjlab.org.cb> Co-authored-by: changcheng <changcheng@pjlab.org.cn>	2024-11-07 15:45:35 +08:00
Linchen Xiao	df57c08ccf	[Feature] Update Models, Summarizers (#1600 )	2024-10-29 18:37:15 +08:00
BigDong	2542bc6907	[Feature] Support results saving as md format table (#1638 )	2024-10-25 15:50:33 +08:00
Linchen Xiao	be3c06a158	[Fix] Update common summarizer regex extraction (#1631 )	2024-10-22 14:35:45 +08:00
Haoran Que	4fe251729b	Upload HelloBench (#1607 ) * upload hellobench * update hellobench * update readme.md * update eval_hellobench.py * update lastest --------- Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>	2024-10-15 17:11:37 +08:00
bittersweet1999	fa54aa62f6	[Feature] Add Judgerbench and reorg subeval (#1593 ) * fix pip version * fix pip version * update (#1522) Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn> * [Feature] Update Models (#1518) * Update Models * Update * Update humanevalx * Update * Update * [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527) add judgerbench and reorg sub add judgerbench and reorg subeval add judgerbench and reorg subeval * add judgerbench and reorg subeval * add judgerbench and reorg subeval * add judgerbench and reorg subeval * add judgerbench and reorg subeval --------- Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com> Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn> Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com> Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>	2024-10-15 16:36:05 +08:00
bittersweet1999	3f7a3730d7	[Fix] fix Flames (#1599 ) * fix pip version * fix pip version * fix flames * fix flames	2024-10-12 14:34:59 +08:00
zhulinJulia24	87df8a73a3	[CI] add a common summarizer for qabench summarizer (#1545 ) * update * update * update --------- Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>	2024-09-25 13:40:47 +08:00
bittersweet1999	7c7fa36235	[Feature] add support for internal Followbench (#1511 ) * fix pip version * fix pip version * add internal followbench * add internal followbench * fix lint * fix lint	2024-09-11 13:32:34 +08:00
bittersweet1999	c2bcd8725e	[Fix] Fix wildbench (#1508 ) * fix pip version * fix pip version * fix_wildbench	2024-09-10 17:35:07 +08:00
bittersweet1999	ce7f4853ce	[Fix] Sub summarizer order fix (#1426 ) * fix pip version * fix pip version * fix sub summarizer order * fix order	2024-08-15 21:08:18 +08:00
Linchen Xiao	8e55c9c6ee	[Update] Compassbench v1.3 (#1396 ) * stash files * compassbench subjective evaluation added * evaluation update * fix lint * update docs * Update lint * changes saved * changes saved * CompassBench subjective summarizer added (#1349) * subjective summarizer added * fix lint [Fix] Fix MathBench (#1351) Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn> [Update] Update model support list (#1353) * fix pip version * fix pip version * update model support subjective summarizer updated knowledge, math objective done (data need update) remove secrets objective changes saved knowledge data added * secrets removed * changed added * summarizer modified * summarizer modified * compassbench coding added * fix lint * objective summarizer updated * compass_bench_v1.3 updated * update files in config folder * remove unused model * lcbench modified * removed model evaluation configs * remove duplicated sdk implementation --------- Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>	2024-08-12 19:09:19 +08:00
Songyang Zhang	704853e5e7	[Feature] Update pip install (#1324 ) * [Feature] Update pip install * Update Configuration * Update * Update * Update * Update Internal Config * Update collect env	2024-07-29 18:32:50 +08:00
jxd	12b84aeb3b	[Feature] Update CHARM Memeorziation (#1230 ) * update gemini api and add gemini models * add openai models * update CHARM evaluation * add CHARM memorization tasks * add CharmMemSummarizer (output eval details for memorization-independent reasoning analysis * update CHARM readme --------- Co-authored-by: wujiang <wujiang@pjlab.org.cn>	2024-07-26 18:42:30 +08:00
WANG WENJIN	0aad8199c7	Fix the summary error in subjective.py (#1363 )	2024-07-25 18:36:13 +08:00
Linchen Xiao	8127fc3518	CompassBench subjective summarizer added (#1349 ) * subjective summarizer added * fix lint	2024-07-23 12:29:57 +08:00
Mo Li	104bddf647	[Doc] Update NeedleBench Docs (#1330 ) * update needlebench docs * update model_name_mapping dict * update README * Update README_zh-CN.md --------- Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>	2024-07-18 13:16:19 +08:00
bittersweet1999	8e7ad2e981	[Fix] add bc for alignbench summarizer (#1306 ) * fix pip version * fix pip version * fix alignbench * fix import error	2024-07-12 11:06:20 +08:00
bittersweet1999	68ca48496b	[Refactor] Reorganize subjective eval (#1284 ) * fix pip version * fix pip version * reorganize subjective eval * reorg sub * reorg subeval * reorg subeval * update subjective doc * reorg subeval * reorg subeval	2024-07-05 22:11:37 +08:00
Fengzhe Zhou	a32f21a356	[Sync] Sync with internal codes 2024.06.28 (#1279 )	2024-06-28 14:16:34 +08:00
klein	1fa62c4a42	Support wildbench (#1266 ) Co-authored-by: Leymore <zfz-960727@163.com>	2024-06-24 13:16:27 +08:00
bittersweet1999	982e024540	[Feature] add dataset Fofo (#1224 ) * add fofo dataset * add dataset fofo	2024-06-06 11:40:48 +08:00
Xingyuan Bu	02a0a4e857	MT-Bench-101 (#1215 ) * add mt-bench-101 * add readme and requirements * add mt-bench-101 data * Update readme_mtbench101.md * update readme * update leaderboard * fix typo * Update readme_mtbench101.md * fit newest opencompass * update readme.md * mtbench101 to opencompass * mtbench101 to opencompass * for code review * for code review * for code review * hook * hook --------- Co-authored-by: liujie <ljie@buaa.edu.cn>	2024-06-03 14:52:12 +08:00
bittersweet1999	7c381e5be8	[Fix] fix summarizer (#1217 ) * fix summarizer * fix summarizer	2024-05-31 11:40:47 +08:00
Fengzhe Zhou	a77b8a5cec	[Sync] format (#1214 )	2024-05-30 00:21:58 +08:00
Fengzhe Zhou	2954913d9b	[Sync] bump version (#1204 )	2024-05-28 23:09:59 +08:00
bittersweet1999	8a8987be0b	fix arenahard summarizer (#1154 ) Co-authored-by: Leymore <zfz-960727@163.com>	2024-05-15 13:31:29 +08:00
Fengzhe Zhou	7505b3cadf	[Feature] Add huggingface apply_chat_template (#1098 ) * add TheoremQA with 5-shot * add huggingface_above_v4_33 classes * use num_worker partitioner in cli * update theoremqa * update TheoremQA * add TheoremQA * rename theoremqa -> TheoremQA * update TheoremQA output path * rewrite many model configs * update huggingface * further update * refine configs * update configs * update configs * add configs/eval_llama3_instruct.py * add summarizer multi faceted * update bbh datasets * update configs/models/hf_llama/lmdeploy_llama3_8b_instruct.py * rename class * update readme * update hf above v4.33	2024-05-14 14:50:16 +08:00
Mo Li	6c711cb262	[Fix] Fix Needlebench Summarizer (#1143 ) * update few-shot example * add 128k	2024-05-13 15:59:34 +08:00
Alexander Lam	35c94d0cde	[Feature] Adding support for LLM Compression Evaluation (#1108 ) * fixed formatting based on pre-commit tests * fixed typo in comments; reduced the number of models in the eval config * fixed a bug in LLMCompressionDataset, where setting samples=None would result in passing test[:None] to load_dataset * removed unnecessary variable in _format_table_pivot; changed lark_reporter message to English	2024-04-30 10:51:01 +08:00
liushz	a6f67e1a65	[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README (#1103 ) * Add Math Evaluation with Judge Model Evaluator * Add Math Evaluation with Judge Model Evaluator * Add Math Evaluation with Judge Model Evaluator * Add Math Evaluation with Judge Model Evaluator * Fix Llama-3 meta template * Fix MATH with JudgeLM Evaluation * Fix MATH with JudgeLM Evaluation * Fix MATH with JudgeLM Evaluation * Fix MATH with JudgeLM Evaluation --------- Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>	2024-04-28 21:58:58 +08:00
Yggdrasill7D6	58a57a4c45	[Feature] add support for Flames datasets (#1093 ) * add flames datasets * fix lint * rm quota * add judgemodel info and fix os path * support flames dataset * support flames dataset --------- Co-authored-by: bittersweet1999 <1487910649@qq.com>	2024-04-28 18:56:24 +08:00
klein	e4830a6926	Update CIBench (#1089 ) * modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4 * update cibench: dataset and evluation * cibench summarizer bug * update cibench * move extract_code import --------- Co-authored-by: zhangchuyu@pjlab.org.cn <zhangchuyu@pjlab.org.cn> Co-authored-by: Leymore <zfz-960727@163.com>	2024-04-26 18:46:02 +08:00
bittersweet1999	e404b72c52	[Feature] support arenahard evaluation (#1096 ) * support arenahard * support arenahard * support arenahard	2024-04-26 15:42:00 +08:00
bittersweet1999	6ba1c4937d	[Feature] Support Math evaluation via judgemodel (#1094 ) * support openai math evaluation * support openai math evaluation * support openai math evaluation * support math llm judge * support math llm judge	2024-04-26 14:56:23 +08:00
bittersweet1999	6f98c8d9ab	[Fix] Fix MultiRound Subjective Evaluation(#1043 ) * fix multiround * fix	2024-04-22 12:06:03 +08:00
Fengzhe Zhou	8c85edd1cd	[Sync] deprecate old mbpps (#1064 )	2024-04-19 20:49:46 +08:00
Fengzhe Zhou	b39f501563	[Sync] update taco (#1030 )	2024-04-09 17:50:23 +08:00
Mo Li	16f29b25f1	[Fix] Simplify needlebench summarizer (#1024 ) * Conflicts: configs/summarizers/needlebench.py * fix lint problems	2024-04-07 17:51:13 +08:00
bittersweet1999	2d4e559763	[Feature] Add multi-model judge and fix some problems (#1016 ) * support multi-model judge and moe judge * test_moe * test_moe * test * add moe judge * support multi-judge-model	2024-04-02 11:52:06 +08:00
bittersweet1999	848e7c8a76	[fix] add different temp for different question in mtbench (#954 ) * add temp for mtbench * add document for mtbench * add document for mtbench	2024-03-11 17:24:39 +08:00

83 Commits