OpenCompass

mirror of https://github.com/open-compass/opencompass.git synced 2025-05-30 16:03:24 +08:00

Author	SHA1	Message	Date
Dongsheng Zhu	465e93e10e	[Update] Academic bench llm judge update (#1876 ) * BigCodeBench update * update LCBench * update LCBench 2 * update code * academicBench update * academic bench ifeval&math update * generic_llmjudge_aime_academic_postprocess delete * aime delete * postprocessors update * ifeval delete * update work_dir * linting * linting double-quote-string-fixer * r1-distill out_len update * fix lint --------- Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>	2025-02-24 15:45:24 +08:00
Junnan Liu	046b6f75c6	[Update] Update Greedy Config & README of LiveMathBench (#1862 ) * support omni-math * update config * upload README * Delete opencompass/configs/datasets/omni_math/__init__.py * update greedy config & README of LiveMathBench * update intro for max_out_len * rename livemathbench greedy config * delete greedy config --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-02-20 19:47:04 +08:00
Linchen Xiao	d7daee6e25	[Update] OpenAI model update, bigcodebench update (#1879 ) * [Update] Openai model update, bigcodebench update * update	2025-02-20 19:33:25 +08:00
Linchen Xiao	27c916661d	[Feature] Math Verify with model post_processor (#1881 ) * update * [Feature] Update model post_processor * update * update * update	2025-02-20 19:32:12 +08:00
zhulinJulia24	bc22749fd8	[CI] update daily test scores (#1870 ) * update * Update daily-run-test.yml * Update dlc.py	2025-02-20 14:08:18 +08:00
bittersweet1999	f407930475	[Feature] Support subjective evaluation for reasoning model (#1868 ) * fix pip version * fix pip version * add subeval for reasoning model * add subeval for reasoning model * update configs * update config * update config * update config * update files	2025-02-20 12:19:46 +08:00
Myhs_phz	68a9838907	[Feature] Add list of supported datasets at html page (#1850 ) * feat dataset-index.yml and stat.py * fix * fix * fix * feat url of paper and config file * doc all supported dataset list * docs zh and en * docs README zh and en * docs new_dataset * docs new_dataset	2025-02-14 16:17:30 +08:00
Dongsheng Zhu	3fd8b4e0cd	[Update] Update BigCodeBench & LCBench load path (#1857 ) * BigCodeBench update * update LCBench * update LCBench 2 * update code	2025-02-08 15:15:47 +08:00
Pablo Hinojosa	9c2e6a192c	[Fix] Update broken links in README.md (#1852 )	2025-02-07 15:41:08 +08:00
zhulinJulia24	ffc04cf650	[CI] Update daily-run-test.yml (#1854 )	2025-02-07 14:40:16 +08:00
Linchen Xiao	862bf78464	[Demo] Internlm3 math500 thinking demo (#1846 ) * [Demo] Add demo for Internlm3 math500 thinking * [Demo] Add demo for Internlm3 math500 thinking * update max_out_len * update start instruction	2025-01-24 14:56:41 +08:00
Shudong Liu	412199f802	[Feature] Support OlympiadBench Benchmark (#1841 ) * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * update dataset path * Update OlympiadBench * Update OlympiadBench * Update OlympiadBench --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-24 10:00:01 +08:00
Junnan Liu	70f2c963d3	[Feature] Support Omni-Math (#1837 ) * support omni-math * update config * upload README * Delete opencompass/configs/datasets/omni_math/__init__.py --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-23 18:36:54 +08:00
Linchen Xiao	35ec307c6b	[Bump] Bump version to 0.4.0 (#1838 )	2025-01-22 11:41:46 +08:00
Linchen Xiao	03415b2a66	[Fix] Update max_out_len logic for OpenAI model (#1839 )	2025-01-21 15:46:14 +08:00
Linchen Xiao	a6193b4c02	[Refactor] Code refactorization (#1831 ) * Update * fix lint * update * fix lint	2025-01-20 19:17:38 +08:00
Jishnu Nair	ffdc917523	[Doc] Installation.md update (#1830 )	2025-01-17 11:08:09 +08:00
Myhs_phz	70da9b7776	[Update] Update method to add dataset in docs (#1827 ) * create new branch * docs new_dataset.md zh * docs new_dataset.md zh and en	2025-01-17 11:07:19 +08:00
Linchen Xiao	531643e771	[Feature] Add support for InternLM3 (#1829 ) * update * update * update * update	2025-01-16 14:28:27 +08:00
Alexander Lam	7f2aeeff26	added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings (#1815 )	2025-01-10 18:20:25 +08:00
zhulinJulia24	121d482378	[CI] Fix path conflict (#1814 ) * update * Update pr-run-test.yml * update	2025-01-09 20:16:08 +08:00
zhulinJulia24	abdcee68f6	[CI] Update daily test metrics threshold (#1812 ) * Update daily-run-test.yml * Update pr-run-test.yml * update * update * update * updaet * update * update * update * update * update * update * update --------- Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>	2025-01-09 18:16:24 +08:00
Zhao Qihao	e039f3efa0	[Feature] Support MMLU-CF Benchmark (#1775 ) * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * Update mmlu-cf * Update mmlu-cf * Update mmlu-cf * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * Remove outside configs --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-09 14:11:20 +08:00
Songyang Zhang	f1e50d4bf0	[Update] Update LiveMathBench (#1809 ) * Update LiveMathBench * Update New O1 Evaluation * Update O1 evaluation	2025-01-07 19:16:12 +08:00
Songyang Zhang	8fdb72f567	[Update] Update o1 eval prompt (#1806 ) * Update XML prediction post-process * Update LiveMathBench * Update LiveMathBench * Update New O1 Evaluation	2025-01-07 00:14:32 +08:00
Alexander Lam	f871e80887	[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802 ) * added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer; * added bradleyterry subjective evaluation method to arena_hard dataset	2025-01-03 16:33:43 +08:00
Linchen Xiao	117dc500ad	[Feature] Add Longbenchv2 support (#1801 ) * Create eval_longbenchv2.py * Create longbenchv2_gen.py * Update __init__.py * Create longbenchv2.py * Update datasets_info.py * update * update * update * update * update * update --------- Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>	2025-01-03 12:04:29 +08:00
Linchen Xiao	f3220438bc	[BUMP] Bump version to 0.3.9 (#1790 )	2024-12-31 16:52:47 +08:00
liushz	9c980cbc62	[Feature] Add LiveStemBench Dataset (#1794 ) * [Fix] Fix vllm max_seq_len parameter transfer * [Fix] Fix vllm max_seq_len parameter transfer * Add livestembench dataset * Add livestembench dataset * Add livestembench dataset * Update livestembench_gen_3e3c50.py * Update eval_livestembench.py * Update eval_livestembench.py	2024-12-31 15:17:39 +08:00
Songyang Zhang	fc0556ec8e	[Fix] Fix generic_llm_evaluator output_path (#1798 ) * Fix output_path * Add Logger	2024-12-31 13:05:05 +08:00
Alexander Lam	dc6035cfcb	[Feature] Added Bradley-Terry subjective evaluation	2024-12-31 11:01:23 +08:00
Songyang Zhang	98435dd98e	[Feature] Update o1 evaluation with JudgeLLM (#1795 ) * Update Generic LLM Evaluator * Update o1 style evaluator	2024-12-30 17:31:00 +08:00
Junnan Liu	8e8d4f1c64	[Feature] Support G-Pass@k and LiveMathBench (#1772 ) * support G-Pass@k and livemathbench * fix bugs * fix comments of GPassKEvaluator * update saved details of GPassKEvaluator * update saved details of GPassKEvaluator * fix eval api configs & update openai_api for ease of debugging * update huggingface path * fix method name of G-Pass@k * fix default value of eval_model_name * refactor G-Pass@k evaluator * log generation params for each backend * fix evaluation resume * add notimplementerror	2024-12-30 16:59:39 +08:00
Linchen Xiao	42b54d6bb8	[Update] Add 0shot CoT config for TheoremQA (#1783 )	2024-12-27 16:17:27 +08:00
bittersweet1999	357ce8c7a4	[Fix] Fix model summarizer abbr (#1789 ) * fix pip version * fix pip version * fix model summarizer abbr --------- Co-authored-by: root <bittersweet1999>	2024-12-27 14:45:08 +08:00
Linchen Xiao	ae9efb73ad	[CI] Pypi deploy workflow update (#1786 )	2024-12-27 14:08:37 +08:00
Linchen Xiao	f103e90764	[CI] Update deploy python version (#1784 )	2024-12-27 13:35:36 +08:00
zhulinJulia24	ebeb578fbf	[ci] remove daily step retry and update pr score (#1782 ) [ci] remove daily step retry	2024-12-26 16:51:26 +08:00
Linchen Xiao	56eaac6d8f	[Update] Volc status exception handle (#1780 ) * update * update	2024-12-26 15:43:24 +08:00
zhulinJulia24	c48bbde26f	[ci] remove testcase into volc engine (#1777 ) * update * update * update * update * update * update * updaste * update * update * update * update * update * update * update * updaste * update * update * update * update * update * update * update * update * update * Update daily-run-test.yml * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update * update	2024-12-25 17:26:50 +08:00
Linchen Xiao	ebefffed61	[Update] Update OC academic 202412 (#1771 ) * [Update] Update academic settings * Update * update	2024-12-19 18:07:34 +08:00
Chang Lan	d70100cdf2	[Update] Customizable tokenizer for RULER (#1731 ) * Customizable tokenizer for RULER * Relax requirements	2024-12-19 18:02:11 +08:00
Junnan Liu	499302857f	[Fix] Fix Local Runner Params Save Path (#1768 ) * update local runner params save dir * fix remove * fix directory remove * Fix *_params.py by uuid4	2024-12-19 16:07:34 +08:00
Mashiro	9a5adbde6a	[Fix] Fix lark reporter issue (#1769 )	2024-12-18 19:33:06 +08:00
zhulinJulia24	111f817e04	[ci] add fullbench testcase (#1766 ) add volc testcase	2024-12-18 13:24:28 +08:00
bittersweet1999	38dba9919b	[Fix] Fix Subjective summarizer order error (#1767 ) * fix pip version * fix pip version * fix order error	2024-12-18 13:21:31 +08:00
Linchen Xiao	d593bfeac8	[Bump] Bump version to 0.3.8 (#1765 ) * [Bump] Bump version to 0.3.8 * Update README.md	2024-12-17 19:17:18 +08:00
Linchen Xiao	eadbdcb4cb	[Update] Update requirement and deepseek configurations (#1764 )	2024-12-17 10:16:47 +08:00
liushz	5c8e91f329	[Fix] Fix vllm max_seq_len parameter transfer (#1745 ) * [Fix] Fix vllm max_seq_len parameter transfer * [Fix] Fix vllm max_seq_len parameter transfer * Update pr-run-test.yml * Update pr-run-test.yml --------- Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>	2024-12-16 21:44:36 +08:00
Alexander Lam	1bd594fc62	[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751 ) * fix lint issues * updated gitignore * changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py * added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model	2024-12-16 13:41:28 +08:00


953 Commits