OpenCompass

mirror of https://github.com/open-compass/opencompass.git synced 2025-05-30 16:03:24 +08:00

Author	SHA1	Message	Date
shijinpjlab	828fb745c9	[Dataset] Update dingo 1.5.0 (#2008 ) Co-authored-by: shiin <shijin@pjlab.org.cn>	2025-04-07 17:21:15 +08:00
zhulinJulia24	f982d6278e	[CI] fix baseline score (#2000 ) * update * update * update * update * update * update * update * updaste * update * update * updaste * updaste * update * update * update * update * update * update * update * update	2025-04-03 19:32:36 +08:00
Myhs_phz	9b489e9ea0	[Update] Revert math500 dataset configs (#1998 )	2025-04-03 15:11:02 +08:00
Linchen Xiao	dc8deb6af0	[BUMP] Bump version to 0.4.2 (#1997 )	2025-04-02 17:47:15 +08:00
liushz	32d6859679	[Feature] Add olymmath dataset (#1982 ) * Add olymmath dataset * Add olymmath dataset * Add olymmath dataset * Update olymmath dataset	2025-04-02 17:34:07 +08:00
Linchen Xiao	f66b0b347a	[Update] Requirements update (#1993 )	2025-04-02 12:03:45 +08:00
Dongsheng Zhu	330a6e5ca7	[Update] Add Intervl-8b&38b model configs (#1978 )	2025-04-01 11:51:37 +08:00
Linchen Xiao	0f46c35211	[Bug] Aime2024 config fix (#1974 ) * [Bug] Aime2024 config fix * fix	2025-03-25 17:57:11 +08:00
Myhs_phz	6118596362	[Feature] Add recommendation configs for datasets (#1937 ) * feat datasetrefine drop * fix datasets in fullbench_int3 * fix * fix * back * fix * fix and doc * feat * fix hook * fix * fix * fix * fix * fix * fix * fix * fix * fix * doc * fix * fix * Update dataset-index.yml	2025-03-25 14:54:13 +08:00
Linchen Xiao	07930b854a	[Update] Add Korbench config with no max_out_len (#1968 ) * Add Korbench no max_out_len * Add Korbench no max_out_len	2025-03-24 18:38:06 +08:00
Myhs_phz	37307fa996	[Update] Add QWQ32b model config (#1959 ) * feat qwq-32b * fix * feat phi_4 --------- Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>	2025-03-24 14:51:39 +08:00
Linchen Xiao	db96161a4e	[Update] Add SuperGPQA subset metrics (#1966 )	2025-03-24 14:25:12 +08:00
Linchen Xiao	aa05993922	[Update] Add dataset configurations of no max_out_len (#1967 ) * [Update] Add dataset configurations of no max_out_len * update test torch version * update test torch version * update test torch version * update test torch version	2025-03-24 14:24:12 +08:00
Linchen Xiao	64128916d0	[Update] Increase memory size for CPU job of VOLC Runner (#1962 ) * [Update] Increase memory size for CPU job of VOLC Runner * [Update] Increase memory size for CPU job of VOLC Runner	2025-03-24 11:21:14 +08:00
Dongsheng Zhu	8a5029b121	[Feature] Add MultiPL-E & Code Evaluator (#1963 ) * multiple_code develop * multiple_code update * comments upadate * index upadate	2025-03-21 20:09:25 +08:00
Linchen Xiao	b9de8b0e2b	[Update] Unset disallowed_special token for Openai model (#1960 )	2025-03-18 20:24:07 +08:00
Songyang Zhang	c98599271b	[Update] Update OlympiadBench and Update LLM Judge (#1954 )	2025-03-18 20:15:20 +08:00
Jason Cheung	5d2d253d83	[BUG] Fix model_kwargs pass logic for vllm (#1958 )	2025-03-18 20:08:15 +08:00
Linchen Xiao	0b7f76e193	[Bug] Fix Summarizer logic (#1953 )	2025-03-17 18:25:08 +08:00
Yufeng Zhao	15c825a51a	[Update] Bbeh harmony summarizer added (#1951 ) * bbeh * bbeh * fix_smallbugs_bbeh * removeprint * harmonic * update_summerizer * harmonic-tested * harmonic-tested * clean * clean * cleaned_rebased --------- Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>	2025-03-17 17:19:56 +08:00
Linchen Xiao	854c6bf025	[Update] Update requirement and base evaluator	2025-03-13 20:52:50 +08:00
Linchen Xiao	1c60e3a0f6	[Update] Add configurations for llmjudge dataset (#1940 ) * Add configurations for llmjudge dataset * update	2025-03-13 17:30:04 +08:00
liushz	709bc4af0e	[Update] Add AIME2025 oss info (#1936 ) * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * update dataset path * Update olmpiadBench * Update olmpiadBench * Update olmpiadBench * Add HLE dataset * Add HLE dataset * Add HLE dataset * Add AIME2025 oss info --------- Co-authored-by: sudanl <sudanl@foxmail.com>	2025-03-12 18:41:16 +08:00
Yufeng Zhao	bc2969dba8	[Feature] Add support for BBEH dataset (#1925 ) * bbeh * bbeh * fix_smallbugs_bbeh * removeprint * results --------- Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>	2025-03-12 10:53:31 +08:00
Kangreen	59e49aedf1	[Feature] Support SuperGPQA (#1924 ) * support supergpqa * remove unnecessary code * remove unnecessary code * Add Readme * Add Readme * fix lint * fix lint * update * update --------- Co-authored-by: mkj3085003 <mkj3085003@gmail.com> Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>	2025-03-11 19:32:08 +08:00
Linchen Xiao	e403fd21be	[Fix] Fix math-verify evaluator (#1917 ) * update * update * update	2025-03-11 17:35:04 +08:00
Linchen Xiao	cbf84fb33c	[Feature] Update LLM Evaluation for MMLU-Pro (#1923 )	2025-03-07 21:01:20 +08:00
Myhs_phz	570c30cf1b	[Fix] Fix CLI option for results persistence (#1920 ) * fix * fix * fix	2025-03-07 18:24:30 +08:00
Myhs_phz	1585c0adbe	[Feature] Evaluation Results Persistence (#1894 ) * feat results_station.py * lint * feat save_to_station * feat result_station.py and lint * feat * fix * fix and lint * fix * fix subjective processing * fix * fix * style function name * lint	2025-03-05 18:33:34 +08:00
Dongsheng Zhu	fff2d51440	[Update] Code evaluation alignment (#1909 ) * code alignment * update oss md5 * bigcodebench update * lint * lint_ * lint yapf	2025-03-04 18:49:38 +08:00
Linchen Xiao	5547fd1592	[Bump] Bump version to 0.4.1	2025-03-04 18:26:14 +08:00
liushz	198c08632e	[Feature] Add HLE (Humanity's Last Exam) dataset (#1902 ) * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * update dataset path * Update olmpiadBench * Update olmpiadBench * Update olmpiadBench * Add HLE dataset * Add HLE dataset * Add HLE dataset --------- Co-authored-by: sudanl <sudanl@foxmail.com>	2025-03-04 16:42:37 +08:00
Songyang Zhang	c84bc18ac1	[Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard (#1899 ) * [Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard with LLM Verify * Update * Update * Update DeepSeek-R1 example * Update DeepSeek-R1 example * Update DeepSeek-R1 example	2025-03-03 18:56:11 +08:00
Junnan Liu	f0809fe6f6	[Update] Fix Hard Configs With General GPassK (#1906 ) * support dataset repeat and g-pass compute for each evaluator * fix pre-commit errors * delete print * delete gpassk_evaluator and fix potential errors * change `repeat` to `n` * fix `repeat` to `n` in openicl_eval * update doc for multi-run and g-pass * update latex equation in doc * update eng doc for multi-run and g-pass * update datasets.md * update datasets.md * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation in zh_cn user_guides * mmodify pre-commit-zh-cn * recover pre-commit and edit math expr in doc * del [TIP] * del cite tag in doc * del extract_model param in livemathbench config * fix livemathbench hard configs	2025-03-03 18:17:15 +08:00
Linchen Xiao	6a573f671b	[Fix] Fix compatible issue	2025-03-03 15:35:57 +08:00
Junnan Liu	73c80953c6	[Feature] Support Dataset Repeat and G-Pass Compute for Each Evaluator (#1886 ) * support dataset repeat and g-pass compute for each evaluator * fix pre-commit errors * delete print * delete gpassk_evaluator and fix potential errors * change `repeat` to `n` * fix `repeat` to `n` in openicl_eval * update doc for multi-run and g-pass * update latex equation in doc * update eng doc for multi-run and g-pass * update datasets.md * update datasets.md * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation * fix multi-line equation in zh_cn user_guides * mmodify pre-commit-zh-cn * recover pre-commit and edit math expr in doc * del [TIP] * del cite tag in doc * del extract_model param in livemathbench config	2025-02-26 19:43:12 +08:00
Linchen Xiao	bdb2d46f59	[Feature] Add general math, llm judge evaluator (#1892 ) * update_doc * update llm_judge * update README * update md file name	2025-02-26 15:08:50 +08:00
Songyang Zhang	fd6fbf01a2	[Update] Support AIME-24 Evaluation for DeepSeek-R1 series (#1888 ) * Update * Update * Update * Update	2025-02-25 20:34:41 +08:00
Junnan Liu	22a33d8759	[Update] Update LiveMathBench Hard Configs (#1826 ) * support G-Pass@k and livemathbench * fix bugs * fix comments of GPassKEvaluator * update saved details of GPassKEvaluator * update saved details of GPassKEvaluator * fix eval api configs & update openai_api for ease of debugging * update huggingface path * fix method name of G-Pass@k * fix default value of eval_model_name * refactor G-Pass@k evaluator * log generation params for each backend * fix evaluation resume * add notimplementerror * update livemathbench-hard configs * remove max_out_len from livemathbench_hard_greedy_gen_9befbf.py * remove max_out_len from livemathbench_hard_gen_9befbf.py * rename livemathbench_hard_gen_9befbf.py to livemathbench_hard_gen_353ae7.py * rename livemathbench_hard_greedy_gen_9befbf.py to livemathbench_hard_greedy_gen_353ae7.py * update livemathbench_gen_9befbf.py * remove whitespace * upload livemathbench hard configs	2025-02-25 17:24:36 +08:00
Dongsheng Zhu	465e93e10e	[Update] Academic bench llm judge update (#1876 ) * BigCodeBench update * update LCBench * update LCBench 2 * update code * academicBench update * academic bench ifeval&math update * generic_llmjudge_aime_academic_postprocess delete * aime delete * postprocessors update * ifeval delete * update work_dir * linting * linting double-quote-string-fixer * r1-distill out_len update * fix lint --------- Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>	2025-02-24 15:45:24 +08:00
Junnan Liu	046b6f75c6	[Update] Update Greedy Config & README of LiveMathBench (#1862 ) * support omni-math * update config * upload README * Delete opencompass/configs/datasets/omni_math/__init__.py * update greedy config & README of LiveMathBench * update intro for max_out_len * rename livemathbench greedy confi * delete greedy config --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-02-20 19:47:04 +08:00
Linchen Xiao	d7daee6e25	[Update] OpenAI model update, bigcodebench update (#1879 ) * [Update] Openai model update, bigcodebench update * update	2025-02-20 19:33:25 +08:00
Linchen Xiao	27c916661d	[Feature] Math Verify with model post_processor (#1881 ) * update * [Feature] Update model post_processor * update * update * update	2025-02-20 19:32:12 +08:00
zhulinJulia24	bc22749fd8	[CI] update daily test scores (#1870 ) * update * Update daily-run-test.yml * Update dlc.py	2025-02-20 14:08:18 +08:00
bittersweet1999	f407930475	[Feature] Support subjective evaluation for reasoning model (#1868 ) * fix pip version * fix pip version * add subeval for reasoning model * add subeval for reasoning model * update configs * update config * update config * update config * update files	2025-02-20 12:19:46 +08:00
Dongsheng Zhu	3fd8b4e0cd	[Update] Update BigCodeBench & LCBench load path (#1857 ) * BigCodeBench update * update LCBench * update LCBench 2 * update code	2025-02-08 15:15:47 +08:00
Shudong Liu	412199f802	[Feature] Support OlympiadBench Benchmark (#1841 ) * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * Support OlympiadBench Benchmark * update dataset path * Update olmpiadBench * Update olmpiadBench * Update olmpiadBench --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-24 10:00:01 +08:00
Junnan Liu	70f2c963d3	[Feature] Support Omni-Math (#1837 ) * support omni-math * update config * upload README * Delete opencompass/configs/datasets/omni_math/__init__.py --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-23 18:36:54 +08:00
Linchen Xiao	35ec307c6b	[Bump] Bump version to 0.4.0 (#1838 )	2025-01-22 11:41:46 +08:00
Linchen Xiao	03415b2a66	[Fix] Update max_out_len logic for OpenAI model (#1839 )	2025-01-21 15:46:14 +08:00
Linchen Xiao	a6193b4c02	[Refactor] Code refactoarization (#1831 ) * Update * fix lint * update * fix lint	2025-01-20 19:17:38 +08:00
Linchen Xiao	531643e771	[Feature] Add support for InternLM3 (#1829 ) * update * update * update * update	2025-01-16 14:28:27 +08:00
Alexander Lam	7f2aeeff26	added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings (#1815 )	2025-01-10 18:20:25 +08:00
Zhao Qihao	e039f3efa0	[Feature] Support MMLU-CF Benchmark (#1775 ) * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * Update mmlu-cf * Update mmlu-cf * Update mmlu-cf * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * [Feature] Support MMLU-CF Benchmark * Remove outside configs --------- Co-authored-by: liushz <qq1791167085@163.com>	2025-01-09 14:11:20 +08:00
Songyang Zhang	f1e50d4bf0	[Update] Update LiveMathBench (#1809 ) * Update LiveMathBench * Update New O1 Evaluation * Update O1 evaluation	2025-01-07 19:16:12 +08:00
Songyang Zhang	8fdb72f567	[Update] Update o1 eval prompt (#1806 ) * Update XML prediction post-process * Update LiveMathBench * Update LiveMathBench * Update New O1 Evaluation	2025-01-07 00:14:32 +08:00
Alexander Lam	f871e80887	[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802 ) * added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer; * added bradleyterry subjective evaluation method to arena_hard dataset	2025-01-03 16:33:43 +08:00
Linchen Xiao	117dc500ad	[Feature] Add Longbenchv2 support (#1801 ) * Create eval_longbenchv2.py * Create longbenchv2_gen.py * Update __init__.py * Create longbenchv2.py * Update datasets_info.py * update * update * update * update * update * update --------- Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>	2025-01-03 12:04:29 +08:00
Linchen Xiao	f3220438bc	[BUMP] Bump version to 0.3.9 (#1790 )	2024-12-31 16:52:47 +08:00
liushz	9c980cbc62	[Feature] Add LiveStemBench Dataset (#1794 ) * [Fix] Fix vllm max_seq_len parameter transfer * [Fix] Fix vllm max_seq_len parameter transfer * Add livestembench dataset * Add livestembench dataset * Add livestembench dataset * Update livestembench_gen_3e3c50.py * Update eval_livestembench.py * Update eval_livestembench.py	2024-12-31 15:17:39 +08:00
Songyang Zhang	fc0556ec8e	[Fix] Fix generic_llm_evaluator output_path (#1798 ) * Fix output_path * Add Logger	2024-12-31 13:05:05 +08:00
Alexander Lam	dc6035cfcb	[Feature] Added Bradley-Terry subjective evaluation	2024-12-31 11:01:23 +08:00
Songyang Zhang	98435dd98e	[Feature] Update o1 evaluation with JudgeLLM (#1795 ) * Update Generic LLM Evaluator * Update o1 style evaluator	2024-12-30 17:31:00 +08:00
Junnan Liu	8e8d4f1c64	[Feature] Support G-Pass@k and LiveMathBench (#1772 ) * support G-Pass@k and livemathbench * fix bugs * fix comments of GPassKEvaluator * update saved details of GPassKEvaluator * update saved details of GPassKEvaluator * fix eval api configs & update openai_api for ease of debugging * update huggingface path * fix method name of G-Pass@k * fix default value of eval_model_name * refactor G-Pass@k evaluator * log generation params for each backend * fix evaluation resume * add notimplementerror	2024-12-30 16:59:39 +08:00
Linchen Xiao	42b54d6bb8	[Update] Add 0shot CoT config for TheoremQA (#1783 )	2024-12-27 16:17:27 +08:00
bittersweet1999	357ce8c7a4	[Fix] Fix model summarizer abbr (#1789 ) * fix pip version * fix pip version * fix model summarizer abbr --------- Co-authored-by: root <bittersweet1999>	2024-12-27 14:45:08 +08:00
Linchen Xiao	56eaac6d8f	[Update] Volc status exception handle (#1780 ) * update * update	2024-12-26 15:43:24 +08:00
Linchen Xiao	ebefffed61	[Update] Update OC academic 202412 (#1771 ) * [Update] Update academic settings * Update * update	2024-12-19 18:07:34 +08:00
Chang Lan	d70100cdf2	[Update] Customizable tokenizer for RULER (#1731 ) * Customizable tokenizer for RULER * Relax requirements	2024-12-19 18:02:11 +08:00
Junnan Liu	499302857f	[Fix] Fix Local Runner Params Save Path (#1768 ) * update local runner params save dir * fix remove * fix directory remove * Fix *_params.py by uuid4	2024-12-19 16:07:34 +08:00
Mashiro	9a5adbde6a	[Fix] Fix lark reporter issue (#1769 )	2024-12-18 19:33:06 +08:00
bittersweet1999	38dba9919b	[Fix] Fix Subjective summarizer order error (#1767 ) * fix pip version * fix pip version * fix order error	2024-12-18 13:21:31 +08:00
Linchen Xiao	d593bfeac8	[Bump] Bump version to 0.3.8 (#1765 ) * [Bump] Bump version to 0.3.8 * Update README.md	2024-12-17 19:17:18 +08:00
Linchen Xiao	eadbdcb4cb	[Update] Update requirement and deepseek configurations (#1764 )	2024-12-17 10:16:47 +08:00
liushz	5c8e91f329	[Fix] Fix vllm max_seq_len parameter transfer (#1745 ) * [Fix] Fix vllm max_seq_len parameter transfer * [Fix] Fix vllm max_seq_len parameter transfer * Update pr-run-test.yml * Update pr-run-test.yml --------- Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>	2024-12-16 21:44:36 +08:00
Alexander Lam	1bd594fc62	[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751 ) * fix lint issues * updated gitignore * changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py * added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model	2024-12-16 13:41:28 +08:00
zhulinJulia24	aeded4c4db	add new dataset summerizer (#1758 ) add new dataset summerizer	2024-12-13 09:50:43 +08:00
zhulinJulia24	a1c00cc8b7	[ci] add common_summarizer return (#1724 ) * Update common_summarizer.py * Update common_summarizer.py	2024-12-11 20:38:32 +08:00
liushz	c4ce0174fe	[Fix] Fix ChineseSimpleQA max_out_len (#1757 ) * add chinese simpleqa config * add chinese simpleqa config * add chinese simpleqa config * add chinese simpleqa config * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * pdate Csimpleqa * pdate Csimpleqa * Update Csimpleqa --------- Co-authored-by: 明念 <heyancheng.hyc@taobao.com>	2024-12-11 19:51:27 +08:00
Linchen Xiao	bd7b705be4	[Update] Update dataset configuration with no max_out_len (#1754 )	2024-12-11 18:20:29 +08:00
OpenStellarTeam	1a5b3fc11e	Add Chinese SimpleQA config (#1697 ) * add chinese simpleqa config * add chinese simpleqa config * add chinese simpleqa config * add chinese simpleqa config * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * Update CsimpleQA * pdate Csimpleqa --------- Co-authored-by: 明念 <heyancheng.hyc@taobao.com> Co-authored-by: liushz <qq1791167085@163.com>	2024-12-11 18:03:39 +08:00
Linchen Xiao	0d26b348e4	[Feature] Add OC academic 2412 (#1750 )	2024-12-10 21:53:06 +08:00
bittersweet1999	54c0fb7a93	[Change] Change Compassarena metric (#1749 ) * fix pip version * fix pip version * fix summarizer bug * fix compassarena * fix compassarena * fix compassarena	2024-12-10 14:45:32 +08:00
Songyang Zhang	0d8df541bc	[Update] Update O1-style Benchmark and Prompts (#1742 ) * Update JuderBench * Support O1-style Prompts * Update Code * Update OpenAI * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update * Update * Update * Update	2024-12-09 13:48:56 +08:00
Junnan Liu	f333be177c	[Update] Add MATH500 & AIME2024 to LiveMathBench (#1741 ) * upload dataset definitions & configs * add single dataset split specific metrics * add k-pass@threshold & MATH500 * update std computation & k-pass computation * add AIME224 * update README	2024-12-06 14:36:49 +08:00
bittersweet1999	08d63b5bf3	[Fix] Fix error in subjective default summarizer (#1740 ) * fix pip version * fix pip version * fix summarizer bug	2024-12-06 11:03:53 +08:00
Songyang Zhang	fb43dd1906	[Update] Update Skywork/Qwen-QwQ (#1728 ) * Update JuderBench * Support O1-style Prompts * Update Code * Update OpenAI * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update BigCodeBench * Update	2024-12-05 19:30:43 +08:00
Junnan Liu	6181ac1122	[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation (#1730 ) * upload dataset definitions & configs * add single dataset split specific metrics * add k-pass@threshold & MATH500	2024-12-05 16:54:16 +08:00
Linchen Xiao	ac23f0ce1f	[Update] Update init file for Korbench (#1737 )	2024-12-05 11:26:00 +08:00
Yufeng Zhao	4d773904d4	[Update] Korbench readme supplementation (#1734 ) * renewed * readme --------- Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>	2024-12-05 11:24:35 +08:00
Linchen Xiao	a011be6798	[Feature] DLC runner Lark report (#1735 ) * [Bump] Bump version to 0.3.7 * DLC lark report update	2024-12-04 18:03:12 +08:00
Linchen Xiao	e2a290fd46	[Bump] Bump version to 0.3.7 (#1733 )	2024-12-03 19:34:57 +08:00
Yufeng Zhao	98c4666d65	[Update] Update Korbench dataset abbr (#1729 ) Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>	2024-12-02 16:20:58 +08:00
Linchen Xiao	9de27b4d85	[Update] Update max_out_len for datasets (#1726 ) * [Update] Update max_out_len for datasets * Update eval_regression_chat_objective_fullbench.py * Update eval_regression_chat.py * Update eval_regression_chat.py * Update oc_score_baseline_fullbench.yaml --------- Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>	2024-12-02 11:42:07 +08:00
Junnan Liu	fe6d76fb13	[Feature] Support LiveMathBench (#1727 )	2024-11-30 00:07:19 +08:00
liushz	b063779034	[Fix] Update P-MMEVAL OSS data (#1722 ) * Update with PMMEval * Update * Update __init__.py * Fix Bugs * Delete .pre-commit-config.yaml * Pull merge * Fix pmmeval_gen config * Update P-MMEVAL data --------- Co-authored-by: wanyu <wanyu2018umac@gmail.com> Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>	2024-11-28 20:55:46 +08:00
liushz	c437135fad	[Feature] Add Openai Simpleqa dataset (#1720 ) * Add Openai SimpleQA dataset * Add Openai SimpleQA dataset * Add Openai SimpleQA dataset * Update eval_simpleqa.py --------- Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>	2024-11-28 19:16:07 +08:00
liushz	06ab27861e	[Fix] Fix pmmeval_gen config (#1719 ) * Update with PMMEval * Update * Update __init__.py * Fix Bugs * Delete .pre-commit-config.yaml * Pull merge * Fix pmmeval_gen config --------- Co-authored-by: wanyu <wanyu2018umac@gmail.com> Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>	2024-11-28 11:53:36 +08:00
wanyu2018umac	90efcf2216	[Feature] Add P-MMEval (#1714 ) * Update with PMMEval * Update * Update __init__.py * Fix Bugs * Delete .pre-commit-config.yaml * Pull merge --------- Co-authored-by: liushz <qq1791167085@163.com>	2024-11-27 21:26:18 +08:00
Junnan Liu	f7dbe6bb7d	[Feature] Add Arc Prize Public Evaluation (#1690 ) * support arc prize * update arc-prize dataset info & update arc-prize evaluation performance	2024-11-27 15:44:41 +08:00

682 Commits