Commit Graph

894 Commits

Author SHA1 Message Date
Linchen Xiao
dc8deb6af0
[BUMP] Bump version to 0.4.2 (#1997) 2025-04-02 17:47:15 +08:00
liushz
32d6859679
[Feature] Add olymmath dataset (#1982)
* Add olymmath dataset

* Add olymmath dataset

* Add olymmath dataset

* Update olymmath dataset
2025-04-02 17:34:07 +08:00
zhulinJulia24
97236c8e97
[CI] Fix baseline score (#1996)
* update

* update

* update

* update
2025-04-02 14:25:16 +08:00
Linchen Xiao
f66b0b347a
[Update] Requirements update (#1993) 2025-04-02 12:03:45 +08:00
Dongsheng Zhu
330a6e5ca7
[Update] Add Intervl-8b&38b model configs (#1978) 2025-04-01 11:51:37 +08:00
Myhs_phz
f71eb78c72
[Doc] Add TBD Token in Datasets Statistics (#1986)
* feat

* doc

* doc

* doc

* doc
2025-03-31 19:08:55 +08:00
Linchen Xiao
0f46c35211
[Bug] Aime2024 config fix (#1974)
* [Bug] Aime2024 config fix

* fix
2025-03-25 17:57:11 +08:00
Myhs_phz
6118596362
[Feature] Add recommendation configs for datasets (#1937)
* feat datasetrefine drop

* fix datasets in fullbench_int3

* fix

* fix

* back

* fix

* fix and doc

* feat

* fix hook

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* doc

* fix

* fix

* Update dataset-index.yml
2025-03-25 14:54:13 +08:00
Linchen Xiao
07930b854a
[Update] Add Korbench config with no max_out_len (#1968)
* Add Korbench no max_out_len

* Add Korbench no max_out_len
2025-03-24 18:38:06 +08:00
Myhs_phz
37307fa996
[Update] Add QWQ32b model config (#1959)
* feat qwq-32b

* fix

* feat phi_4

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-03-24 14:51:39 +08:00
Linchen Xiao
db96161a4e
[Update] Add SuperGPQA subset metrics (#1966) 2025-03-24 14:25:12 +08:00
Linchen Xiao
aa05993922
[Update] Add dataset configurations of no max_out_len (#1967)
* [Update] Add dataset configurations of no max_out_len

* update test torch version

* update test torch version

* update test torch version

* update test torch version
2025-03-24 14:24:12 +08:00
Linchen Xiao
64128916d0
[Update] Increase memory size for CPU job of VOLC Runner (#1962)
* [Update] Increase memory size for CPU job of VOLC Runner

* [Update] Increase memory size for CPU job of VOLC Runner
2025-03-24 11:21:14 +08:00
Dongsheng Zhu
8a5029b121
[Feature] Add MultiPL-E & Code Evaluator (#1963)
* multiple_code develop

* multiple_code update

* comments update

* index update
2025-03-21 20:09:25 +08:00
Linchen Xiao
b9de8b0e2b
[Update] Unset disallowed_special token for Openai model (#1960) 2025-03-18 20:24:07 +08:00
Songyang Zhang
c98599271b
[Update] Update OlympiadBench and Update LLM Judge (#1954) 2025-03-18 20:15:20 +08:00
Jason Cheung
5d2d253d83
[BUG] Fix model_kwargs pass logic for vllm (#1958) 2025-03-18 20:08:15 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic (#1953) 2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] Bbeh harmony summarizer added (#1951)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* harmonic

* update_summerizer

* harmonic-tested

* harmonic-tested

* clean

* clean

* cleaned_rebased

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
854c6bf025
[Update] Update requirement and base evaluator 2025-03-13 20:52:50 +08:00
Linchen Xiao
1c60e3a0f6
[Update] Add configurations for llmjudge dataset (#1940)
* Add configurations for llmjudge dataset

* update
2025-03-13 17:30:04 +08:00
liushz
709bc4af0e
[Update] Add AIME2025 oss info (#1936)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

* Add AIME2025 oss info

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-12 18:41:16 +08:00
Yufeng Zhao
bc2969dba8
[Feature] Add support for BBEH dataset (#1925)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* results

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-12 10:53:31 +08:00
Kangreen
59e49aedf1
[Feature] Support SuperGPQA (#1924)
* support supergpqa

* remove unnecessary code

* remove unnecessary code

* Add Readme

* Add Readme

* fix lint

* fix lint

* update

* update

---------

Co-authored-by: mkj3085003 <mkj3085003@gmail.com>
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-03-11 19:32:08 +08:00
Linchen Xiao
e403fd21be
[Fix] Fix math-verify evaluator (#1917)
* update

* update

* update
2025-03-11 17:35:04 +08:00
Linchen Xiao
cbf84fb33c
[Feature] Update LLM Evaluation for MMLU-Pro (#1923) 2025-03-07 21:01:20 +08:00
Myhs_phz
570c30cf1b
[Fix] Fix CLI option for results persistence (#1920)
* fix

* fix

* fix
2025-03-07 18:24:30 +08:00
Shudong Liu
277d7946f5
[Fix] Fix typo in deepseed_r1.md (#1916) 2025-03-05 19:37:22 +08:00
Myhs_phz
1585c0adbe
[Feature] Evaluation Results Persistence (#1894)
* feat results_station.py

* lint

* feat save_to_station

* feat result_station.py and lint

* feat

* fix

* fix and lint

* fix

* fix subjective processing

* fix

* fix

* style function name

* lint
2025-03-05 18:33:34 +08:00
Myhs_phz
54324657f0
[Docs] Results persistance (#1908)
* feat persistance.md

* doc

* doc

* lint

* doc

* fix

* doc
2025-03-05 18:23:54 +08:00
Dongsheng Zhu
fff2d51440
[Update] Code evaluation alignment (#1909)
* code alignment

* update oss md5

* bigcodebench update

* lint

* lint_

* lint yapf
2025-03-04 18:49:38 +08:00
Linchen Xiao
5547fd1592
[Bump] Bump version to 0.4.1 2025-03-04 18:26:14 +08:00
liushz
198c08632e
[Feature] Add HLE (Humanity's Last Exam) dataset (#1902)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-04 16:42:37 +08:00
Songyang Zhang
c84bc18ac1
[Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard (#1899)
* [Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard with LLM Verify

* Update

* Update

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example
2025-03-03 18:56:11 +08:00
Junnan Liu
f0809fe6f6
[Update] Fix Hard Configs With General GPassK (#1906)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* modify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config

* fix livemathbench hard configs
2025-03-03 18:17:15 +08:00
Linchen Xiao
6a573f671b
[Fix] Fix compatible issue 2025-03-03 15:35:57 +08:00
Junnan Liu
73c80953c6
[Feature] Support Dataset Repeat and G-Pass Compute for Each Evaluator (#1886)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* modify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config
2025-02-26 19:43:12 +08:00
zhulinJulia24
6042b88e58
[CI] update dailytest scheduler and baseline's score (#1898) 2025-02-26 19:04:01 +08:00
Linchen Xiao
bdb2d46f59
[Feature] Add general math, llm judge evaluator (#1892)
* update_doc

* update llm_judge

* update README

* update md file name
2025-02-26 15:08:50 +08:00
Songyang Zhang
fd6fbf01a2
[Update] Support AIME-24 Evaluation for DeepSeek-R1 series (#1888)
* Update

* Update

* Update

* Update
2025-02-25 20:34:41 +08:00
Junnan Liu
22a33d8759
[Update] Update LiveMathBench Hard Configs (#1826)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror

* update livemathbench-hard configs

* remove max_out_len from livemathbench_hard_greedy_gen_9befbf.py

* remove max_out_len from livemathbench_hard_gen_9befbf.py

* rename livemathbench_hard_gen_9befbf.py to livemathbench_hard_gen_353ae7.py

* rename livemathbench_hard_greedy_gen_9befbf.py to livemathbench_hard_greedy_gen_353ae7.py

* update livemathbench_gen_9befbf.py

* remove whitespace

* upload livemathbench hard configs
2025-02-25 17:24:36 +08:00
Dongsheng Zhu
465e93e10e
[Update] Academic bench llm judge update (#1876)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code

* academicBench update

* academic bench ifeval&math update

* generic_llmjudge_aime_academic_postprocess delete

* aime delete

* postprocessors update

* ifeval delete

* update work_dir

* linting

* linting double-quote-string-fixer

* r1-distill out_len update

* fix lint

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-02-24 15:45:24 +08:00
Junnan Liu
046b6f75c6
[Update] Update Greedy Config & README of LiveMathBench (#1862)
* support omni-math

* update config

* upload README

* Delete opencompass/configs/datasets/omni_math/__init__.py

* update greedy config & README of LiveMathBench

* update intro for max_out_len

* rename livemathbench greedy config

* delete greedy config

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-02-20 19:47:04 +08:00
Linchen Xiao
d7daee6e25
[Update] OpenAI model update, bigcodebench update (#1879)
* [Update] Openai model update, bigcodebench update

* update
2025-02-20 19:33:25 +08:00
Linchen Xiao
27c916661d
[Feature] Math Verify with model post_processor (#1881)
* update

* [Feature] Update model post_processor

* update

* update

* update
2025-02-20 19:32:12 +08:00
zhulinJulia24
bc22749fd8
[CI] update daily test scores (#1870)
* update

* Update daily-run-test.yml

* Update dlc.py
2025-02-20 14:08:18 +08:00
bittersweet1999
f407930475
[Feature] Support subjective evaluation for reasoning model (#1868)
* fix pip version

* fix pip version

* add subeval for reasoning model

* add subeval for reasoning model

* update configs

* update config

* update config

* update config

* update files
2025-02-20 12:19:46 +08:00
Myhs_phz
68a9838907
[Feature] Add list of supported datasets at html page (#1850)
* feat dataset-index.yml and stat.py

* fix

* fix

* fix

* feat url of paper and config file

* doc all supported dataset list

* docs zh and en

* docs README zh and en

* docs new_dataset

* docs new_dataset
2025-02-14 16:17:30 +08:00
Dongsheng Zhu
3fd8b4e0cd
[Update] Update BigCodeBench & LCBench load path (#1857)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code
2025-02-08 15:15:47 +08:00
Pablo Hinojosa
9c2e6a192c
[Fix] Update broken links in README.md (#1852) 2025-02-07 15:41:08 +08:00