Myhs_phz
75e7834b59
[Feature] Add Datasets: ClimateQA,Physics ( #2017 )
...
* feat ClimateQA
* feat PHYSICS
* fix
* fix
* fix
* fix
2025-04-14 20:18:47 +08:00
Linchen Xiao
6a6a1a5c0b
[Feature] LLM Judge sanity check ( #2012 )
...
* update
* update
2025-04-11 19:01:39 +08:00
bittersweet1999
3f50b1dc49
[Fix] fix order bug Update arena_hard.py ( #2015 )
2025-04-11 16:59:40 +08:00
Junnan Liu
20660ab507
[Fix] Fix compare error when k is list in base_evaluator ( #2010 )
...
* fix gpass compare error of list k
* fix compare error in 177
2025-04-10 19:47:21 +08:00
Linchen Xiao
12213207b6
[Refactor] Refactorize openicl eval task ( #1990 )
...
* [Refactor] Refactorize openicl eval task
* update
2025-04-09 15:52:23 +08:00
zhulinJulia24
6ac9b06bc2
[ci] update baseline for kernel change of vllm and lmdeploy ( #2011 )
...
* update
* update
* update
* update
* update
* update
* update
2025-04-09 14:09:35 +08:00
Linchen Xiao
a05f9da134
[Feature] Make dump-eval-details default behavior ( #1999 )
...
* Update
* update
* update
2025-04-08 14:42:26 +08:00
Myhs_phz
fd82bea747
[Fix] OpenICL Math Evaluator Config ( #2007 )
...
* fix
* fix recommended
* fix
* fix
* fix
* fix
2025-04-08 14:38:35 +08:00
Linchen Xiao
bb58cfc85d
[Feature] Add CascadeEvaluator ( #1992 )
...
* [Feature] Add CascadeEvaluator
* update
* update
2025-04-08 11:58:14 +08:00
Jin Ye
b564e608b1
[Dataset] Add MedXpertQA ( #2002 )
...
* Add MedXpertQA
* Add MedXpertQA
* Add MedXpertQA
* Fix lint
---------
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-04-08 10:44:48 +08:00
shijinpjlab
828fb745c9
[Dataset] Update dingo 1.5.0 ( #2008 )
...
Co-authored-by: shiin <shijin@pjlab.org.cn>
2025-04-07 17:21:15 +08:00
zhulinJulia24
f982d6278e
[CI] fix baseline score ( #2000 )
...
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
2025-04-03 19:32:36 +08:00
Myhs_phz
3a9a384173
[Doc] Fix links between zh & en ( #2001 )
...
* test
* test
* test
2025-04-03 17:37:53 +08:00
Myhs_phz
9b489e9ea0
[Update] Revert math500 dataset configs ( #1998 )
2025-04-03 15:11:02 +08:00
Linchen Xiao
dc8deb6af0
[BUMP] Bump version to 0.4.2 ( #1997 )
2025-04-02 17:47:15 +08:00
liushz
32d6859679
[Feature] Add olymmath dataset ( #1982 )
...
* Add olymmath dataset
* Add olymmath dataset
* Add olymmath dataset
* Update olymmath dataset
2025-04-02 17:34:07 +08:00
zhulinJulia24
97236c8e97
[CI] Fix baseline score ( #1996 )
...
* update
* update
* update
* update
2025-04-02 14:25:16 +08:00
Linchen Xiao
f66b0b347a
[Update] Requirements update ( #1993 )
2025-04-02 12:03:45 +08:00
Dongsheng Zhu
330a6e5ca7
[Update] Add Intervl-8b&38b model configs ( #1978 )
2025-04-01 11:51:37 +08:00
Myhs_phz
f71eb78c72
[Doc] Add TBD Token in Datasets Statistics ( #1986 )
...
* feat
* doc
* doc
* doc
* doc
2025-03-31 19:08:55 +08:00
Linchen Xiao
0f46c35211
[Bug] Aime2024 config fix ( #1974 )
...
* [Bug] Aime2024 config fix
* fix
2025-03-25 17:57:11 +08:00
Myhs_phz
6118596362
[Feature] Add recommendation configs for datasets ( #1937 )
...
* feat datasetrefine drop
* fix datasets in fullbench_int3
* fix
* fix
* back
* fix
* fix and doc
* feat
* fix hook
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* fix
* doc
* fix
* fix
* Update dataset-index.yml
2025-03-25 14:54:13 +08:00
Linchen Xiao
07930b854a
[Update] Add Korbench config with no max_out_len ( #1968 )
...
* Add Korbench no max_out_len
* Add Korbench no max_out_len
2025-03-24 18:38:06 +08:00
Myhs_phz
37307fa996
[Update] Add QWQ32b model config ( #1959 )
...
* feat qwq-32b
* fix
* feat phi_4
---------
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-03-24 14:51:39 +08:00
Linchen Xiao
db96161a4e
[Update] Add SuperGPQA subset metrics ( #1966 )
2025-03-24 14:25:12 +08:00
Linchen Xiao
aa05993922
[Update] Add dataset configurations of no max_out_len ( #1967 )
...
* [Update] Add dataset configurations of no max_out_len
* update test torch version
* update test torch version
* update test torch version
* update test torch version
2025-03-24 14:24:12 +08:00
Linchen Xiao
64128916d0
[Update] Increase memory size for CPU job of VOLC Runner ( #1962 )
...
* [Update] Increase memory size for CPU job of VOLC Runner
* [Update] Increase memory size for CPU job of VOLC Runner
2025-03-24 11:21:14 +08:00
Dongsheng Zhu
8a5029b121
[Feature] Add MultiPL-E & Code Evaluator ( #1963 )
...
* multiple_code develop
* multiple_code update
* comments update
* index update
2025-03-21 20:09:25 +08:00
Linchen Xiao
b9de8b0e2b
[Update] Unset disallowed_special token for Openai model ( #1960 )
2025-03-18 20:24:07 +08:00
Songyang Zhang
c98599271b
[Update] Update OlympiadBench and Update LLM Judge ( #1954 )
2025-03-18 20:15:20 +08:00
Jason Cheung
5d2d253d83
[BUG] Fix model_kwargs pass logic for vllm ( #1958 )
2025-03-18 20:08:15 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic ( #1953 )
2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] Bbeh harmony summarizer added ( #1951 )
...
* bbeh
* bbeh
* fix_smallbugs_bbeh
* removeprint
* harmonic
* update_summerizer
* harmonic-tested
* harmonic-tested
* clean
* clean
* cleaned_rebased
---------
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
854c6bf025
[Update] Update requirement and base evaluator
2025-03-13 20:52:50 +08:00
Linchen Xiao
1c60e3a0f6
[Update] Add configurations for llmjudge dataset ( #1940 )
...
* Add configurations for llmjudge dataset
* update
2025-03-13 17:30:04 +08:00
liushz
709bc4af0e
[Update] Add AIME2025 oss info ( #1936 )
...
* Support OlympiadBench Benchmark
* Support OlympiadBench Benchmark
* Support OlympiadBench Benchmark
* update dataset path
* Update OlympiadBench
* Update OlympiadBench
* Update OlympiadBench
* Add HLE dataset
* Add HLE dataset
* Add HLE dataset
* Add AIME2025 oss info
---------
Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-12 18:41:16 +08:00
Yufeng Zhao
bc2969dba8
[Feature] Add support for BBEH dataset ( #1925 )
...
* bbeh
* bbeh
* fix_smallbugs_bbeh
* removeprint
* results
---------
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-12 10:53:31 +08:00
Kangreen
59e49aedf1
[Feature] Support SuperGPQA ( #1924 )
...
* support supergpqa
* remove unnecessary code
* remove unnecessary code
* Add Readme
* Add Readme
* fix lint
* fix lint
* update
* update
---------
Co-authored-by: mkj3085003 <mkj3085003@gmail.com>
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-03-11 19:32:08 +08:00
Linchen Xiao
e403fd21be
[Fix] Fix math-verify evaluator ( #1917 )
...
* update
* update
* update
2025-03-11 17:35:04 +08:00
Linchen Xiao
cbf84fb33c
[Feature] Update LLM Evaluation for MMLU-Pro ( #1923 )
2025-03-07 21:01:20 +08:00
Myhs_phz
570c30cf1b
[Fix] Fix CLI option for results persistence ( #1920 )
...
* fix
* fix
* fix
2025-03-07 18:24:30 +08:00
Shudong Liu
277d7946f5
[Fix] Fix typo in deepseed_r1.md ( #1916 )
2025-03-05 19:37:22 +08:00
Myhs_phz
1585c0adbe
[Feature] Evaluation Results Persistence ( #1894 )
...
* feat results_station.py
* lint
* feat save_to_station
* feat result_station.py and lint
* feat
* fix
* fix and lint
* fix
* fix subjective processing
* fix
* fix
* style function name
* lint
2025-03-05 18:33:34 +08:00
Myhs_phz
54324657f0
[Docs] Results persistence ( #1908 )
...
* feat persistance.md
* doc
* doc
* lint
* doc
* fix
* doc
2025-03-05 18:23:54 +08:00
Dongsheng Zhu
fff2d51440
[Update] Code evaluation alignment ( #1909 )
...
* code alignment
* update oss md5
* bigcodebench update
* lint
* lint_
* lint yapf
2025-03-04 18:49:38 +08:00
Linchen Xiao
5547fd1592
[Bump] Bump version to 0.4.1
2025-03-04 18:26:14 +08:00
liushz
198c08632e
[Feature] Add HLE (Humanity's Last Exam) dataset ( #1902 )
...
* Support OlympiadBench Benchmark
* Support OlympiadBench Benchmark
* Support OlympiadBench Benchmark
* update dataset path
* Update OlympiadBench
* Update OlympiadBench
* Update OlympiadBench
* Add HLE dataset
* Add HLE dataset
* Add HLE dataset
---------
Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-04 16:42:37 +08:00
Songyang Zhang
c84bc18ac1
[Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard ( #1899 )
...
* [Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard with LLM Verify
* Update
* Update
* Update DeepSeek-R1 example
* Update DeepSeek-R1 example
* Update DeepSeek-R1 example
2025-03-03 18:56:11 +08:00
Junnan Liu
f0809fe6f6
[Update] Fix Hard Configs With General GPassK ( #1906 )
...
* support dataset repeat and g-pass compute for each evaluator
* fix pre-commit errors
* delete print
* delete gpassk_evaluator and fix potential errors
* change `repeat` to `n`
* fix `repeat` to `n` in openicl_eval
* update doc for multi-run and g-pass
* update latex equation in doc
* update eng doc for multi-run and g-pass
* update datasets.md
* update datasets.md
* fix multi-line equation
* fix multi-line equation
* fix multi-line equation
* fix multi-line equation
* fix multi-line equation
* fix multi-line equation
* fix multi-line equation in zh_cn user_guides
* modify pre-commit-zh-cn
* recover pre-commit and edit math expr in doc
* del [TIP]
* del cite tag in doc
* del extract_model param in livemathbench config
* fix livemathbench hard configs
2025-03-03 18:17:15 +08:00
Linchen Xiao
6a573f671b
[Fix] Fix compatible issue
2025-03-03 15:35:57 +08:00