Commit Graph

7 Commits

Author SHA1 Message Date
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench (#1809)
* Update LiveMathBench

* Update New O1 Evaluation

* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM (#1795)
* Update Generic LLM Evaluator

* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
Junnan Liu
8e8d4f1c64
[Feature] Support G-Pass@k and LiveMathBench (#1772)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror
2024-12-30 16:59:39 +08:00
Songyang Zhang
0d8df541bc
[Update] Update O1-style Benchmark and Prompts (#1742)
* Update JuderBench

* Support O1-style Prompts

* Update Code

* Update OpenAI

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update

* Update

* Update

* Update
2024-12-09 13:48:56 +08:00
Junnan Liu
f333be177c
[Update] Add MATH500 & AIME2024 to LiveMathBench (#1741)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500

* update std computation & k-pass computation

* add AIME224

* update README
2024-12-06 14:36:49 +08:00
Junnan Liu
6181ac1122
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation (#1730)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500
2024-12-05 16:54:16 +08:00
Junnan Liu
fe6d76fb13
[Feature] Support LiveMathBench (#1727) 2024-11-30 00:07:19 +08:00