Hoter Young | 6f5c16edc5 | 2025-02-12 21:43:11 +08:00
    [Chores] do some minor changes to HuLifeQA (#27)
      1. enlarge token size
      2. add two r1 distill models
hoteryoung | 23210e089a | 2025-02-12 20:25:03 +08:00
    [Refactor] Change HuSimpleQA to subjective evaluation

wujiang | 60ab611ecd | 2025-02-11 21:55:51 +08:00
    set deepseek r1 batchsize = 1

wujiang | e261a76e07 | 2025-02-11 16:51:05 +08:00
    set reasoning model max_out_len = 8192

weixingjian | cb664d0cea | 2025-02-11 12:20:22 +08:00
    add hu prompt for HuMatchingFIB task

wujiang | b4ecd718a0 | 2025-02-10 23:08:43 +08:00
    update examples and configs
wujiang | f55810ae48 | 2025-02-10 23:08:43 +08:00
    [Update] OpenHuEval examples

wujiang | 1e1acf9236 | 2025-02-10 21:22:45 +08:00
    add HuSimpleQA

wujiang | 5741e38310 | 2025-02-10 17:24:24 +08:00
    rename models

hoteryoung | c3b0803013 | 2025-02-10 17:24:24 +08:00
    support deepseek-r1-distill-qwen-7b and -llama-8b

hoteryoung | f2c17190c9 | 2025-02-10 16:51:48 +08:00
    enable tested reasoning model

wujiang | 61ceb02c23 | 2025-02-07 18:51:14 +08:00
    minor
weixingjian | 9ae714a577 | 2025-02-07 18:51:14 +08:00
    update hustandard and eval details using data version 250205

weixingjian | 9395dc2b60 | 2025-02-07 14:52:51 +08:00
    update humatching and eval details using data version 250205

wujiang | 8ec47e2b93 | 2025-02-07 14:43:53 +08:00
    add openai model

wujiang | 08712f49f2 | 2025-02-04 16:10:50 +08:00
    update HuProverb config and eval

wujiang | 7586186897 | 2025-02-04 15:07:34 +08:00
    add deepseek api models

wujiang | 3c93a98e91 | 2025-02-04 12:24:35 +08:00
    update HuLifeQA
gaojunyuan | f152ccf127 | 2025-02-04 11:06:10 +08:00
    add HuProverbRea dataset (20250203)

wujiang | 794ab7c372 | 2025-02-02 15:53:55 +08:00
    add & update openai models

wujiang | 2abf6ca795 | 2025-02-02 14:48:58 +08:00
    update HuMatchingFIB

wujiang | 273e609b53 | 2025-02-02 13:48:40 +08:00
    update hu_matching_fib_250126

Hoter Young | 3939915349 | 2025-02-01 14:18:05 +08:00
    [Update] Update HuLifeQA primary tags (#6)

wujiang | d4df622e02 | 2025-01-26 13:48:35 +08:00
    update HuMatchingFIB config and dataset

Hoter Young | 116a24632c | 2025-01-24 10:32:17 +08:00
    [Feature] Add OpenHuEval-HuLifeQA (#4)
WayneWei | 5f72e96d5b | 2025-01-22 19:32:44 +08:00
    add HuStandardFIB under new paradigm (#3)
    Co-authored-by: weixingjian <weixingjian@pjlab.org.cn>

weixingjian | 6527fdf70a | 2025-01-22 19:32:44 +08:00
    add HuMatchingFIB under new paradigm

Linchen Xiao | 35ec307c6b | 2025-01-22 11:41:46 +08:00
    [Bump] Bump version to 0.4.0 (#1838)

Linchen Xiao | 03415b2a66 | 2025-01-21 15:46:14 +08:00
    [Fix] Update max_out_len logic for OpenAI model (#1839)

Linchen Xiao | a6193b4c02 | 2025-01-20 19:17:38 +08:00
    [Refactor] Code refactoarization (#1831)
      * Update
      * fix lint
      * update
      * fix lint
Jishnu Nair | ffdc917523 | 2025-01-17 11:08:09 +08:00
    [Doc] Installation.md update (#1830)

Myhs_phz | 70da9b7776 | 2025-01-17 11:07:19 +08:00
    [Update] Update method to add dataset in docs (#1827)
      * create new branch
      * docs new_dataset.md zh
      * docs new_dataset.md zh and en

Linchen Xiao | 531643e771 | 2025-01-16 14:28:27 +08:00
    [Feature] Add support for InternLM3 (#1829)
      * update

Alexander Lam | 7f2aeeff26 | 2025-01-10 18:20:25 +08:00
    added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings (#1815)

zhulinJulia24 | 121d482378 | 2025-01-09 20:16:08 +08:00
    [CI] Fix path conflict (#1814)
      * update
      * Update pr-run-test.yml
zhulinJulia24 | abdcee68f6 | 2025-01-09 18:16:24 +08:00
    [CI] Update daily test metrics threshold (#1812)
      * Update daily-run-test.yml
      * Update pr-run-test.yml
      * update
    Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>

Zhao Qihao | e039f3efa0 | 2025-01-09 14:11:20 +08:00
    [Feature] Support MMLU-CF Benchmark (#1775)
      * [Feature] Support MMLU-CF Benchmark
      * Update mmlu-cf
      * Remove outside configs
    Co-authored-by: liushz <qq1791167085@163.com>
Songyang Zhang | f1e50d4bf0 | 2025-01-07 19:16:12 +08:00
    [Update] Update LiveMathBench (#1809)
      * Update LiveMathBench
      * Update New O1 Evaluation
      * Update O1 evaluation

Songyang Zhang | 8fdb72f567 | 2025-01-07 00:14:32 +08:00
    [Update] Update o1 eval prompt (#1806)
      * Update XML prediction post-process
      * Update LiveMathBench
      * Update New O1 Evaluation

Alexander Lam | f871e80887 | 2025-01-03 16:33:43 +08:00
    [Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802)
      * added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer
      * added bradleyterry subjective evaluation method to arena_hard dataset

Linchen Xiao | 117dc500ad | 2025-01-03 12:04:29 +08:00
    [Feature] Add Longbenchv2 support (#1801)
      * Create eval_longbenchv2.py
      * Create longbenchv2_gen.py
      * Update __init__.py
      * Create longbenchv2.py
      * Update datasets_info.py
      * update
    Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
Linchen Xiao | f3220438bc | 2024-12-31 16:52:47 +08:00
    [BUMP] Bump version to 0.3.9 (#1790)

liushz | 9c980cbc62 | 2024-12-31 15:17:39 +08:00
    [Feature] Add LiveStemBench Dataset (#1794)
      * [Fix] Fix vllm max_seq_len parameter transfer
      * Add livestembench dataset
      * Update livestembench_gen_3e3c50.py
      * Update eval_livestembench.py

Songyang Zhang | fc0556ec8e | 2024-12-31 13:05:05 +08:00
    [Fix] Fix generic_llm_evaluator output_path (#1798)
      * Fix output_path
      * Add Logger

Alexander Lam | dc6035cfcb | 2024-12-31 11:01:23 +08:00
    [Feature] Added Bradley-Terry subjective evaluation

Songyang Zhang | 98435dd98e | 2024-12-30 17:31:00 +08:00
    [Feature] Update o1 evaluation with JudgeLLM (#1795)
      * Update Generic LLM Evaluator
      * Update o1 style evaluator
Junnan Liu | 8e8d4f1c64 | 2024-12-30 16:59:39 +08:00
    [Feature] Support G-Pass@k and LiveMathBench (#1772)
      * support G-Pass@k and livemathbench
      * fix bugs
      * fix comments of GPassKEvaluator
      * update saved details of GPassKEvaluator
      * fix eval api configs & update openai_api for ease of debugging
      * update huggingface path
      * fix method name of G-Pass@k
      * fix default value of eval_model_name
      * refactor G-Pass@k evaluator
      * log generation params for each backend
      * fix evaluation resume
      * add notimplementerror

Linchen Xiao | 42b54d6bb8 | 2024-12-27 16:17:27 +08:00
    [Update] Add 0shot CoT config for TheoremQA (#1783)

bittersweet1999 | 357ce8c7a4 | 2024-12-27 14:45:08 +08:00
    [Fix] Fix model summarizer abbr (#1789)
      * fix pip version
      * fix model summarizer abbr
    Co-authored-by: root <bittersweet1999>

Linchen Xiao | ae9efb73ad | 2024-12-27 14:08:37 +08:00
    [CI] Pypi deploy workflow update (#1786)