Hoter Young
c7e89aa3db
[Feature] Support answer extraction of QwQ when evaluating HuStandardFIB ( #36 )
2025-02-15 12:09:54 +08:00
Hoter Young
9676d99787
[Feature] Support answer extraction of QwQ when evaluating HuMatchingFIB ( #35 )
2025-02-14 22:36:11 +08:00
Hoter Young
362b281e55
[Feature] Support 3 models ( #34 )
...
opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_llama_70b_instruct.py
opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_qwen_14b_instruct.py
opencompass/configs/models/hf_llama/llama3_3_70b_api_siliconflow.py
2025-02-14 22:01:16 +08:00
Hoter Young
879b181c1b
add some features ( #32 )
...
* [Feature] Support answer extraction of QwQ when evaluating HuSimpleQA
* [Feature] Support multi-language summarization in HuSimpleQASummarizer
* [Feature] Support DeepSeek-R1-Distill-Qwen_32B_turbomind
2025-02-14 20:44:53 +08:00
Hoter Young
b6c8165ca3
[Feature] Support answer extraction of QwQ when evaluating HuProverbRea_OE ( #31 )
2025-02-14 10:16:05 +08:00
Hoter Young
0971777348
[Feature] Support DeepSeek-R1-Distill-Qwen_32B ( #30 )
2025-02-13 21:42:16 +08:00
Hoter Young
f92a1e5050
[Feature] Support DeepSeek-R1 API from SenseTime ( #29 )
2025-02-13 20:50:57 +08:00
Hoter Young
4114079aed
[Fix] Fix HuSimpleQASummarizer bug ( #28 )
2025-02-13 11:28:49 +08:00
Hoter Young
6f5c16edc5
[Chores] do some minor changes to HuLifeQA ( #27 )
...
1. enlarge token size
2. add two r1 distill models
2025-02-12 21:43:11 +08:00
hoteryoung
23210e089a
[Refactor] Change HuSimpleQA to subjective evaluation
2025-02-12 20:25:03 +08:00
wujiang
60ab611ecd
set deepseek r1 batchsize = 1
2025-02-11 21:55:51 +08:00
wujiang
e261a76e07
set reasoning model max_out_len = 8192
2025-02-11 16:51:05 +08:00
weixingjian
cb664d0cea
add hu prompt for HuMatchingFIB task
2025-02-11 12:20:22 +08:00
wujiang
b4ecd718a0
update examples and configs
2025-02-10 23:08:43 +08:00
wujiang
f55810ae48
[Update] OpenHuEval examples
2025-02-10 23:08:43 +08:00
wujiang
1e1acf9236
add HuSimpleQA
2025-02-10 21:22:45 +08:00
wujiang
5741e38310
rename models
2025-02-10 17:24:24 +08:00
hoteryoung
c3b0803013
support deepseek-r1-distill-qwen-7b and -llama-8b
2025-02-10 17:24:24 +08:00
hoteryoung
f2c17190c9
enable tested reasoning model
2025-02-10 16:51:48 +08:00
wujiang
61ceb02c23
minor
2025-02-07 18:51:14 +08:00
weixingjian
9ae714a577
update hustandard and eval details using data version 250205
2025-02-07 18:51:14 +08:00
weixingjian
9395dc2b60
update humatching and eval details using data version 250205
2025-02-07 14:52:51 +08:00
wujiang
8ec47e2b93
add openai model
2025-02-07 14:43:53 +08:00
wujiang
08712f49f2
update HuProverb config and eval
2025-02-04 16:10:50 +08:00
wujiang
7586186897
add deepseek api models
2025-02-04 15:07:34 +08:00
wujiang
3c93a98e91
update HuLifeQA
2025-02-04 12:24:35 +08:00
gaojunyuan
f152ccf127
add HuProverbRea dataset (20250203)
2025-02-04 11:06:10 +08:00
wujiang
794ab7c372
add & update openai models
2025-02-02 15:53:55 +08:00
wujiang
2abf6ca795
update HuMatchingFIB
2025-02-02 14:48:58 +08:00
wujiang
273e609b53
update hu_matching_fib_250126
2025-02-02 13:48:40 +08:00
Hoter Young
3939915349
[Update] Update HuLifeQA primary tags ( #6 )
2025-02-01 14:18:05 +08:00
wujiang
d4df622e02
update HuMatchingFIB config and dataset
2025-01-26 13:48:35 +08:00
Hoter Young
116a24632c
[Feature] Add OpenHuEval-HuLifeQA ( #4 )
2025-01-24 10:32:17 +08:00
WayneWei
5f72e96d5b
add HuStandardFIB under new paradigm ( #3 )
...
Co-authored-by: weixingjian <weixingjian@pjlab.org.cn>
2025-01-22 19:32:44 +08:00
weixingjian
6527fdf70a
add HuMatchingFIB under new paradigm
2025-01-22 19:32:44 +08:00
Linchen Xiao
35ec307c6b
[Bump] Bump version to 0.4.0 ( #1838 )
2025-01-22 11:41:46 +08:00
Linchen Xiao
03415b2a66
[Fix] Update max_out_len logic for OpenAI model ( #1839 )
2025-01-21 15:46:14 +08:00
Linchen Xiao
a6193b4c02
[Refactor] Code refactoring ( #1831 )
...
* Update
* fix lint
2025-01-20 19:17:38 +08:00
Jishnu Nair
ffdc917523
[Doc] Installation.md update ( #1830 )
2025-01-17 11:08:09 +08:00
Myhs_phz
70da9b7776
[Update] Update method to add dataset in docs ( #1827 )
...
* create new branch
* docs new_dataset.md zh
* docs new_dataset.md zh and en
2025-01-17 11:07:19 +08:00
Linchen Xiao
531643e771
[Feature] Add support for InternLM3 ( #1829 )
...
* update
2025-01-16 14:28:27 +08:00
Alexander Lam
7f2aeeff26
Added predicted win rate reporting to Bradley-Terry subjective eval methods, with an option to switch between win rates and Elo ratings ( #1815 )
2025-01-10 18:20:25 +08:00
zhulinJulia24
121d482378
[CI] Fix path conflict ( #1814 )
...
* update
* Update pr-run-test.yml
2025-01-09 20:16:08 +08:00
zhulinJulia24
abdcee68f6
[CI] Update daily test metrics threshold ( #1812 )
...
* Update daily-run-test.yml
* Update pr-run-test.yml
* update
---------
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-01-09 18:16:24 +08:00
Zhao Qihao
e039f3efa0
[Feature] Support MMLU-CF Benchmark ( #1775 )
...
* [Feature] Support MMLU-CF Benchmark
* Update mmlu-cf
* Remove outside configs
---------
Co-authored-by: liushz <qq1791167085@163.com>
2025-01-09 14:11:20 +08:00
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench ( #1809 )
...
* Update LiveMathBench
* Update New O1 Evaluation
* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
8fdb72f567
[Update] Update o1 eval prompt ( #1806 )
...
* Update XML prediction post-process
* Update LiveMathBench
* Update New O1 Evaluation
2025-01-07 00:14:32 +08:00
Alexander Lam
f871e80887
[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset ( #1802 )
...
* Added base_models_abbrs to references (passed from LMEvaluator)
* Added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets
* Added all_scores output files for reference in CompassArenaBradleyTerrySummarizer
* Added bradleyterry subjective evaluation method to arena_hard dataset
2025-01-03 16:33:43 +08:00
Linchen Xiao
117dc500ad
[Feature] Add Longbenchv2 support ( #1801 )
...
* Create eval_longbenchv2.py
* Create longbenchv2_gen.py
* Update __init__.py
* Create longbenchv2.py
* Update datasets_info.py
* update
---------
Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
2025-01-03 12:04:29 +08:00
Linchen Xiao
f3220438bc
[BUMP] Bump version to 0.3.9 ( #1790 )
2024-12-31 16:52:47 +08:00