Commit Graph

132 Commits

Author SHA1 Message Date
Hoter Young
362b281e55
[Feature] Support 3 models (#34)
opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_llama_70b_instruct.py
opencompass/configs/models/deepseek/lmdeploy_deepseek_r1_distill_qwen_14b_instruct.py
opencompass/configs/models/hf_llama/llama3_3_70b_api_siliconflow.py
2025-02-14 22:01:16 +08:00
Hoter Young
879b181c1b
add some features (#32)
* [Feature] Support answer extraction of QwQ when evaluating HuSimpleQA

* [Feature] Support multi-language summarization in HuSimpleQASummarizer

* [Feature] Support DeepSeek-R1-Distill-Qwen_32B_turbomind
2025-02-14 20:44:53 +08:00
Hoter Young
0971777348
[Feature] Support DeepSeek-R1-Distill-Qwen_32B (#30) 2025-02-13 21:42:16 +08:00
Hoter Young
f92a1e5050
[Feature] Support DeepSeek-R1 API from SenseTime (#29) 2025-02-13 20:50:57 +08:00
Hoter Young
6f5c16edc5
[Chores] do some minor changes to HuLifeQA (#27)
1. enlarge token size
2. add two r1 distill models
2025-02-12 21:43:11 +08:00
hoteryoung
23210e089a [Refactor] Change HuSimpleQA to subjective evaluation 2025-02-12 20:25:03 +08:00
wujiang
60ab611ecd set deepseek r1 batchsize = 1 2025-02-11 21:55:51 +08:00
wujiang
e261a76e07 set reasoning model max_out_len = 8192 2025-02-11 16:51:05 +08:00
weixingjian
cb664d0cea add hu prompt for HuMatchingFIB task 2025-02-11 12:20:22 +08:00
wujiang
b4ecd718a0 update examples and configs 2025-02-10 23:08:43 +08:00
wujiang
f55810ae48 [Update] OpenHuEval examples 2025-02-10 23:08:43 +08:00
wujiang
1e1acf9236 add HuSimpleQA 2025-02-10 21:22:45 +08:00
wujiang
5741e38310 rename models 2025-02-10 17:24:24 +08:00
hoteryoung
c3b0803013 support deepseek-r1-distill-qwen-7b and -llama-8b 2025-02-10 17:24:24 +08:00
hoteryoung
f2c17190c9 enable tested reasoning model 2025-02-10 16:51:48 +08:00
weixingjian
9ae714a577 update hustandard and eval details using data version 250205 2025-02-07 18:51:14 +08:00
weixingjian
9395dc2b60 update humatching and eval details using data version 250205 2025-02-07 14:52:51 +08:00
wujiang
8ec47e2b93 add openai model 2025-02-07 14:43:53 +08:00
wujiang
08712f49f2 update HuProverb config and eval 2025-02-04 16:10:50 +08:00
wujiang
7586186897 add deepseek api models 2025-02-04 15:07:34 +08:00
gaojunyuan
f152ccf127 add HuProverbRea dataset (20250203) 2025-02-04 11:06:10 +08:00
wujiang
794ab7c372 add & update openai models 2025-02-02 15:53:55 +08:00
wujiang
2abf6ca795 update HuMatchingFIB 2025-02-02 14:48:58 +08:00
wujiang
273e609b53 update hu_matching_fib_250126 2025-02-02 13:48:40 +08:00
Hoter Young
3939915349
[Update] Update HuLifeQA primary tags (#6) 2025-02-01 14:18:05 +08:00
wujiang
d4df622e02 update HuMatchingFIB config and dataset 2025-01-26 13:48:35 +08:00
Hoter Young
116a24632c
[Feature] Add OpenHuEval-HuLifeQA (#4) 2025-01-24 10:32:17 +08:00
WayneWei
5f72e96d5b add HuStandardFIB under new paradigm (#3)
Co-authored-by: weixingjian <weixingjian@pjlab.org.cn>
2025-01-22 19:32:44 +08:00
weixingjian
6527fdf70a add HuMatchingFIB under new paradigm 2025-01-22 19:32:44 +08:00
Linchen Xiao
a6193b4c02
[Refactor] Code refactorization (#1831)
* Update

* fix lint

* update

* fix lint
2025-01-20 19:17:38 +08:00
Linchen Xiao
531643e771
[Feature] Add support for InternLM3 (#1829)
* update
2025-01-16 14:28:27 +08:00
Zhao Qihao
e039f3efa0
[Feature] Support MMLU-CF Benchmark (#1775)
* [Feature] Support MMLU-CF Benchmark

* Update mmlu-cf

* [Feature] Support MMLU-CF Benchmark

* Remove outside configs

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-09 14:11:20 +08:00
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench (#1809)
* Update LiveMathBench

* Update New O1 Evaluation

* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
8fdb72f567
[Update] Update o1 eval prompt (#1806)
* Update XML prediction post-process

* Update LiveMathBench

* Update New O1 Evaluation
2025-01-07 00:14:32 +08:00
Alexander Lam
f871e80887
[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802)
* added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;

* added bradleyterry subjective evaluation method to arena_hard dataset
2025-01-03 16:33:43 +08:00
Linchen Xiao
117dc500ad
[Feature] Add Longbenchv2 support (#1801)
* Create eval_longbenchv2.py

* Create longbenchv2_gen.py

* Update __init__.py

* Create longbenchv2.py

* Update datasets_info.py

* update

---------

Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
2025-01-03 12:04:29 +08:00
liushz
9c980cbc62
[Feature] Add LiveStemBench Dataset (#1794)
* [Fix] Fix vllm max_seq_len parameter transfer

* Add livestembench dataset

* Update livestembench_gen_3e3c50.py

* Update eval_livestembench.py
2024-12-31 15:17:39 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation 2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM (#1795)
* Update Generic LLM Evaluator

* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
Junnan Liu
8e8d4f1c64
[Feature] Support G-Pass@k and LiveMathBench (#1772)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add NotImplementedError
2024-12-30 16:59:39 +08:00
Linchen Xiao
42b54d6bb8
[Update] Add 0shot CoT config for TheoremQA (#1783) 2024-12-27 16:17:27 +08:00
Linchen Xiao
ebefffed61
[Update] Update OC academic 202412 (#1771)
* [Update] Update academic settings

* Update

* update
2024-12-19 18:07:34 +08:00
Chang Lan
d70100cdf2
[Update] Customizable tokenizer for RULER (#1731)
* Customizable tokenizer for RULER

* Relax requirements
2024-12-19 18:02:11 +08:00
Linchen Xiao
eadbdcb4cb
[Update] Update requirement and deepseek configurations (#1764) 2024-12-17 10:16:47 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751)
* fix lint issues

* updated gitignore

* changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py)

* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
liushz
c4ce0174fe
[Fix] Fix ChineseSimpleQA max_out_len (#1757)
* add chinese simpleqa config

* Update CsimpleQA

* Update Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
2024-12-11 19:51:27 +08:00
Linchen Xiao
bd7b705be4
[Update] Update dataset configuration with no max_out_len (#1754) 2024-12-11 18:20:29 +08:00
OpenStellarTeam
1a5b3fc11e
Add Chinese SimpleQA config (#1697)
* add chinese simpleqa config

* Update CsimpleQA

* Update Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-12-11 18:03:39 +08:00
Linchen Xiao
0d26b348e4
[Feature] Add OC academic 2412 (#1750) 2024-12-10 21:53:06 +08:00
bittersweet1999
54c0fb7a93
[Change] Change Compassarena metric (#1749)
* fix pip version

* fix summarizer bug

* fix compassarena
2024-12-10 14:45:32 +08:00