Compare commits


642 Commits
0.1.9 ... main

Author SHA1 Message Date
Yu Sun
d572761cef
[Dataset] Add Smolinstruct configs (#2127)
* 0-shot Smolinstruct

Add 0-shot evaluation and postprocess functions for Smolinstruct

* fix acc postprocessor

* update 0-shot acc postprocessor

* rename 0-shot
2025-05-29 14:09:08 +08:00
Linchen Xiao
408f5caff4
[Dataset] Add SuperGPQA subfield configs (#2124)
* update

* fix lint

* fix lint

* update precommit

* update precommit

* fix lint
2025-05-28 14:12:58 +08:00
Myhs_phz
6f3c670b99
add qwen3 lmdeploy (#2126) 2025-05-27 19:41:13 +08:00
zhulinJulia24
c3779ebfc1
[ci] update dlc setting (#2112) 2025-05-22 16:47:57 +08:00
Songyang Zhang
aa2b89b6f8
[Update] Add CascadeEvaluator with Data Replica (#2022)
* Update CascadeEvaluator

* Update CascadeEvaluator

* Update CascadeEvaluator

* Update Config

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update
2025-05-20 16:46:55 +08:00
Dongsheng Zhu
7a7a4517ab
[Update] History code bench pass@k update (#2102)
* bigcodebench

* humaneval

* humanevalx

* humanevalx

* livecodebench

* mbpp

* humaneval_plus

* fix bug

* template

* max_out fix

* template update
2025-05-19 17:03:33 +08:00
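
A note on the metric behind this entry: the pass@k numbers reported by these code benchmarks (HumanEval, MBPP, LiveCodeBench, etc.) are usually computed with the unbiased estimator from the HumanEval paper, pass@k = E[1 - C(n-c, k) / C(n, k)]. A minimal sketch of that estimator (illustrative only, not the code changed in this PR):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n = samples generated per problem, c = samples that pass the tests,
        # k = budget being scored. Computes 1 - C(n-c, k) / C(n, k) in a
        # numerically stable product form.
        if n - c < k:
            return 1.0
        return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
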
kkscilife
8c0ccf9a6b
[CI] Fix Lint error (#2103) 2025-05-16 15:36:45 +08:00
kkscilife
6f3b6a5d12
[CI] Add gitleaks check (#2101) 2025-05-16 14:34:57 +08:00
tcheng
3d1760aba2
[Dataset] Add Scieval (#2089)
* style: pass all formatting hooks (yapf & quote fixer)

* revise name:Add Lifescience Sub-set Support for MMLU & SciEval (datasets + configs + loader)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* all categories of SciEval (datasets + configs + loader+dataset-index.yml)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* revise :SciEval 5shot

---------

Co-authored-by: root <tangcheng231@mails.ucas.edu.cn>
2025-05-14 10:25:03 +08:00
Wei Li
b84518c656
[Dataset] Support MedMCQA and MedBullets benchmark (#2054)
* support medmcqa and medbullets benchmark

* Add Medbullets data folder for benchmark support

* revise gen name

* revise config file & remove csv file & add dataset info to dataset-index.yml

* remove csv file

* remove print in medbullets.py

* revise class name

* update_oss_info

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-05-13 17:10:50 +08:00
zhulinJulia24
d60f59dcab
[CI] update baseline and fix lmdeploy version (#2098)
* update

* update

* update

* update

* update

* update
2025-05-13 14:01:47 +08:00
bittersweet1999
9eaa1f6fec
Update icl_judge_evaluator.py (#2095) 2025-05-13 10:44:24 +08:00
Linchen Xiao
d590f557bb
[Update] OpenaiSDK handle empty content (#2096) 2025-05-12 19:38:30 +08:00
yuehua-s
c492e49e79
[Update] Add o4 in OpenaiSDK (#2083)
* feature:1.add o4-mini;2.o3 or o4-mini only support temperature==1

* feature:change 4o-mini to 4o

---------

Co-authored-by: yuehuazhang <yuehuazhang@tencent.com>
2025-05-12 18:39:44 +08:00
Dongsheng Zhu
2c79dc5227
[Dataset] Add human_eval/mbpp pro (#2092)
* add bench

* update

* bug fix

* time update

* add index

* fix repeat bug
2025-05-12 18:38:13 +08:00
huihui1999
345674f700
[Dataset] Add SciknowEval Dataset (#2070)
* first

* first

* first

* first

* SciKnowEval

* fix hash

* fix dataset-index & use official llm_judge_postprocess

* fix dataset-index.yml

* use official llmjudge_postprocess

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

* merge with main

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-05-12 17:23:44 +08:00
Kun Yuan
8aa18df368
[Dataset] HLE Biomedical version support (#2080)
* HLE Biomedical version support

* set up default category value for hle
2025-05-12 10:14:11 +08:00
huihui1999
44a7024ed5
[Dataset] MedCalc_Bench (#2072)
* MedCalc_Bench

* MedCal_Bench

* add hash

* fix hash

* fix comments &dataset-index yml

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-05-09 16:58:55 +08:00
Linchen Xiao
508e2b0cb2
[Update] Set load_from_cache_file to False (#2085) 2025-05-09 15:21:47 +08:00
Kun Yuan
7bdd3c1904
[Dataset] MMLU_Pro Biomedical Version Support (#2081) 2025-05-09 15:07:26 +08:00
Jin Ye
6097186a95
[Datasets] MedQA, ProteinLMBench; Add Models: huatuogpt, baichuanM1 (#2064)
* Add Datasets: MedQA, ProteinLMBench; Add Models: huatuogpt, baichuanM1

* Fix bugs for MedQA. Add info in dataset-index

* Add version code for MedQA and ProteinLMBench

* Add version code for MedQA and ProteinLMBench
2025-05-09 14:47:44 +08:00
Linchen Xiao
d72df59363
[Revert] Add Lifescience Sub-set Support for SciEval (#2059) (#2087)
This reverts commit c5048bfec7.
2025-05-09 14:46:27 +08:00
tcheng
c5048bfec7
[Dataset] Add Lifescience Sub-set Support for SciEval (#2059)
* style: pass all formatting hooks (yapf & quote fixer)

* revise name:Add Lifescience Sub-set Support for MMLU & SciEval (datasets + configs + loader)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

---------

Co-authored-by: root <tangcheng231@mails.ucas.edu.cn>
2025-05-09 14:31:12 +08:00
huihui1999
a7f3ac20b2
[Dataset] Add CARDBiomedBench (#2071)
* CARDBiomedBench

* fix hash

* fix dataset-index

* use official llmjudge postprocess

* use official llmjudge_postprocess

* fix lint

* fix init
2025-05-08 19:44:46 +08:00
Mo Li
ff3275edf0
[Update] Add Long-Context configs for Gemma, OREAL, and Qwen2.5 models (#2048)
* [Update] Update Gemma, Oreal, Qwen Config

* fix lint
2025-05-08 19:06:56 +08:00
Wei Li
a685ed7daf
[Dataset] Add nejm ai benchmark (#2063)
* support nejm ai benchmark

* add dataset files

* revise gen name

* revise gen name

* revise class name & remove csv file & add dataset-index.yml info

* update

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-05-08 16:44:05 +08:00
Jiahao Xu
9ec23c145b
[Datasets] Add ClinicBench, PubMedQA and ScienceQA (#2061)
* Add ClinicBench

* Add PubMedQA & ScienceQA & ClinicBench

* Add PubMedQA & ScienceQA & ClinicBench

* Update datasets_info & hf_path

* Update hf_path
2025-05-08 16:25:43 +08:00
Dongsheng Zhu
ba0e32292c
[Feature] Support InternSandbox (#2049)
* internsandbox init

* internsandbox

* dataset_index

* dataset_index_add
2025-05-07 16:42:09 +08:00
谢昕辰
43b2c4ed76
[Fix] Update lawbench data path (#2037) 2025-05-07 16:18:43 +08:00
Dongsheng Zhu
d62b69aaef
[Fix] Fix InternVL model config (#2068)
* intervl-8b&38b

* intervl adjustment

* internvl fix
2025-05-07 15:51:18 +08:00
Linchen Xiao
af8432e1d6
[Update] OpenAI SDK model reasoning content (#2078)
* update

* update

* update
2025-05-07 14:06:40 +08:00
bittersweet1999
ddc9cc0afb
[Add] add a config to Judge dataset all (#2077)
* fix pip version

* fix pip version

* add judgedatasetall

* add judgedatasetall

* add judgedatasetall
2025-05-07 10:57:23 +08:00
bittersweet1999
37cbaf8d92
[Add] Add Judgerbenchv2 (#2067)
* fix pip version

* fix pip version

* add judgerbenchv2

* Update __init__.py
2025-04-30 17:12:34 +08:00
Taolin Zhang
b6148aa198
add Judgebench (#2066)
* add rewardbench

* add rewardbench

* add rmb datasets

* add rmb datasets

* add judgebench

* add judgebench
2025-04-30 15:01:10 +08:00
bittersweet1999
527a80947b
[Add] Add writingbench (#2028)
* fix pip version

* fix pip version

* add writingbench

* add writingbench

* add writingbench

* add writingbench
2025-04-29 16:29:32 +08:00
Taolin Zhang
8c74e6a39e
add RMB Bench (#2056)
* add rewardbench

* add rewardbench

* add rmb datasets

* add rmb datasets
2025-04-27 16:26:01 +08:00
Linchen Xiao
e8bc8c1e8c
[Bug] Concat OpenaiSDK reasoning content (#2041)
* [Bug] Concat OpenaiSDK reasoning content

* [Bug] Concat OpenaiSDK reasoning content

* update

* update
2025-04-25 14:10:33 +08:00
Junnan Liu
97010dc4ce
[Update] Update dataset repeat concatenation (#2039) 2025-04-23 16:16:28 +08:00
Linchen Xiao
dcbf899369
[Bug] Fix SmolInstruct logger import (#2036) 2025-04-23 11:10:30 +08:00
Linchen Xiao
bf74f26603
[Update] Safe SmolInstruct meteor calculation (#2033) 2025-04-22 18:27:48 +08:00
Linchen Xiao
455bb05d1b
[Update] Update dataset configs (#2030)
* [Update] Update dataset configs

* Fix lint
2025-04-21 18:55:06 +08:00
Taolin Zhang
c69110361b
[Add] add rewardbench (#2029)
* add rewardbench

* add rewardbench
2025-04-21 17:18:51 +08:00
JuchengHu
a2093a81ef
[Dataset] Matbench (#2021)
* add support for matbench

* fix dataset path

* fix data load

* fix

* fix lint

---------

Co-authored-by: Jucheng Hu <jucheng.hu.20@ucl.ac.uk>
Co-authored-by: Myhs-phz <demarcia2014@126.com>
2025-04-21 15:50:47 +08:00
Linchen Xiao
b2da1c08a8
[Dataset] Add SmolInstruct, Update Chembench (#2025)
* [Dataset] Add SmolInstruct, Update Chembench

* Add dataset metadata

* update

* update

* update
2025-04-18 17:21:29 +08:00
Linchen Xiao
65ff602cf5
[Update] Fix LLM Judge metrics calculation & Add reasoning content concat to OpenAI SDK 2025-04-15 11:33:16 +08:00
Myhs_phz
75e7834b59
[Feature] Add Datasets: ClimateQA,Physics (#2017)
* feat ClimateQA

* feat PHYSICS

* fix

* fix

* fix

* fix
2025-04-14 20:18:47 +08:00
Linchen Xiao
6a6a1a5c0b
[Feature] LLM Judge sanity check (#2012)
* update

* update
2025-04-11 19:01:39 +08:00
bittersweet1999
3f50b1dc49
[Fix] fix order bug Update arena_hard.py (#2015) 2025-04-11 16:59:40 +08:00
Junnan Liu
20660ab507
[Fix] Fix compare error when k is list in base_evaluator (#2010)
* fix gpass compare error of list k

* fix compare error in 177
2025-04-10 19:47:21 +08:00
Linchen Xiao
12213207b6
[Refactor] Refactorize openicl eval task (#1990)
* [Refactor] Refactorize openicl eval task

* update
2025-04-09 15:52:23 +08:00
zhulinJulia24
6ac9b06bc2
[ci] update baseline for kernal change of vllm and lmdeploy (#2011)
* update

* update

* update

* update

* update

* update

* update
2025-04-09 14:09:35 +08:00
Linchen Xiao
a05f9da134
[Feature] Make dump-eval-details default behavior (#1999)
* Update

* update

* update
2025-04-08 14:42:26 +08:00
Myhs_phz
fd82bea747
[Fix] OpenICL Math Evaluator Config (#2007)
* fix

* fix recommended

* fix

* fix

* fix

* fix
2025-04-08 14:38:35 +08:00
Linchen Xiao
bb58cfc85d
[Feature] Add CascadeEvaluator (#1992)
* [Feature] Add CascadeEvaluator

* update

* updat
2025-04-08 11:58:14 +08:00
Jin Ye
b564e608b1
[Dataset] Add MedXpertQA (#2002)
* Add MedXpertQA

* Add MedXpertQA

* Add MedXpertQA

* Fix lint

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-04-08 10:44:48 +08:00
shijinpjlab
828fb745c9
[Dataset] Update dingo 1.5.0 (#2008)
Co-authored-by: shiin <shijin@pjlab.org.cn>
2025-04-07 17:21:15 +08:00
zhulinJulia24
f982d6278e
[CI] fix baseline score (#2000)
* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* updaste

* updaste

* update

* update

* update

* update

* update

* update

* update

* update
2025-04-03 19:32:36 +08:00
Myhs_phz
3a9a384173
[Doc] Fix links between zh & en (#2001)
* test

* test

* test
2025-04-03 17:37:53 +08:00
Myhs_phz
9b489e9ea0
[Update] Revert math500 dataset configs (#1998) 2025-04-03 15:11:02 +08:00
Linchen Xiao
dc8deb6af0
[BUMP] Bump version to 0.4.2 (#1997) 2025-04-02 17:47:15 +08:00
liushz
32d6859679
[Feature] Add olymmath dataset (#1982)
* Add olymmath dataset

* Add olymmath dataset

* Add olymmath dataset

* Update olymmath dataset
2025-04-02 17:34:07 +08:00
zhulinJulia24
97236c8e97
[CI] Fix baseline score (#1996)
* update

* update

* update

* update
2025-04-02 14:25:16 +08:00
Linchen Xiao
f66b0b347a
[Update] Requirements update (#1993) 2025-04-02 12:03:45 +08:00
Dongsheng Zhu
330a6e5ca7
[Update] Add InternVL-8b&38b model configs (#1978) 2025-04-01 11:51:37 +08:00
Myhs_phz
f71eb78c72
[Doc] Add TBD Token in Datasets Statistics (#1986)
* feat

* doc

* doc

* doc

* doc
2025-03-31 19:08:55 +08:00
Linchen Xiao
0f46c35211
[Bug] Aime2024 config fix (#1974)
* [Bug] Aime2024 config fix

* fix
2025-03-25 17:57:11 +08:00
Myhs_phz
6118596362
[Feature] Add recommendation configs for datasets (#1937)
* feat datasetrefine drop

* fix datasets in fullbench_int3

* fix

* fix

* back

* fix

* fix and doc

* feat

* fix hook

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* doc

* fix

* fix

* Update dataset-index.yml
2025-03-25 14:54:13 +08:00
Linchen Xiao
07930b854a
[Update] Add Korbench config with no max_out_len (#1968)
* Add Korbench no max_out_len

* Add Korbench no max_out_len
2025-03-24 18:38:06 +08:00
Myhs_phz
37307fa996
[Update] Add QWQ32b model config (#1959)
* feat qwq-32b

* fix

* feat phi_4

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-03-24 14:51:39 +08:00
Linchen Xiao
db96161a4e
[Update] Add SuperGPQA subset metrics (#1966) 2025-03-24 14:25:12 +08:00
Linchen Xiao
aa05993922
[Update] Add dataset configurations of no max_out_len (#1967)
* [Update] Add dataset configurations of no max_out_len

* update test torch version

* update test torch version

* update test torch version

* update test torch version
2025-03-24 14:24:12 +08:00
Linchen Xiao
64128916d0
[Update] Increase memory size for CPU job of VOLC Runner (#1962)
* [Update] Increase memory size for CPU job of VOLC Runner

* [Update] Increase memory size for CPU job of VOLC Runner
2025-03-24 11:21:14 +08:00
Dongsheng Zhu
8a5029b121
[Feature] Add MultiPL-E & Code Evaluator (#1963)
* multiple_code develop

* multiple_code update

* comments upadate

* index upadate
2025-03-21 20:09:25 +08:00
Linchen Xiao
b9de8b0e2b
[Update] Unset disallowed_special token for Openai model (#1960) 2025-03-18 20:24:07 +08:00
Songyang Zhang
c98599271b
[Update] Update OlympiadBench and Update LLM Judge (#1954) 2025-03-18 20:15:20 +08:00
Jason Cheung
5d2d253d83
[BUG] Fix model_kwargs pass logic for vllm (#1958) 2025-03-18 20:08:15 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic (#1953) 2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] BBEH harmonic summarizer added (#1951)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* harmonic

* update_summerizer

* harmonic-tested

* harmonic-tested

* clean

* clean

* cleaned_rebased

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
854c6bf025
[Update] Update requirement and base evaluator 2025-03-13 20:52:50 +08:00
Linchen Xiao
1c60e3a0f6
[Update] Add configurations for llmjudge dataset (#1940)
* Add configurations for llmjudge dataset

* update
2025-03-13 17:30:04 +08:00
liushz
709bc4af0e
[Update] Add AIME2025 oss info (#1936)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

* Add AIME2025 oss info

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-12 18:41:16 +08:00
Yufeng Zhao
bc2969dba8
[Feature] Add support for BBEH dataset (#1925)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* results

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-12 10:53:31 +08:00
Kangreen
59e49aedf1
[Feature] Support SuperGPQA (#1924)
* support supergpqa

* remove unnecessary code

* remove unnecessary code

* Add Readme

* Add Readme

* fix lint

* fix lint

* update

* update

---------

Co-authored-by: mkj3085003 <mkj3085003@gmail.com>
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-03-11 19:32:08 +08:00
Linchen Xiao
e403fd21be
[Fix] Fix math-verify evaluator (#1917)
* update

* update

* update
2025-03-11 17:35:04 +08:00
Linchen Xiao
cbf84fb33c
[Feature] Update LLM Evaluation for MMLU-Pro (#1923) 2025-03-07 21:01:20 +08:00
Myhs_phz
570c30cf1b
[Fix] Fix CLI option for results persistence (#1920)
* fix

* fix

* fix
2025-03-07 18:24:30 +08:00
Shudong Liu
277d7946f5
[Fix] Fix typo in deepseed_r1.md (#1916) 2025-03-05 19:37:22 +08:00
Myhs_phz
1585c0adbe
[Feature] Evaluation Results Persistence (#1894)
* feat results_station.py

* lint

* feat save_to_station

* feat result_station.py and lint

* feat

* fix

* fix and lint

* fix

* fix subjective processing

* fix

* fix

* style function name

* lint
2025-03-05 18:33:34 +08:00
Myhs_phz
54324657f0
[Docs] Results persistence (#1908)
* feat persistance.md

* doc

* doc

* lint

* doc

* fix

* doc
2025-03-05 18:23:54 +08:00
Dongsheng Zhu
fff2d51440
[Update] Code evaluation alignment (#1909)
* code alignment

* update oss md5

* bigcodebench update

* lint

* lint_

* lint yapf
2025-03-04 18:49:38 +08:00
Linchen Xiao
5547fd1592
[Bump] Bump version to 0.4.1 2025-03-04 18:26:14 +08:00
liushz
198c08632e
[Feature] Add HLE (Humanity's Last Exam) dataset (#1902)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-04 16:42:37 +08:00
Songyang Zhang
c84bc18ac1
[Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard (#1899)
* [Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard with LLM Verify

* Update

* Update

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example
2025-03-03 18:56:11 +08:00
Junnan Liu
f0809fe6f6
[Update] Fix Hard Configs With General GPassK (#1906)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* mmodify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config

* fix livemathbench hard configs
2025-03-03 18:17:15 +08:00
Linchen Xiao
6a573f671b
[Fix] Fix compatibility issue 2025-03-03 15:35:57 +08:00
Junnan Liu
73c80953c6
[Feature] Support Dataset Repeat and G-Pass Compute for Each Evaluator (#1886)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* mmodify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config
2025-02-26 19:43:12 +08:00
zhulinJulia24
6042b88e58
[CI] update daily test scheduler and baseline score (#1898) 2025-02-26 19:04:01 +08:00
Linchen Xiao
bdb2d46f59
[Feature] Add general math, llm judge evaluator (#1892)
* update_doc

* update llm_judge

* update README

* update md file name
2025-02-26 15:08:50 +08:00
Songyang Zhang
fd6fbf01a2
[Update] Support AIME-24 Evaluation for DeepSeek-R1 series (#1888)
* Update

* Update

* Update

* Update
2025-02-25 20:34:41 +08:00
Junnan Liu
22a33d8759
[Update] Update LiveMathBench Hard Configs (#1826)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror

* update livemathbench-hard configs

* remove max_out_len from livemathbench_hard_greedy_gen_9befbf.py

* remove max_out_len from livemathbench_hard_gen_9befbf.py

* rename livemathbench_hard_gen_9befbf.py to livemathbench_hard_gen_353ae7.py

* rename livemathbench_hard_greedy_gen_9befbf.py to livemathbench_hard_greedy_gen_353ae7.py

* update livemathbench_gen_9befbf.py

* remove whitespace

* upload livemathbench hard configs
2025-02-25 17:24:36 +08:00
Dongsheng Zhu
465e93e10e
[Update] Academic bench llm judge update (#1876)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code

* academicBench update

* academic bench ifeval&math update

* generic_llmjudge_aime_academic_postprocess delete

* aime delete

* postprocessors update

* ifeval delete

* update work_dir

* linting

* linting double-quote-string-fixer

* r1-distill out_len update

* fix lint

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-02-24 15:45:24 +08:00
Junnan Liu
046b6f75c6
[Update] Update Greedy Config & README of LiveMathBench (#1862)
* support omni-math

* update config

* upload README

* Delete opencompass/configs/datasets/omni_math/__init__.py

* update greedy config & README of LiveMathBench

* update intro for  max_out_len

* rename livemathbench greedy confi

* delete greedy config

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-02-20 19:47:04 +08:00
Linchen Xiao
d7daee6e25
[Update] OpenAI model update, bigcodebench update (#1879)
* [Update] Openai model update, bigcodebench update

* update
2025-02-20 19:33:25 +08:00
Linchen Xiao
27c916661d
[Feature] Math Verify with model post_processor (#1881)
* update

* [Feature] Update model post_processor

* update

* update

* update
2025-02-20 19:32:12 +08:00
zhulinJulia24
bc22749fd8
[CI] update daily test scores (#1870)
* update

* Update daily-run-test.yml

* Update dlc.py
2025-02-20 14:08:18 +08:00
bittersweet1999
f407930475
[Feature] Support subjective evaluation for reasoning model (#1868)
* fix pip version

* fix pip version

* add subeval for reasoning model

* add subeval for reasoning model

* update configs

* update config

* update config

* update config

* update files
2025-02-20 12:19:46 +08:00
Myhs_phz
68a9838907
[Feature] Add list of supported datasets at html page (#1850)
* feat dataset-index.yml and stat.py

* fix

* fix

* fix

* feat url of paper and config file

* doc all supported dataset list

* docs zh and en

* docs README zh and en

* docs new_dataset

* docs new_dataset
2025-02-14 16:17:30 +08:00
Dongsheng Zhu
3fd8b4e0cd
[Update] Update BigCodeBench & LCBench load path (#1857)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code
2025-02-08 15:15:47 +08:00
Pablo Hinojosa
9c2e6a192c
[Fix] Update broken links in README.md (#1852) 2025-02-07 15:41:08 +08:00
zhulinJulia24
ffc04cf650
[CI] Update daily-run-test.yml (#1854) 2025-02-07 14:40:16 +08:00
Linchen Xiao
862bf78464
[Demo] Internlm3 math500 thinking demo (#1846)
* [Demo] Add demo for Internlm3 math500 thinking

* [Demo] Add demo for Internlm3 math500 thinking

* update max_out_len

* update start instruction
2025-01-24 14:56:41 +08:00
Shudong Liu
412199f802
[Feature] Support OlympiadBench Benchmark (#1841)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-24 10:00:01 +08:00
Junnan Liu
70f2c963d3
[Feature] Support Omni-Math (#1837)
* support omni-math

* update config

* upload README

* Delete opencompass/configs/datasets/omni_math/__init__.py

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-23 18:36:54 +08:00
Linchen Xiao
35ec307c6b
[Bump] Bump version to 0.4.0 (#1838) 2025-01-22 11:41:46 +08:00
Linchen Xiao
03415b2a66
[Fix] Update max_out_len logic for OpenAI model (#1839) 2025-01-21 15:46:14 +08:00
Linchen Xiao
a6193b4c02
[Refactor] Code refactorization (#1831)
* Update

* fix lint

* update

* fix lint
2025-01-20 19:17:38 +08:00
Jishnu Nair
ffdc917523
[Doc] Installation.md update (#1830) 2025-01-17 11:08:09 +08:00
Myhs_phz
70da9b7776
[Update] Update method to add dataset in docs (#1827)
* create new branch

* docs new_dataset.md zh

* docs new_dataset.md zh and en
2025-01-17 11:07:19 +08:00
Linchen Xiao
531643e771
[Feature] Add support for InternLM3 (#1829)
* update

* update

* update

* update
2025-01-16 14:28:27 +08:00
Alexander Lam
7f2aeeff26
Added predicted win rate reporting to Bradley-Terry subjective evaluation methods, with an option to switch between win rates and Elo ratings (#1815) 2025-01-10 18:20:25 +08:00
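
For context on the two reporting modes mentioned above: under a Bradley-Terry model the predicted win rate of model i over model j is p_i / (p_i + p_j), and an Elo-style rating is a log-scaled view of the same fitted strength. A minimal sketch of that relationship (illustrative only; the function names are not the repository's API):

    import math

    def bt_win_rate(strength_i: float, strength_j: float) -> float:
        # Bradley-Terry: P(i beats j) = p_i / (p_i + p_j)
        return strength_i / (strength_i + strength_j)

    def bt_to_elo(strength: float, reference: float = 1.0) -> float:
        # Elo-style rating on a 400-point scale; with R = 400*log10(p),
        # 1 / (1 + 10**((R_j - R_i) / 400)) reduces to the win rate above.
        return 400.0 * math.log10(strength / reference)
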
zhulinJulia24
121d482378
[CI] Fix path conflict (#1814)
* update

* Update pr-run-test.yml

* update
2025-01-09 20:16:08 +08:00
zhulinJulia24
abdcee68f6
[CI] Update daily test metrics threshold (#1812)
* Update daily-run-test.yml

* Update pr-run-test.yml

* update

* update

* update

* updaet

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-01-09 18:16:24 +08:00
Zhao Qihao
e039f3efa0
[Feature] Support MMLU-CF Benchmark (#1775)
* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* Update mmlu-cf

* Update mmlu-cf

* Update mmlu-cf

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* Remove outside configs

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-09 14:11:20 +08:00
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench (#1809)
* Update LiveMathBench

* Update New O1 Evaluation

* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
8fdb72f567
[Update] Update o1 eval prompt (#1806)
* Update XML prediction post-process

* Update LiveMathBench

* Update LiveMathBench

* Update New O1 Evaluation
2025-01-07 00:14:32 +08:00
Alexander Lam
f871e80887
[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802)
* added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;

* added bradleyterry subjective evaluation method to arena_hard dataset
2025-01-03 16:33:43 +08:00
Linchen Xiao
117dc500ad
[Feature] Add Longbenchv2 support (#1801)
* Create eval_longbenchv2.py

* Create longbenchv2_gen.py

* Update __init__.py

* Create longbenchv2.py

* Update datasets_info.py

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
2025-01-03 12:04:29 +08:00
Linchen Xiao
f3220438bc
[BUMP] Bump version to 0.3.9 (#1790) 2024-12-31 16:52:47 +08:00
liushz
9c980cbc62
[Feature] Add LiveStemBench Dataset (#1794)
* [Fix] Fix vllm max_seq_len parameter transfer

* [Fix] Fix vllm max_seq_len parameter transfer

* Add livestembench dataset

* Add livestembench dataset

* Add livestembench dataset

* Update livestembench_gen_3e3c50.py

* Update eval_livestembench.py

* Update eval_livestembench.py
2024-12-31 15:17:39 +08:00
Songyang Zhang
fc0556ec8e
[Fix] Fix generic_llm_evaluator output_path (#1798)
* Fix output_path

* Add Logger
2024-12-31 13:05:05 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation 2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM (#1795)
* Update Generic LLM Evaluator

* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
Junnan Liu
8e8d4f1c64
[Feature] Support G-Pass@k and LiveMathBench (#1772)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror
2024-12-30 16:59:39 +08:00
Linchen Xiao
42b54d6bb8
[Update] Add 0shot CoT config for TheoremQA (#1783) 2024-12-27 16:17:27 +08:00
bittersweet1999
357ce8c7a4
[Fix] Fix model summarizer abbr (#1789)
* fix pip version

* fix pip version

* fix model summarizer abbr

---------

Co-authored-by: root <bittersweet1999>
2024-12-27 14:45:08 +08:00
Linchen Xiao
ae9efb73ad
[CI] Pypi deploy workflow update (#1786) 2024-12-27 14:08:37 +08:00
Linchen Xiao
f103e90764
[CI] Update deploy python version (#1784) 2024-12-27 13:35:36 +08:00
zhulinJulia24
ebeb578fbf
[ci] remove daily step retry and update pr score (#1782)
[ci] remove daily step retry
2024-12-26 16:51:26 +08:00
Linchen Xiao
56eaac6d8f
[Update] Volc status exception handle (#1780)
* update

* update
2024-12-26 15:43:24 +08:00
zhulinJulia24
c48bbde26f
[ci] remove testcase into volc engine (#1777)
* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update
2024-12-25 17:26:50 +08:00
Linchen Xiao
ebefffed61
[Update] Update OC academic 202412 (#1771)
* [Update] Update academic settings

* Update

* update
2024-12-19 18:07:34 +08:00
Chang Lan
d70100cdf2
[Update] Customizable tokenizer for RULER (#1731)
* Customizable tokenizer for RULER

* Relax requirements
2024-12-19 18:02:11 +08:00
Junnan Liu
499302857f
[Fix] Fix Local Runner Params Save Path (#1768)
* update local runner params save dir

* fix remove

* fix directory remove

* Fix *_params.py by uuid4
2024-12-19 16:07:34 +08:00
Mashiro
9a5adbde6a
[Fix] Fix lark reporter issue (#1769) 2024-12-18 19:33:06 +08:00
zhulinJulia24
111f817e04
[ci] add fullbench testcase (#1766)
add volc testcase
2024-12-18 13:24:28 +08:00
bittersweet1999
38dba9919b
[Fix] Fix Subjective summarizer order error (#1767)
* fix pip version

* fix pip version

* fix order error
2024-12-18 13:21:31 +08:00
Linchen Xiao
d593bfeac8
[Bump] Bump version to 0.3.8 (#1765)
* [Bump] Bump version to 0.3.8

* Update README.md
2024-12-17 19:17:18 +08:00
Linchen Xiao
eadbdcb4cb
[Update] Update requirement and deepseek configurations (#1764) 2024-12-17 10:16:47 +08:00
liushz
5c8e91f329
[Fix] Fix vllm max_seq_len parameter transfer (#1745)
* [Fix] Fix vllm max_seq_len parameter transfer

* [Fix] Fix vllm max_seq_len parameter transfer

* Update pr-run-test.yml

* Update pr-run-test.yml

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-16 21:44:36 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751)
* fix lint issues

* updated gitignore

* changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py

* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summarizer (#1758)
add new dataset summarizer
2024-12-13 09:50:43 +08:00
zhulinJulia24
a1c00cc8b7
[ci] add common_summarizer return (#1724)
* Update common_summarizer.py

* Update common_summarizer.py
2024-12-11 20:38:32 +08:00
liushz
c4ce0174fe
[Fix] Fix ChineseSimpleQA max_out_len (#1757)
* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* pdate Csimpleqa

* pdate Csimpleqa

* Update Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
2024-12-11 19:51:27 +08:00
Linchen Xiao
bd7b705be4
[Update] Update dataset configuration with no max_out_len (#1754) 2024-12-11 18:20:29 +08:00
OpenStellarTeam
1a5b3fc11e
Add Chinese SimpleQA config (#1697)
* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* pdate Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-12-11 18:03:39 +08:00
Linchen Xiao
0d26b348e4
[Feature] Add OC academic 2412 (#1750) 2024-12-10 21:53:06 +08:00
bittersweet1999
54c0fb7a93
[Change] Change Compassarena metric (#1749)
* fix pip version

* fix pip version

* fix summarizer bug

* fix compassarena

* fix compassarena

* fix compassarena
2024-12-10 14:45:32 +08:00
Songyang Zhang
0d8df541bc
[Update] Update O1-style Benchmark and Prompts (#1742)
* Update JuderBench

* Support O1-style Prompts

* Update Code

* Update OpenAI

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update

* Update

* Update

* Update
2024-12-09 13:48:56 +08:00
Junnan Liu
f333be177c
[Update] Add MATH500 & AIME2024 to LiveMathBench (#1741)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500

* update std computation & k-pass computation

* add AIME224

* update README
2024-12-06 14:36:49 +08:00
bittersweet1999
08d63b5bf3
[Fix] Fix error in subjective default summarizer (#1740)
* fix pip version

* fix pip version

* fix summarizer bug
2024-12-06 11:03:53 +08:00
Songyang Zhang
fb43dd1906
[Update] Update Skywork/Qwen-QwQ (#1728)
* Update JuderBench

* Support O1-style Prompts

* Update Code

* Update OpenAI

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update
2024-12-05 19:30:43 +08:00
Junnan Liu
6181ac1122
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation (#1730)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500
2024-12-05 16:54:16 +08:00
Linchen Xiao
4f317d1bd5
[Update] Update Manifest (#1738) 2024-12-05 13:59:56 +08:00
Linchen Xiao
ac23f0ce1f
[Update] Update init file for Korbench (#1737) 2024-12-05 11:26:00 +08:00
Yufeng Zhao
4d773904d4
[Update] Korbench readme supplementation (#1734)
* renewed

* readme

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-05 11:24:35 +08:00
Linchen Xiao
a011be6798
[Feature] DLC runner Lark report (#1735)
* [Bump] Bump version to 0.3.7

* DLC lark report update
2024-12-04 18:03:12 +08:00
Linchen Xiao
e2a290fd46
[Bump] Bump version to 0.3.7 (#1733) 2024-12-03 19:34:57 +08:00
Yufeng Zhao
98c4666d65
[Update] Update Korbench dataset abbr (#1729)
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-02 16:20:58 +08:00
Linchen Xiao
9de27b4d85
[Update] Update max_out_len for datasets (#1726)
* [Update] Update max_out_len for datasets

* Update eval_regression_chat_objective_fullbench.py

* Update eval_regression_chat.py

* Update eval_regression_chat.py

* Update oc_score_baseline_fullbench.yaml

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-02 11:42:07 +08:00
Junnan Liu
fe6d76fb13
[Feature] Support LiveMathBench (#1727) 2024-11-30 00:07:19 +08:00
liushz
b063779034
[Fix] Update P-MMEVAL OSS data (#1722)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

* Fix pmmeval_gen config

* Update P-MMEVAL data

---------

Co-authored-by: wanyu <wanyu2018umac@gmail.com>
Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>
2024-11-28 20:55:46 +08:00
liushz
c437135fad
[Feature] Add Openai Simpleqa dataset (#1720)
* Add Openai SimpleQA dataset

* Add Openai SimpleQA dataset

* Add Openai SimpleQA dataset

* Update eval_simpleqa.py

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-11-28 19:16:07 +08:00
liushz
06ab27861e
[Fix] Fix pmmeval_gen config (#1719)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

* Fix pmmeval_gen config

---------

Co-authored-by: wanyu <wanyu2018umac@gmail.com>
Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>
2024-11-28 11:53:36 +08:00
wanyu2018umac
90efcf2216
[Feature] Add P-MMEval (#1714)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

---------

Co-authored-by: liushz <qq1791167085@163.com>
2024-11-27 21:26:18 +08:00
Junnan Liu
f7dbe6bb7d
[Feature] Add Arc Prize Public Evaluation (#1690)
* support arc prize

* update arc-prize dataset info & update arc-prize evaluation performance
2024-11-27 15:44:41 +08:00
Yi Ding
bcb707dbfc
[Fix] Fix BailingAPI model (#1707)
* [fix] sequence under the multiple samples

* resolve the lint problems

* change the parameter name

* add another error code for retry

* output the log for invalid response

* format correction

* update

* update

* update

* update

* add two model python files

* update the default parameter

* use random for delay

* update the api example of bailing

* remove the unnecessary parameter
2024-11-26 19:24:47 +08:00
Linchen Xiao
ef695e28e5
[Bug] Fix Korbench dataset module (#1717) 2024-11-26 17:13:28 +08:00
Songyang Zhang
f97c4eae42
[Update] Update Fullbench (#1712)
* Update JuderBench

* Support O1-style Prompts

* Update Code
2024-11-26 14:26:55 +08:00
Yufeng Zhao
300adc31e8
[Feature] Add Korbench dataset (#1713)
* first version for korbench

* first stage for korbench

* korbench_1

* korbench_1

* korbench_1

* korbench_1

* korbench_1_revised

* korbench_combined_1

* korbench_combined_1

* kor_combined

* kor_combined

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2024-11-25 20:11:27 +08:00
Chang Lan
5c1916ea4c
[Update] Add RULER 64k config (#1709) 2024-11-25 19:35:27 +08:00
liushz
e49fcfd3a3
[Update] Update MATH dataset with model judge (#1711)
* Update math with llm judge

* Update math with llm judge

* Update math with llm judge

* Update math with llm judge

* Update math with llm judge
2024-11-25 15:14:55 +08:00
Linchen Xiao
80e3b9ef37
[Update] Add math prm 800k (#1708) 2024-11-21 21:29:43 +08:00
Linchen Xiao
500fb1032a
[Update] Update configurations (#1704) 2024-11-21 16:51:18 +08:00
zhulinJulia24
ed81f9df30
[CI] update torch version and add more datasets into daily testcase (#1701)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-21 10:37:33 +08:00
Yi Ding
05044dfaf2
[Update] Support new error code for Bailing model (#1702)
* support new error code

* fix the lint problems
2024-11-20 16:40:22 +08:00
Linchen Xiao
ff831b153e
[BUMP] Bump version to 0.3.6 (#1694) 2024-11-18 20:24:50 +08:00
Linchen Xiao
ab8fdbbaab
[Update] Update Math auto-download data (#1700) 2024-11-18 20:24:35 +08:00
Linchen Xiao
98242ff1d1
[Update] first_option_postprocess (#1699)
* update first_option_postprocess

* update
2024-11-18 20:14:29 +08:00
Linchen Xiao
4653f6976e
[Update] update volc CPU flavor (#1698) 2024-11-18 12:33:51 +08:00
zhulinJulia24
4a20e1176d
[CI] Update baselines (#1693)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-15 14:46:29 +08:00
Linchen Xiao
40a9f0be0d
[Update] MUSR dataset config prefix update (#1692) 2024-11-15 11:06:30 +08:00
abrohamLee
e9e4b69ddb
[Feature] MuSR Dataset Evaluation (#1689)
* MuSR Dataset Evaluation

* MuSR Dataset Evaluation

Add an assertion and a Readme.md
2024-11-14 20:42:12 +08:00
Linchen Xiao
d415439f9b
[Fix] Fix bug for first_option_postprocess (#1688) 2024-11-14 16:45:59 +08:00
Linchen Xiao
e92a5d4230
[Feature] BABILong Dataset added (#1684)
* update

* update

* update

* update
2024-11-14 15:32:43 +08:00
Linchen Xiao
2fee63f537
[Update] Auto-download for followbench (#1685) 2024-11-13 15:47:29 +08:00
zhulinJulia24
f8a1c1f487
[CI] update (#1682)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-13 10:48:05 +08:00
bittersweet1999
aca8ec3c6a
[Hotfix] Hotfix (#1683)
* fix pip version

* fix pip version

* fix lint

* hotfix
2024-11-13 10:14:27 +08:00
zhulinJulia24
a9d6b6461f
[ci] react daily test (#1668)
* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* refactor summarize

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* updaste

* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* update

* update

* Update daily-run-test.yml

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-12 18:40:27 +08:00
sobeit
3ec178f4a9
add single lora adapter support for vLLM inference. (#1679) 2024-11-12 17:31:36 +08:00
bittersweet1999
17b5e52f6c
[Hotfix] lmdeploy temp (#1674)
* fix pip version

* fix pip version

* hotfix
2024-11-12 16:10:16 +08:00
Linchen Xiao
a0ef2fd3b4
[Update] Dingo Dataset update (#1670)
* [Update] Dingo Dataset update

* update
2024-11-08 14:38:43 +08:00
Linchen Xiao
835bf75a36
[Feature] Add long context evaluation for base models (#1666)
* [Update] Add base long context evaluation

* update
2024-11-08 10:53:29 +08:00
Chang Cheng
fd7aa83c01
[Update] Update DLC Runner (#1662)
* push interntrain hard code

* push interntrain hard code

* remove redundant post process

---------

Co-authored-by: changcheng <changcheng@pjlab.org.cb>
Co-authored-by: changcheng <changcheng@pjlab.org.cn>
2024-11-07 15:45:35 +08:00
Linchen Xiao
db258eb7d5
[Bump] Bump version to v0.3.5 (#1657) 2024-11-03 21:23:35 +08:00
Lyu Han
888f1f3bef
[Fix] Update loglikelihood compatibility (#1659) 2024-11-02 17:19:11 +08:00
liushz
f7d899823c
[Update] Update mmmlu_lite dataload (#1658)
* update mmmlu_lite dataload from oss

* update mmmlu_lite dataload from oss
2024-11-01 17:32:29 +08:00
Songyang Zhang
c789ce5698
[Fix] the automatically download for several datasets (#1652)
* [Fix] the automatically download for several datasets

* Update

* Update

* Update CI
2024-11-01 15:57:18 +08:00
Linchen Xiao
695738a89b
[Update] Add lmdeploy DeepSeek configs (#1656)
* [Update] Add lmdeploy DeepSeek configs

* update max out length
2024-11-01 15:34:23 +08:00
bittersweet1999
a0853c939d
[Add] Add CompassArenaSubjectiveBench (#1645)
* fix pip version

* fix pip version

* add compassarenasubjectivebench

* add compassarenasubjectivebench

* add compassarenabench
2024-11-01 13:52:22 +08:00
Songyang Zhang
d611907d14
[Doc] Update Doc (#1655) 2024-10-31 18:08:09 +08:00
Linchen Xiao
5212ffe8e2
[Update] Add new model configs (#1653) 2024-10-30 17:24:53 +08:00
Linchen Xiao
df57c08ccf
[Feature] Update Models, Summarizers (#1600) 2024-10-29 18:37:15 +08:00
Linchen Xiao
d91d66792a
[Update] Update Needlebench OSS path (#1651) 2024-10-29 18:05:44 +08:00
Chang Lan
46affab882
[Fix] Fix ruler_16k_gen (#1643) 2024-10-29 17:58:43 +08:00
Linchen Xiao
8172af49bb
[Update] Update wildbench max_seq_len (#1648)
* [Update] Wildbench max_seq_len update

* [Update] Wildbench max_seq_len update
2024-10-29 13:21:31 +08:00
Junnan Liu
645c5f3b2c
[Datasets] Add datasets CMO&AIME (#1610)
* add datasets cmo&aime

* delete unused modules

* modify prompt

* update __init__

* update data load and add README

* update data load

* update performance

* update md5

* remove indents

* add indent

* fix log for debug mode
2024-10-28 18:08:02 +08:00
Linchen Xiao
9c39cb68d4
[Bump] Bump version to 0.3.4 (#1639) 2024-10-25 20:10:16 +08:00
Linchen Xiao
a61e8a0803
[Update] Internal humaneval add (#1641)
* [Update] internal_humaneval_add

* update
2024-10-25 19:08:42 +08:00
Songyang Zhang
84be90669b
[Update] Fix issue of *_param.py, avoid name conflict;add keep_tmp_file flag to support keep the temp config file. (#1640) 2024-10-25 16:39:25 +08:00
BigDong
2542bc6907
[Feature] Support results saving as md format table (#1638) 2024-10-25 15:50:33 +08:00
Linchen Xiao
22fdea4bf2
[Update] Update DLC runner (#1637) 2024-10-24 21:36:16 +08:00
Lyu Han
fb12c3f98a
[Update] strip stop_words (#1635) 2024-10-24 20:39:20 +08:00
Linchen Xiao
662dddf41a
[Update] Add internal humaneval postprocess (#1636) 2024-10-24 17:45:21 +08:00
Linchen Xiao
be3c06a158
[Fix] Update common summarizer regex extraction (#1631) 2024-10-22 14:35:45 +08:00
Chang Lan
a927bba1cf
[Fix] Fix RULER datasets (#1628)
We need to ensure that we don't import anything that ends with "_datasets",
or they will be picked up by the runner, leading to duplicate / unwanted datasets
being evaluated.
2024-10-22 11:59:02 +08:00
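
The explanation above refers to the config convention of collecting every variable whose name ends in "_datasets". A minimal sketch of that convention (an assumption about the usual OpenCompass run-config idiom, not this PR's diff), showing why a stray import of such a name gets evaluated:

    # An OpenCompass run config typically gathers all in-scope variables whose
    # names end with "_datasets", so importing such a name from a helper module
    # silently adds those datasets to the evaluation.
    datasets = sum(
        (v for k, v in locals().items() if k.endswith('_datasets')),
        [],
    )
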
Songyang Zhang
a4d5a6c81b
[Feature] Support LiveCodeBench (#1617)
* Update

* Update LCB

* Update

* Update

* Update

* Update

* Update
2024-10-21 20:50:39 +08:00
Chenguang Li
5868d5afa4
[Bug] Fix-NPU-Support (#1618)
* bugfix NPU support

* formatting

---------

Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-10-21 17:42:53 +08:00
liushz
500b44ba2d
[Fix] gpqa_few_shot_ppl prompt bug (#1627) 2024-10-21 16:59:06 +08:00
Linchen Xiao
096c347e7d
[Fix] Qwen 2.5 model config (#1626)
* [Fix] Fix Qwen 2.5 model config

* [Fix] Fix Qwen 2.5 model config

* [Fix] Fix Qwen 2.5 model config
2024-10-21 16:58:18 +08:00
bittersweet1999
1188e1ecf0
[Update] eval_judgerbench.py (#1625) 2024-10-21 15:30:29 +08:00
zhulinJulia24
825d3388d5
[CI] Test PR staging fixed (#1624)
* Update oc_score_baseline.yaml

* Update runtime.txt
2024-10-21 11:02:37 +08:00
bittersweet1999
a11e2b2fd4
[Fix] Compatible with old versions (#1616)
* fix pip version

* fix pip version

* Compatible with old versions

* compati old version

* compati old version

* compati old version

* update configs
2024-10-21 10:16:29 +08:00
Lyu Han
6e8adf5221
[Bug] Remove prefix bos_token from messages when using lmdeploy as the accelerator (#1623)
* remove prefix bos_token from messages when using lmdeploy as the accelerator

* update
2024-10-19 20:03:47 +08:00
zhulinJulia24
b89c7b2fc3
[CI] Update daily-run-test.yml (#1620) 2024-10-18 18:30:35 +08:00
Bob Tsang
dd0b655bd0
[Feature] Support MMMLU & MMMLU-lite Benchmark (#1565)
* rm folder

* modify format according to reviewer

* modify format according to reviewer

* modify format according to reviewer

* add some files requirement

* fix some bug

* fix bug

* change load type

* Update MMMLU Dataset

* Update MMMLU Dataset

* Add MMMLU-Lite Dataset

* update MMMMLU datast

* update MMMMLU datast

* update MMMMLU datast

---------

Co-authored-by: BobTsang <BobTsang1995@gmail.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-10-17 19:09:34 +08:00
bittersweet1999
f0d436496e
[Update] update docs and add compassarena (#1614)
* fix pip version

* fix pip version

* update docs and add compassarena

* update docs
2024-10-17 14:39:06 +08:00
Haoran Que
4fe251729b
Upload HelloBench (#1607)
* upload hellobench

* update hellobench

* update readme.md

* update eval_hellobench.py

* update lastest

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-10-15 17:11:37 +08:00
bittersweet1999
fa54aa62f6
[Feature] Add Judgerbench and reorg subeval (#1593)
* fix pip version

* fix pip version

* update (#1522)

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>

* [Feature] Update Models (#1518)

* Update Models

* Update

* Update humanevalx

* Update

* Update

* [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527)

add judgerbench and reorg sub

add judgerbench and reorg subeval

add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-10-15 16:36:05 +08:00
x54-729
2b1afa7d1e
[Fix] fix interntrain's tokenizer truncate (#1605)
Co-authored-by: x54-729 <xingshuhao.dispatch@pjlab.org.cn>
2024-10-15 16:03:57 +08:00
zhulinJulia24
8aba547e06
[ci] fix stable issue of daily test (#1602)
* update

* update

* update

* Update daily-run-test.yml

* update

* Update daily-run-test.yml

* update

* update

* update

* Update pr-run-test.yml

* Update pr-run-test.yml

* update

* update

* Update daily-run-test.yml

* update

* update

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* updaste

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-10-15 10:14:49 +08:00
Linchen Xiao
f390697a5e
[Fix] Update dlc runner python env (#1604) 2024-10-14 15:50:21 +08:00
Lyu Han
4fde41036f
[Feature] Update TurboMindModel by integrating lmdeploy pipeline API (#1556)
* integrate lmdeploy's pipeline api

* fix linting

* update user guide

* rename

* update

* update

* update

* rollback class name

* update

* remove unused code

* update

* update

* use pipeline

* fix ci check

* compatibility

* compatibility

* remove concurrency

* update

* fix table content

* update
2024-10-14 15:33:40 +08:00
liushz
5faee929db
[Feature] Add GaoKaoMath Dataset for Evaluation & MATH Model Eval Config (#1589)
* Add GaoKaoMath Dataset

* Add MATH LLM Eval

* Update GAOKAO Math Eval Dataset

* Update GAOKAO Math Eval Dataset
2024-10-12 19:13:06 +08:00
Linchen Xiao
69997f11f8
[Feature] Update requirements.txt (#1601)
* update crb

* update crbbench

* update crbbench

* update crbbench

* minor update wildbench

* [Fix] Update doc of wildbench, and merge wildbench into subjective

* [Fix] Update doc of wildbench, and merge wildbench into subjective, fix crbbench

* Update crb.md

* Update crb_pair_judge.py

* Update crb_single_judge.py

* Update subjective_evaluation.md

* Update openai_api.py

* [Update] update wildbench readme

* [Update] update wildbench readme

* [Update] update wildbench readme, remove crb

* Delete configs/eval_subjective_wildbench_pair.py

* Delete configs/eval_subjective_wildbench_single.py

* Update __init__.py

* [Fix] fix version mismatch for CIBench

* [Fix] fix version mismatch for CIBench, local runer

* [Fix] fix version mismatch for CIBench, local runer, remove oracle mode

* BUG: Update cibench.py

* BUG: Update cibench.py

* [Bug] Update agent.txt

* update agent

* Update agent.txt

* update readme

* update

---------

Co-authored-by: kleinzcy <zhangchy2@shanghaitech.edu.cn>
Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-10-12 18:26:57 +08:00
bittersweet1999
3f7a3730d7
[Fix] fix Flames (#1599)
* fix pip version

* fix pip version

* fix flames

* fix flames
2024-10-12 14:34:59 +08:00
Lyu Han
b52ba65c26
[Feature] Integrate lmdeploy pipeline api (#1198)
* integrate lmdeploy's pipeline api

* fix linting

* update user guide

* rename

* update

* update

* update

* rollback class name

* update

* remove unused code

* update

* update

* fix ci check

* compatibility

* remove concurrency

* Update configs/models/hf_internlm/lmdeploy_internlm2_chat_7b.py

* Update docs/zh_cn/advanced_guides/evaluation_lmdeploy.md

* [Bug] fix lint

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-10-09 22:58:06 +08:00
Songyang Zhang
d2ab51abbd
[Bug] Fix pre-commit hook (#1592) 2024-10-09 17:09:48 +08:00
x54-729
4d6349dfe1
[FIX] fix interntrain get_loglikelihood (#1584) 2024-10-08 11:34:04 +08:00
zhulinJulia24
89abcba486
[CI] Fix testcase failure (#1582)
* update

* Update oc_score_baseline.yaml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-10-02 12:30:38 +08:00
Linchen Xiao
22a4e76511
[BUMP] Bump version to 0.3.3 (#1581) 2024-09-30 16:57:41 +08:00
x54-729
bbdca5eb4c
[BUG] Fix eos token handling and add comments for InternTrain (#1569)
Co-authored-by: x54-729 <xingshuhao.dispatch@pjlab.org.cn>
2024-09-30 15:46:06 +08:00
Linchen Xiao
763d7755b6
[BUG] GaokaoBench dataset fix (#1583) 2024-09-30 15:13:26 +08:00
shijinpjlab
7528b8ab8a
[Feature] Add dingo test (#1529)
* add qa dingo

* update

* change name qa to dingo

* eval model: llm_base

* update path

* change name and move path

* add eval_dingo

* update import

* add for pip

* add dingo package

* change import place

* update import place

* fix lint fail

* isort

* double quoted

---------

Co-authored-by: sj <shijin@pjlab.org.cn>
2024-09-29 19:24:58 +08:00
Yi Ding
85a28874aa
[BUG]: Fix Bailing API configs (#1570) 2024-09-27 11:56:57 +08:00
Songyang Zhang
e8437db98f
[Feature] Update BailingLM/OpenAI verbose (#1568)
* [Feature] 1. Update CoreBench Base\n 2. Fix lint issue in BalingAPI

* Update

* [Feature] Update API

* Update
2024-09-27 11:15:25 +08:00
Songyang Zhang
7d50294117
[Feature] Update Bailing (#1567)
* [Feature] 1. Update CoreBench Base\n 2. Fix lint issue in BalingAPI

* Update

* Update

* Update
2024-09-26 18:56:17 +08:00
Songyang Zhang
a7bacfdf7e
[Feature] Update CoreBench 2.0 (#1566)
* [Feature] 1. Update CoreBench Base\n 2. Fix lint issue in BalingAPI

* Update

* Update
2024-09-26 18:44:00 +08:00
Yi Ding
3f833186dc
[Feature] Support the reasoning from BaiLing LLM (#1541)
* [Feature] Support the reasoning from BaiLing LLM

This commit includes the access to BaiLing LLM and gets the reasoning.

* Add the api example

The example of evalute bailing api

* Revise the generation arguments

Based on current experiment, we update some generation arguments for better reasoning

* [fix] set the batch size

* Retry under flowcontrol of serverside

* add dependent package into requirement.txt

add dependent package retrying to clean up the pre-comment check.

* correct the file names and make the file copy

correct the file names.
copy the files under configs to opencompass

* fix the lint issue

---------

Co-authored-by: christopher.dy <christopher.dy@antgroup.com>
2024-09-26 16:49:52 +08:00
Linchen Xiao
80cda1980e
[BUG] fix followbench dataset config (#1564)
* [BUG] fix followbench dataset config

* [BUG] fix followbench dataset config
2024-09-25 20:58:34 +08:00
zhulinJulia24
aa43eaf267
[CI] add more models into testcase and test env of cu12 (#1558)
* update

* update

* Update pr-run-test.yml

* update

* update

* update

* update

* Update daily-run-test.yml

* update

* updaste

* update

* update

* update

* Update daily-run-test.yml

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-25 17:07:27 +08:00
zhulinJulia24
87df8a73a3
[CI] add a common summarizer for qabench summarizer (#1545)
* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-25 13:40:47 +08:00
Linchen Xiao
c3fb9065db
[Feature] Add dlc sleep time (#1562) 2024-09-25 11:53:48 +08:00
Songyang Zhang
fe84bbd9a0
[Feature] Add Config for CoreBench (#1547)
* [Feature] Add Config for CoreBench

* Update
2024-09-25 11:36:43 +08:00
Chuanyang Jin
17eefc0e1e
[Fix] Correct typos (#1561) 2024-09-25 11:27:17 +08:00
liushz
83eeb52b09
[Feature] Update WikiBench base model config (#1553)
* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update GPQA & MMLU_Pro

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update MathBench & Math base config

* Update WikiBench base model config

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-25 11:26:36 +08:00
Songyang Zhang
e7681943f3
[Feature] Update the max_out_len for many models (#1559) 2024-09-24 21:52:28 +08:00
bittersweet1999
a2e9bc0c41
[Fix] fix duplicate error in partitioner (#1552)
* fix pip version

* fix pip version

* fix duplicate error in partitioner

* fix duplicate error in partitioner
2024-09-23 19:45:21 +08:00
x54-729
335667183a
[Feature] Add Interntrain model support (#1548)
Co-authored-by: x54-729 <xingshuhao.dispatch@pjlab.org.cn>
2024-09-23 19:10:26 +08:00
klein
24915aeb3f
[BUG] Update CIBench config (#1544)
* BUG: Update cibench.py

* BUG: Update cibench.py
2024-09-23 18:32:27 +08:00
liushz
a0cfd61129
[Feature] Update MathBench & Math base model config (#1550)
* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update GPQA & MMLU_Pro

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update MathBench & Math base config

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-23 14:03:59 +08:00
Songyang Zhang
ee058e25b2
[Feature] Support verbose for OpenAI API (#1546) 2024-09-20 17:12:52 +08:00
hailsham
a81bbb85bf
[FIX] Added handling for the "begin section" in meta_template to APITemplateParser (#1405)
Co-authored-by: leifei <nuuooo@icloud.com>
2024-09-19 18:12:04 +08:00
Songyang Zhang
5a27c2bd6f
[Model] Support Qwen2.5 Instruct (#1543) 2024-09-19 16:16:07 +08:00
Songyang Zhang
be460fbb21
[Feature] Support OpenAI O1 models (#1539)
* [Feature] Support OpenAI O1 models

* Update README.md

---------

Co-authored-by: liushz <qq1791167085@163.com>
2024-09-18 22:41:17 +08:00
liushz
2e9db77d57
[Feature] Add custom model postprocess function (#1519)
Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-18 14:40:51 +08:00
liushz
c9a7026f59
[Feature] Update MathBench & WikiBench for FullBench (#1521)
* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update GPQA & MMLU_Pro

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

* Update MathBench & WikiBench for FullBench

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-18 14:35:30 +08:00
Songyang Zhang
cfbd308edf
[Doc] Update README (#1528)
* '

* Update
2024-09-14 16:02:17 +08:00
Linchen Xiao
90279b6461
[Feature] Dataset prompts update for ARC, BoolQ, Race (#1527) 2024-09-13 10:30:43 +08:00
Songyang Zhang
6997990c93
[Feature] Update Models (#1518)
* Update Models

* Update

* Update humanevalx

* Update

* Update
2024-09-12 23:35:30 +08:00
zhulinJulia24
3754dc1b67
update (#1522)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-12 15:00:52 +08:00
bittersweet1999
7c7fa36235
[Feature] add support for internal Followbench (#1511)
* fix pip version

* fix pip version

* add internal followbench

* add internal followbench

* fix lint

* fix lint
2024-09-11 13:32:34 +08:00
Linchen Xiao
317763381c
update (#1517) 2024-09-11 13:31:20 +08:00
bittersweet1999
c2bcd8725e
[Fix] Fix wildbench (#1508)
* fix pip version

* fix pip version

* fix_wildbench
2024-09-10 17:35:07 +08:00
Alexander Lam
a31a77c5c1
[Feature] Add SciCode summarizer config (#1514)
* [Feature] added SciCode summarizer config and dataset config for with-background evaluation

* fix lint issues

* removed unnecessary type in summarizer group
2024-09-10 16:06:02 +08:00
Mo Li
5b93592242
[Fix] Fix link-check workflow by adjusting line breaks in URL ignore patterns (#1507)
* update link-check

* update link-check

* update link-check
2024-09-10 10:20:40 +08:00
Linchen Xiao
b5f8afb57b
[Bump] Bump version to 0.3.2.post1 2024-09-06 19:09:30 +08:00
Linchen Xiao
f04f3546bc
[Fix] Import fix (#1500) 2024-09-06 18:29:24 +08:00
Linchen Xiao
ff18545f0e
[Bump] Bump version to 0.3.2 (#1497) 2024-09-06 16:10:45 +08:00
Linchen Xiao
87ffa71d68
[Feature] Longbench dataset update 2024-09-06 15:50:12 +08:00
Albert Yan
928d0cfc3a
[Feature] Add support for Rendu API (#1468)
* Add support for Rendu API

* fix lint issue

* fix lint issue

* fix lint issue

* Update

---------

Co-authored-by: 13190 <zeyu.yan@transn.com>
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-09-06 01:00:43 +08:00
Hari Seldon
faf5260155
[Feature] Optimize Evaluation Speed of SciCode (#1489)
* update scicode

* update comments

* remove redundant variable

* Update

---------

Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-09-06 00:59:41 +08:00
liushz
00fc8da5be
[Feature] Add model postprocess function (#1484)
* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

* Add model postprocess function

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-05 21:10:29 +08:00
Maxime SHE
45efdc994d
[Feature] Add an attribute api_key into TurboMindAPIModel default None (#1475)
Co-authored-by: Maxime <maximeshe@163.com>
Add an api_key attribute (default None) to TurboMindAPIModel so the api_key can be set when using lmdeploy to deploy the LLM model
2024-09-05 17:51:16 +08:00
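The entry above adds an api_key attribute to TurboMindAPIModel. A minimal config sketch of passing it through the usual OpenCompass dict-style model config; apart from api_key, the field names and values are illustrative assumptions rather than details from the commit:

```python
# Hypothetical model config: supplying api_key to TurboMindAPIModel when the
# lmdeploy api_server is started with authentication enabled.
from opencompass.models import TurboMindAPIModel  # assumed import path

models = [
    dict(
        type=TurboMindAPIModel,
        abbr='internlm2-chat-7b-lmdeploy-api',   # illustrative name
        api_addr='http://127.0.0.1:23333',       # illustrative server address
        api_key='YOUR_API_KEY',                  # new attribute, defaults to None
        max_out_len=1024,
        batch_size=8,
    ),
]
```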
Linchen Xiao
6c9cd9a260
[Feature] Needlebench auto-download update (#1480)
* update

* update

* update
2024-09-05 17:22:42 +08:00
zhulinJulia24
716d46e1f5
[ci] fix badcase and add env info (#1491)
* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-05 16:43:45 +08:00
zhulinJulia24
fb6a0df652
[ci] fix test env for vllm and add vllm baselines (#1481)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-04 19:24:09 +08:00
Linchen Xiao
da74cbfa39
[Fix] Model configs update 2024-09-04 18:57:10 +08:00
Linchen Xiao
95aad6c282
[Fix] Requirements update 2024-09-03 18:50:40 +08:00
Linchen Xiao
9693be46b7
[Feature] Mmlu-pro auto-download (#1464)
* update

* update

* update

* update

* update
2024-08-30 10:03:40 +08:00
zhulinJulia24
f34209766d
[ci] fix test env (#1470)
* Update daily-run-test.yml

* Update daily-run-test.yml

* Update pr-run-test.yml

* Update daily-run-test.yml

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-08-29 14:48:17 +08:00
Alexander Lam
8b39225259
[Feature] Added extra_body support for OpenAISDK; Added support for proxy URL when connecting to OpenAI's API. (#1467)
* fix lint issues

* fix lint issues
2024-08-29 00:43:43 +08:00
Guoli Yin
a488b9b4f5
[Feature] Make OPENAI_API_BASE compatible with openai default env (#1461)
* Make OPENAI_API_BASE compatible with openai default env

* Make OPENAI_API_BASE compatible with openai default env

---------

Co-authored-by: Guoli Yin <gyin@icloud.com>
2024-08-28 23:14:41 +08:00
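The entry above lets the endpoint be picked up from openai's default environment variable. A minimal sketch of that kind of fallback; the function name and default URL are assumptions:

```python
import os
from typing import Optional

def resolve_openai_base(configured: Optional[str] = None) -> str:
    # Prefer an explicitly configured URL, then the OPENAI_API_BASE environment
    # variable (openai's default env), then the public endpoint.
    return (configured
            or os.environ.get('OPENAI_API_BASE')
            or 'https://api.openai.com/v1/chat/completions')

if __name__ == '__main__':
    print(resolve_openai_base())  # falls back to the env var when unset in code
```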
Songyang Zhang
e5a8eb2283
[Feature] Update Lint and Leaderboard (#1458)
* [Feature] Update Lint and Leaderboard

* Update

* Update
2024-08-28 22:36:42 +08:00
Linchen Xiao
245664f4c0
[Feature] Fullbench v0.1 language update (#1463)
* update

* update

* update

* update
2024-08-28 14:01:05 +08:00
CHEN PENGAN
463231c651
[Feature] Add icl_sliding_k_retriever.py and update __init__.py (#1305)
* Add icl_sliding_k_retriever.py and update __init__.py

* Fix flake8, isort, and yapf issues for Sliding Window Retriever
2024-08-23 17:18:31 +08:00
Linchen Xiao
94b6bd65fc
[Fix] Fix cli evaluation for multiple models (#1454)
* update

* update
2024-08-23 17:15:36 +08:00
Linchen Xiao
2295a33a18
[Doc] Update readme (#1453) 2024-08-23 14:11:01 +08:00
Songyang Zhang
5485207fbe
[Bump] Bump version to 0.3.1 (#1450)
* [Bump] Bump version 0.3.1

* Update
2024-08-23 10:47:57 +08:00
Songyang Zhang
7c2d25b557
[Fix] Update SciCode and Gemma model (#1449)
* [Fix] Update SciCode and Gemma model

* Update

* Update
2024-08-23 10:42:27 +08:00
Xu Song
ad3931aa32
Update openicl_infer.py (#1308) 2024-08-23 10:39:22 +08:00
zhulinJulia24
fb69ba5eb8
[CI] add commond testcase into daily testcase (#1447)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-08-23 01:49:17 +08:00
liushz
9fdbc744dc
[Fix] Update option postprocess & mathbench language summarizer (#1413)
* Update option postprocess & mathbench language summarizer

* Update option postprocess & mathbench language summarizer

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-08-22 14:49:07 +08:00
Linchen Xiao
0fe9756c5d
[Doc] Update Readme (#1439)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update
2024-08-22 14:48:45 +08:00
Hari Seldon
14b4b735cb
[Feature] Add support for SciCode (#1417)
* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode

* add SciCode w/ bg

* add scicode

* Update README.md

* Update README.md

* Delete configs/eval_SciCode.py

* rename

* 1

* rename

* Update README.md

* Update scicode.py

* Update scicode.py

* fix some bugs

* Update

* Update

---------

Co-authored-by: root <HariSeldon0>
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-08-22 13:42:25 +08:00
liushz
d3963bceae
[Bug] Add model support for 'huggingface_above_v4_33' when using '-a' (#1430)
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-08-22 13:40:24 +08:00
seetimee
ac093fce53
[Update] Update openai_api.py (#1438)
Most models' token limits are above 32k. This fixes a long-context dataset test bug where some data was skipped.
2024-08-21 18:57:49 +08:00
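The entry above raises the assumed context limit so long-context samples are no longer dropped. A rough sketch of the skip check this affects; the limit values and helper name are assumptions, not the actual openai_api.py code:

```python
import tiktoken

MAX_SEQ_LEN = 32768   # assumed new default; a smaller limit skipped more data
MAX_OUT_LEN = 2048    # generation budget reserved out of the context window

def should_skip(prompt: str, model: str = 'gpt-4') -> bool:
    # A sample is skipped when its prompt plus the reserved generation budget
    # would not fit in the context window, so a larger MAX_SEQ_LEN keeps more
    # long-context samples in the evaluation.
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(prompt)) + MAX_OUT_LEN > MAX_SEQ_LEN

print(should_skip('How many tokens is this?'))  # False for short prompts
```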
liushz
e076dc5acf
[Fix] Fix openai api tiktoken bug for api server (#1433)
* Fix openai api tiktoken

* Fix openai api tiktoken

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-08-20 22:02:14 +08:00
Linchen Xiao
a4b54048ae
[Feature] Add Ruler datasets (#1310)
* [Feature] Add Ruler datasets

* pre-commit fixed

* Add model specific tokenizer to dataset

* pre-commit modified

* remove unused import

* fix linting

* add trust_remote to tokenizer load

* lint fix

* comments resolved

* fix lint

* Add readme

* Fix lint

* ruler refactorize

* fix lint

* lint fix

* updated

* lint fix

* fix wonderwords import issue

* prompt modified

* update

* readme updated

* update

* ruler dataset added

* Update

---------

Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-08-20 11:40:11 +08:00
Xu Song
99b5122ed5
[Feature] Add abbr for rolebench dataset (#1431)
* Add abbr for rolebench dataset

* add
2024-08-20 11:22:48 +08:00
Linchen Xiao
ecf9bb3e4c
[Bug] Commonsenseqa dataset fix (#1425)
* longbench dataset load fix

* update

* Update

* Update

* Update

* update

* update

---------

Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-08-16 15:54:07 +08:00
Songyang Zhang
9b3613f10b
[Update] Support auto-download of FOFO/MT-Bench-101 (#1423)
* [Update] Support auto-download of FOFO/MT-Bench-101

* Update wildbench
2024-08-16 11:57:41 +08:00
bittersweet1999
ce7f4853ce
[Fix] Sub summarizer order fix (#1426)
* fix pip version

* fix pip version

* fix sub summarizer order

* fix order
2024-08-15 21:08:18 +08:00
Linchen Xiao
2596f226f4
[Fix] longbench dataset load fix (#1422) 2024-08-15 11:30:30 +08:00
Linchen Xiao
8e55c9c6ee
[Update] Compassbench v1.3 (#1396)
* stash files

* compassbench subjective evaluation added

* evaluation update

* fix lint

* update docs

* Update lint

* changes saved

* changes saved

* CompassBench subjective summarizer added (#1349)

* subjective summarizer added

* fix lint

[Fix] Fix MathBench (#1351)

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>

[Update] Update model support list (#1353)

* fix pip version

* fix pip version

* update model support

subjective summarizer updated

knowledge, math objective done (data need update)

remove secrets

objective changes saved

knowledge data added

* secrets removed

* changed added

* summarizer modified

* summarizer modified

* compassbench coding added

* fix lint

* objective summarizer updated

* compass_bench_v1.3 updated

* update files in config folder

* remove unused model

* lcbench modified

* removed model evaluation configs

* remove duplicated sdk implementation

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-08-12 19:09:19 +08:00
changyeyu
59586a8b4a
[Feature] Enable Truncation of Mid-Section for Long Prompts in huggingface_above_v4_33.py (#1373)
* Retain the first and last halves of the tokens from the prompt, discarding the middle, to avoid exceeding the model's maximum length.

* Add default parameter: mode

* Modified a comment.

* Modified variable names.

* fix yapf lint
2024-08-09 11:36:30 +08:00
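The entry above keeps the first and last halves of an over-long prompt and discards the middle. A standalone sketch of that truncation strategy, not the actual huggingface_above_v4_33.py implementation:

```python
def truncate_middle(token_ids: list[int], max_len: int) -> list[int]:
    """Drop the middle of a too-long token sequence, keeping the first and
    last halves so the result never exceeds the model's maximum length."""
    if len(token_ids) <= max_len:
        return token_ids
    half = max_len // 2
    # first `half` tokens + last `max_len - half` tokens
    return token_ids[:half] + token_ids[len(token_ids) - (max_len - half):]

# A 10-token prompt truncated to 6 keeps the first 3 and last 3 tokens.
print(truncate_middle(list(range(10)), 6))  # [0, 1, 2, 7, 8, 9]
```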
Songyang Zhang
88eb91219b
[Doc] Update README (#1404)
* [Doc] Update README

* Update
2024-08-08 16:18:33 +08:00
yaoyingyy
decb621ff6
[Fix] the issue where scores are negative in the Lawbench dataset evaluation (#1402) (#1403) 2024-08-08 16:08:26 +08:00
Yunlin Mao
818d72a650
[Fix] modelscope dataset load problem (#1406)
* fix modelscope dataset load

* fix lint
2024-08-08 14:01:06 +08:00
Songyang Zhang
264fd23129
[Bump] Bump version for v0.3.0 (#1398) 2024-08-07 01:25:24 +08:00
Songyang Zhang
fed1a4998b
[Fix] Fix CaLM import (#1395) 2024-08-06 12:17:45 +08:00
Songyang Zhang
c81329b548
[Fix] Fix Slurm ENV (#1392)
1. Support Slurm Cluster
2. Support automatic data download
3. Update InternLM2.5-1.8B/20B-Chat
2024-08-06 01:35:20 +08:00
Songyang Zhang
c09fc79ba8
[Feature] Support OpenAI ChatCompletion (#1389)
* [Feature] Support import configs/models/summarizers from whl

* Update

* Update openai sdk

* Update

* Update gemma
2024-08-01 19:10:13 +08:00
Peng Bo
07c96ac659
Calm dataset (#1385)
* Add CALM Dataset
2024-08-01 10:03:21 +08:00
Songyang Zhang
46cc7894e1
[Feature] Support import configs/models/summarizers from whl (#1376)
* [Feature] Support import configs/models/summarizers from whl

* Update LCBench configs

* Update

* Update

* Update

* Update

* update

* Update

* Update

* Update

* Update

* Update
2024-08-01 00:42:48 +08:00
Mo Li
b83396f57c
add 1m config (#1383) 2024-07-31 14:53:51 +08:00
klein
52eccc4f0e
[Fix] Fix version mismatch of CIBench (#1380)
* update crb

* update crbbench

* update crbbench

* update crbbench

* minor update wildbench

* [Fix] Update doc of wildbench, and merge wildbench into subjective

* [Fix] Update doc of wildbench, and merge wildbench into subjective, fix crbbench

* Update crb.md

* Update crb_pair_judge.py

* Update crb_single_judge.py

* Update subjective_evaluation.md

* Update openai_api.py

* [Update] update wildbench readme

* [Update] update wildbench readme

* [Update] update wildbench readme, remove crb

* Delete configs/eval_subjective_wildbench_pair.py

* Delete configs/eval_subjective_wildbench_single.py

* Update __init__.py

* [Fix] fix version mismatch for CIBench

* [Fix] fix version mismatch for CIBench, local runner

* [Fix] fix version mismatch for CIBench, local runner, remove oracle mode

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-07-30 17:51:24 +08:00
Songyang Zhang
33ceaa0eb8
[Bug] Fix bug in turbomind (#1377) 2024-07-30 09:37:50 +08:00
Songyang Zhang
eee5a5be23
[Fix] Update get_data_path for LCBench and HumanEval (#1375) 2024-07-29 19:28:09 +08:00
QXY
fea11b1d20
[Feature] add support for hf_pulse_7b (#1255)
* add support for hf_pulse_7b

* Update hf_pulse_7b.py
2024-07-29 19:01:52 +08:00
Songyang Zhang
704853e5e7
[Feature] Update pip install (#1324)
* [Feature] Update pip install

* Update Configuration

* Update

* Update

* Update

* Update Internal Config

* Update collect env
2024-07-29 18:32:50 +08:00
Xingjun.Wang
edab1c07ba
[Feature] Support ModelScope datasets (#1289)
* add ceval, gsm8k modelscope support

* update race, mmlu, arc, cmmlu, commonsenseqa, humaneval and unittest

* update bbh, flores, obqa, siqa, storycloze, summedits, winogrande, xsum datasets

* format file

* format file

* update dataset format

* support ms_dataset

* update dataset for modelscope support

* merge myl_dev and update test_ms_dataset

* update dataset for modelscope support

* update readme

* update eval_api_zhipu_v2

* remove unused code

* add get_data_path function

* update readme

* remove tydiqa japanese subset

* add ceval, gsm8k modelscope support

* update race, mmlu, arc, cmmlu, commonsenseqa, humaneval and unittest

* update bbh, flores, obqa, siqa, storycloze, summedits, winogrande, xsum datasets

* format file

* format file

* update dataset format

* support ms_dataset

* update dataset for modelscope support

* merge myl_dev and update test_ms_dataset

* update readme

* update dataset for modelscope support

* update eval_api_zhipu_v2

* remove unused code

* add get_data_path function

* remove tydiqa japanese subset

* update util

* remove .DS_Store

* fix md format

* move util into package

* update docs/get_started.md

* restore eval_api_zhipu_v2.py, add environment setting

* Update dataset

* Update

* Update

* Update

* Update

---------

Co-authored-by: Yun lin <yunlin@U-Q9X2K4QV-1904.local>
Co-authored-by: Yunnglin <mao.looper@qq.com>
Co-authored-by: Yun lin <yunlin@laptop.local>
Co-authored-by: Yunnglin <maoyl@smail.nju.edu.cn>
Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-07-29 13:48:32 +08:00
jxd
12b84aeb3b
[Feature] Update CHARM Memorization (#1230)
* update gemini api and add gemini models

* add openai models

* update CHARM evaluation

* add CHARM memorization tasks

* add CharmMemSummarizer (output eval details for memorization-independent reasoning analysis)

* update CHARM readme

---------

Co-authored-by: wujiang <wujiang@pjlab.org.cn>
2024-07-26 18:42:30 +08:00
bittersweet1999
d3782c1d47
Revert "Calm dataset (#1287)" (#1366)
This reverts commit edd0ffdf70.
2024-07-26 18:27:29 +08:00
Xu Song
9b9855a008
Add en and zh groups to longbench summarizer; Fix longbench overall score (#1216)
* Add longbench groups

* update

* update
2024-07-26 11:50:41 +08:00
Peng Bo
edd0ffdf70
Calm dataset (#1287)
* add calm dataset

* modify config max_out_len

* update README

* Modify README

* update README

* update README

* update README

* update README

* update README

* add summarizer and modify readme

* delete summarizer config comment

* update summarizer

* modify same response to all questions

* update README
2024-07-26 11:48:16 +08:00
mqy004
a08931f214
[Fix] origin_prompt should be None in llm-compression task (#1225)
Co-authored-by: Qinyang Mou <qinyang_mou@intsig.net>
2024-07-26 11:46:02 +08:00
LeavittLang
8ee7fecb68
Adding support for Doubao API (#1218)
* Adding support for Doubao API

* Update doubao_api.py

Fixed a bug where the connection would be retried even when the request succeeded.

* Update doubao_api.py

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-07-26 11:44:51 +08:00
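The entry above fixes a connection that kept being retried even after it succeeded. A generic sketch of retry-only-on-failure logic; the request callable and success check are assumptions, not the actual doubao_api.py code:

```python
import time

def call_with_retry(send_request, max_retries: int = 3, backoff: float = 1.0):
    # Retry only when the request fails; a successful response returns
    # immediately instead of being retried again.
    last_err = None
    for attempt in range(max_retries):
        try:
            response = send_request()
            if response.get('status') == 'ok':   # assumed success criterion
                return response
            last_err = RuntimeError(f'bad status: {response}')
        except Exception as err:                 # network error, timeout, etc.
            last_err = err
        time.sleep(backoff * (attempt + 1))
    raise last_err
```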
klein
65fad8e2ac
[Fix] minor update wildbench (#1335)
* update crb

* update crbbench

* update crbbench

* update crbbench

* minor update wildbench

* [Fix] Update doc of wildbench, and merge wildbench into subjective

* [Fix] Update doc of wildbench, and merge wildbench into subjective, fix crbbench

* Update crb.md

* Update crb_pair_judge.py

* Update crb_single_judge.py

* Update subjective_evaluation.md

* Update openai_api.py

* [Update] update wildbench readme

* [Update] update wildbench readme

* [Update] update wildbench readme, remove crb

* Delete configs/eval_subjective_wildbench_pair.py

* Delete configs/eval_subjective_wildbench_single.py

* Update __init__.py

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-07-26 11:19:04 +08:00
baymax591
51a94aee01
[Bug] fix bug: delete & (#1365)
Co-authored-by: 白超 <baichao19@huawei.com>
2024-07-26 11:03:55 +08:00
Mo Li
69aa2f2d57
[Feature] Make NeedleBench available on HF (#1364)
* update_lint

* update_huggingface format

* fix bug

* update docs
2024-07-25 19:01:56 +08:00
Fengzhe Zhou
c3c02c2960
update docs (#1318)
* update docs

* Efficient evaluation -> Data sharding

* update

* update

* Update faq.md

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-07-25 18:44:25 +08:00
heya5
73aa55af6d
[Fix] Support HF models deployed with an OpenAI-compatible API. (#1352)
* Support HF models deployed with an OpenAI-compatible API.

* resolve lint issue

* add extra_body arguments

There are many other arguments when using an OpenAI-compatible API; see https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api

* fix linting issue

* fix yapf linting issue
2024-07-25 18:38:23 +08:00
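The entry above routes HF models served behind an OpenAI-compatible endpoint (such as vLLM) through the standard client, forwarding server-specific options via extra_body. A minimal sketch of such a call; the URL, model name, and parameter values are assumptions:

```python
from openai import OpenAI

# Point the standard openai client at a locally served OpenAI-compatible API.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='EMPTY')

resp = client.chat.completions.create(
    model='my-hf-model',  # name the server registered for the HF model
    messages=[{'role': 'user', 'content': 'Hello'}],
    # extra_body forwards parameters outside the official OpenAI schema,
    # e.g. vLLM's top_k or repetition_penalty.
    extra_body={'top_k': 50, 'repetition_penalty': 1.05},
)
print(resp.choices[0].message.content)
```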
WANG WENJIN
0aad8199c7
Fix the summary error in subjective.py (#1363) 2024-07-25 18:36:13 +08:00
bittersweet1999
8fe75e9937
[Update] update Subeval demo config (#1358)
* fix pip version

* fix pip version

* update demo config
2024-07-24 15:48:28 +08:00
bittersweet1999
86b6d18731
[Update] Update model support list (#1353)
* fix pip version

* fix pip version

* update model support
2024-07-23 13:35:58 +08:00
liushz
cf3e942f73
[Fix] Fix MathBench (#1351)
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-07-23 13:35:38 +08:00
Linchen Xiao
8127fc3518
CompassBench subjective summarizer added (#1349)
* subjective summarizer added

* fix lint
2024-07-23 12:29:57 +08:00
Que Haoran
a244453d9e
[Feature] Support inference ppl datasets (#1315)
* commit inference ppl datasets

* revised format

* revise

* revise

* revise

* revise

* revise

* revise
2024-07-22 17:59:30 +08:00
Xu Song
e9384823f2
Upgrade default math pred_postprocessor (#1340)
* Change default math postprocessor

* Update math_gen_265cce.py
2024-07-22 14:00:49 +08:00
Songyang Zhang
96f644de69
[Fix] Update path and folder (#1344)
* Update path and folder

* Update path

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-07-21 08:18:14 +08:00
Linchen Xiao
a56678190b
[Feature] CompassBench v1_3 subjective evaluation (#1341)
* stash files

* compassbench subjective evaluation added

* evaluation update

* remove unneeded content

* fix lint

* update docs

* Update lint

* Update

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-07-19 23:12:23 +08:00
liushz
98c58f8a6c
[Feature] Add compassbench knowledge&math part (#1342)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Update accelerator

* Update MathBench

* Update accelerator

* Add Doc for accelerator

* Add Doc for accelerator

* Add Doc for accelerator

* Add Doc for accelerator

* Update compassbench august wiki&math

* Update compassbench august wiki&math

* Update compassbench august wiki&math

* Update compassbench_aug_gen_068af0.py

* Update compassbench_aug_gen_068af0.py

* Update

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-07-19 22:54:46 +08:00
bittersweet1999
1f9f728f22
[Feature] support compassbench Checklist evaluation (#1339)
* fix pip version

* fix pip version

* support checklist eval

* init

* add lan

* fix typo
2024-07-19 16:40:44 +08:00
Mo Li
f40add2596
[Fix] Fix lint (#1334)
* update needlebench docs

* update model_name_mapping dict

* update README

* fix_lint
2024-07-18 17:15:06 +08:00
Xu Song
1bfb4217ff
Fix typing and typo (#1331) 2024-07-18 13:41:24 +08:00
Mo Li
104bddf647
[Doc] Update NeedleBench Docs (#1330)
* update needlebench docs

* update model_name_mapping dict

* update README

* Update README_zh-CN.md

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-07-18 13:16:19 +08:00
Xu Song
0a1c89e618
[Fix] Fix rouge evaluator of rolebench_zh (#1322) 2024-07-16 16:18:13 +08:00
bittersweet1999
3aeabbc427
[Fix] update Faq (#1313)
* fix pip version

* fix pip version

* update faq

* update faq

* update faq

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-07-12 11:29:26 +08:00
bittersweet1999
8e7ad2e981
[Fix] add bc for alignbench summarizer (#1306)
* fix pip version

* fix pip version

* fix alignbench

* fix import error
2024-07-12 11:06:20 +08:00
Fengzhe Zhou
62f55987f1
force register (#1311) 2024-07-11 19:59:35 +08:00
bittersweet1999
889e7e1140
[Fix] Change abbr for arenahard dataset (#1302)
* fix pip version

* fix pip version

* change abbr for arenahard
2024-07-11 12:42:03 +08:00
Fengzhe Zhou
a62c613d3e
[Sync] bump version 0.2.6+local (#1294) 2024-07-06 00:44:06 +08:00
Fengzhe Zhou
1d3a26c732
[Doc] quick start swap tabs (#1263)
* [doc] quick start swap tabs

* update docs

* update

* update

* update

* update

* update

* update

* update
2024-07-05 23:51:42 +08:00
bittersweet1999
68ca48496b
[Refactor] Reorganize subjective eval (#1284)
* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
2024-07-05 22:11:37 +08:00
Songyang Zhang
aadcfa625f
[Feat] Update owners for issues (#1293)
* [Feat] Update owners for issues

* update owners

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-07-05 18:27:30 +08:00
Songyang Zhang
409a042d93
[Feature] Add InternLM2.5 (#1286)
* [Feature] Add InternLM2.5

* Update

* update readme

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-07-04 20:10:31 +08:00
zhulinJulia24
167cfdcca3
[ci] update daily testcase (#1285)
* Update daily-run-test.yml

* Create eval_regression_chat.py

* Delete .github/scripts/.github/scripts/eval_regression_chat.py

* Create eval_regression_chat.py

* Update pr-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update oc_score_baseline.yaml

* Update oc_score_assert.py

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update oc_score_baseline.yaml

* Update oc_score_assert.py

* Update oc_score_assert.py

* fix lint

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-07-03 18:56:09 +08:00
baymax591
28eba6fe34
NPU adaptation (#1250)
* NPU adaptation

* Add support for Ascend NPU

* format

---------

Co-authored-by: baymax591 <14428251+baymax591@user.noreply.gitee.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-07-03 18:55:19 +08:00
liushz
fc2c9dea8c
Update MathBench summarizer & fix cot setting (#1282)
* Update MathBench

* Update MathBench

* Update MathBench

---------

Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-07-01 21:51:17 +08:00
Fengzhe Zhou
a32f21a356
[Sync] Sync with internal codes 2024.06.28 (#1279) 2024-06-28 14:16:34 +08:00
Xingyuan Bu
842fb1cd70
Update mtbench101.py (#1276)
fix incorrectly used import
from torch.utils.data import DataLoader, Dataset
2024-06-26 00:40:22 +08:00
zhulinJulia24
26d077b080
flash attn installation in daily testcase (#1272)
* Update daily-run-test.yml

* Update daily-run-test.yml

* Update oc_score_baseline.yaml
2024-06-24 18:22:46 +08:00
liushz
e5ee1647fb
Add doc for accelerator function (#1252)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Update accelerator

* Update MathBench

* Update accelerator

* Add Doc for accelerator

* Add Doc for accelerator

* Add Doc for accelerator

* Add Doc for accelerator

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-06-24 14:53:51 +08:00
klein
1fa62c4a42
Support wildbench (#1266)
Co-authored-by: Leymore <zfz-960727@163.com>
2024-06-24 13:16:27 +08:00
LIU Xiao
83b9fd9eaa
add ",<2.0.0" to "numpy>=1.23.4" in requirements/runtime.txt, as pandas<2.0.0 doesn't compatible with numpy>=2.0.0 (#1267) 2024-06-24 11:03:42 +08:00
bittersweet1999
e0d7808b4e
[Fix] fix pip version (#1228)
* fix pip version

* fix pip version
2024-06-06 11:48:07 +08:00
bittersweet1999
982e024540
[Feature] add dataset Fofo (#1224)
* add fofo dataset

* add dataset fofo
2024-06-06 11:40:48 +08:00
Xingyuan Bu
02a0a4e857
MT-Bench-101 (#1215)
* add mt-bench-101

* add readme and requirements

* add mt-bench-101 data

* Update readme_mtbench101.md

* update readme

* update leaderboard

* fix typo

* Update readme_mtbench101.md

* fit newest opencompass

* update readme.md

* mtbench101 to opencompass

* mtbench101 to opencompass

* for code review

* for code review

* for code review

* hook

* hook

---------

Co-authored-by: liujie <ljie@buaa.edu.cn>
2024-06-03 14:52:12 +08:00
mqy004
b272803d8a
Fix the issue where opencompass.cli.main cannot be imported after installing the release version (#1221)
* Create __init__.py

* Create __init__.py

* Create __init__.py

* Create __init__.py

* Create __init__.py

* Create __init__.py

* format

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-31 13:23:33 +08:00
bittersweet1999
7c381e5be8
[Fix] fix summarizer (#1217)
* fix summarizer

* fix summarizer
2024-05-31 11:40:47 +08:00
Fengzhe Zhou
a77b8a5cec
[Sync] format (#1214) 2024-05-30 00:21:58 +08:00
Fengzhe Zhou
d59189b87f
[Doc] Update running command in README (#1206) 2024-05-30 00:06:39 +08:00
Fengzhe Zhou
0b50112dc1
[Fix] Rollback opt model configs (#1213) 2024-05-30 00:03:22 +08:00
Fengzhe Zhou
d656e818f8
[Docs] Remove --no-batch-padding and Use --hf-num-gpus (#1205)
* [Docs] Remove --no-batch-padding and Use -hf-num-gpus

* update
2024-05-29 16:30:10 +08:00
Xu Song
808582d952
Fix VLLM argument error (#1207) 2024-05-29 10:14:08 +08:00
Fengzhe Zhou
2954913d9b
[Sync] bump version (#1204) 2024-05-28 23:09:59 +08:00
liushz
ba620c4afe
Update accelerator (#1195)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Update accelerator

* Update MathBench

* Update accelerator

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-05-28 17:17:54 +08:00
Fengzhe Zhou
9fa80b0f93
[Feat] Update charm summary (#1194) 2024-05-27 16:17:01 +08:00
jxd
608ff5810d
support CHARM (https://github.com/opendatalab/CHARM) reasoning tasks (#1190)
* support CHARM (https://github.com/opendatalab/CHARM) reasoning tasks

* fix lint error

* add dataset card for CHARM

* minor refactor

* add txt

---------

Co-authored-by: wujiang <wujiang@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-27 13:48:22 +08:00
bittersweet1999
07a6dacf33
fix length (#1180) 2024-05-24 23:30:01 +08:00
bittersweet1999
88c14d3d04
add support for lmdeploy api judge (#1193) 2024-05-24 23:28:56 +08:00
yaoyingyy
749e4cea71
[Fix] temporary files using tempfile (#1186)
Co-authored-by: yaoying <yaoying@kingsoft.com>
2024-05-24 23:27:37 +08:00
klein
5eb8f14d97
[Fix] Fix drop_gen.py (#1191)
Fix the bug in drop_gen: wrong import
2024-05-24 23:17:50 +08:00
bittersweet1999
31afe87026
fix yi-chat template (#1178) 2024-05-21 18:14:12 +08:00
liushz
1448be00e2
Update MathBench (#1176)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Update accelerator

* Update MathBench

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-05-21 14:45:43 +08:00
Fengzhe Zhou
2b3d4150f3
[Sync] update evaluator (#1175) 2024-05-21 14:22:46 +08:00
zhulinJulia24
296ea59931
Update daily-run-test.yml (#1173) 2024-05-20 14:04:58 +08:00
Fengzhe Zhou
5de85406ce
[Sync] add OC16 entry (#1171) 2024-05-17 16:50:58 +08:00
zhulinJulia24
94eb90569f
update test workflow (#1167)
* Update pr-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update pr-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update oc_score_baseline.yaml

* Update daily-run-test.yml

* Update oc_score_assert.py

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-05-16 15:32:57 +08:00
Fengzhe Zhou
8ea2c404d7
[Feat] enable HuggingFacewithChatTemplate with --accelerator via cli (#1163)
* enable HuggingFacewithChatTemplate with --accelerator via cli

* rm vllm_internlm2_chat_7b
2024-05-15 21:51:07 +08:00
liushz
e3c0448bbc
Update accelerator (#1152)
* Update accelerator

* update run

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
Co-authored-by: Fengzhe Zhou <zfz-960727@163.com>
2024-05-15 14:31:47 +08:00
Fengzhe Zhou
f10dd48f9c
[Fix] Update stop_words in huggingface_above_v4_33 (#1160) 2024-05-15 14:10:33 +08:00
Fengzhe Zhou
80f831b425
[Fix] use ProcessPoolExecutor during mbpp eval (#1159) 2024-05-15 13:48:29 +08:00
bittersweet1999
8a8987be0b
fix arenahard summarizer (#1154)
Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-15 13:31:29 +08:00
Fengzhe Zhou
62dbf04708
[Sync] update github workflow (#1156) 2024-05-14 22:42:23 +08:00
Fengzhe Zhou
aa2dd2b58c
[Format] Add config lints (#892) 2024-05-14 15:35:58 +08:00
Xu Song
3dbba11945
[Feat] Support dataset_suffix check for mixed configs (#973)
* [Feat] Support dataset_suffix check for mixed configs

* update mixed suffix

* update suffix

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-14 15:03:28 +08:00
Fengzhe Zhou
7505b3cadf
[Feature] Add huggingface apply_chat_template (#1098)
* add TheoremQA with 5-shot

* add huggingface_above_v4_33 classes

* use num_worker partitioner in cli

* update theoremqa

* update TheoremQA

* add TheoremQA

* rename theoremqa -> TheoremQA

* update TheoremQA output path

* rewrite many model configs

* update huggingface

* further update

* refine configs

* update configs

* update configs

* add configs/eval_llama3_instruct.py

* add summarizer multi faceted

* update bbh datasets

* update configs/models/hf_llama/lmdeploy_llama3_8b_instruct.py

* rename class

* update readme

* update hf above v4.33
2024-05-14 14:50:16 +08:00
Mo Li
6c711cb262
[Fix] Fix Needlebench Summarizer (#1143)
* update few-shot example

* add 128k
2024-05-13 15:59:34 +08:00
bittersweet1999
5432dfc1ff
fix multiround (#1146) 2024-05-13 15:58:39 +08:00
bittersweet1999
833a35140b
[Fix] fix alpacaeval while adding caching path (#1139)
* fix alpacaeval

* fix alpacaeval
2024-05-11 14:02:26 +08:00
Fengzhe Zhou
19d7e630d6
[Sync] Update accelerator (#1122)
(cherry picked from commit 4beb6d9ab655d8a626971841b7acfd9fae9d438f)

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-05-09 14:32:31 +08:00
Alexander Lam
a71122ee18
[Feature] Add Qwen1.5 MoE 7b and Mixtral 8x22b model configs (#1123)
* added qwen moe and mixtral 8x22 model configs

* updated README files news section
2024-05-09 11:04:26 +08:00
Mo Li
cb080fa7de
[Fix] Fix NeedleBench Summarizer Typo (#1125)
* update needleinahaystack eval docs

* update needlebench summarizer

* fix english docs typo
2024-05-08 20:00:15 +08:00
bittersweet1999
826d8307ac
fix links (#1120) 2024-05-08 15:13:18 +08:00
JuhaoLiang
d2c40e5648
[Feature] Add AceGPT-MMLUArabic benchmark (#1099)
* add AceGPT-MMLUArabic benchmark

* update readme and fix lint issue

* remove unused package

* add MMLUArabic zero-shot settings

* rename filename and update readme
2024-05-08 15:00:26 +08:00
Fangyu Lei
862044fb7d
[Feature] Add S3Eval Dataset (#916)
* s3eval_branch

* update s3eval
2024-05-06 19:41:52 +08:00
Xu Song
d501710155
[Fix] Fix AGIEval chinese sets (#972)
* [Fix] Fix AGIEval chinese sets

* Create agieval_gen_617738.py

* [Fix] Fix AGIEval chinese sets

* Restore agieval_gen_64afd3.py

* Update agieval_gen.py

* Create agieval_mixed_0fa998.py

* Update agieval_mixed.py
2024-05-06 15:31:42 +08:00
Yggdrasill7D6
af10ecc272
add mgsm datasets (#1081)
* add mgsm datasets

* fix lint

* fix lint

* update mgsm

* update mgsm

* ease code spell

* update

* update

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-06 15:29:34 +08:00
klein
153c4fc988
[Feature] update drop dataset from openai simple eval (#1092)
* [Feature] update drop dataset from openai simple eval

* update drop template presentation

* update

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-06 13:37:08 +08:00
Fengzhe Zhou
d43392a3bb
[Feature] Add mmlu prompt from simple_evals, openai (#1074)
* add mmlu prompt from simple_evals, openai

* return empty str on failure
2024-05-06 13:26:26 +08:00
Yang Yong
53fe390454
fix LightllmApi workers bug (#1113) 2024-04-30 22:09:22 +08:00
Fengzhe Zhou
baed2ed9b8
update pre-commit (#891) 2024-04-30 10:59:41 +08:00
Alexander Lam
35c94d0cde
[Feature] Adding support for LLM Compression Evaluation (#1108)
* fixed formatting based on pre-commit tests

* fixed typo in comments; reduced the number of models in the eval config

* fixed a bug in LLMCompressionDataset, where setting samples=None would result in passing test[:None] to load_dataset

* removed unnecessary variable in _format_table_pivot; changed lark_reporter message to English
2024-04-30 10:51:01 +08:00
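The entry above fixes a bug where samples=None produced the invalid split string test[:None]. A minimal sketch of that kind of guard; the function and argument names are assumptions rather than the actual LLMCompressionDataset code:

```python
from typing import Optional

from datasets import load_dataset

def load_compression_split(path: str, samples: Optional[int] = None):
    # Only build a slice expression when a sample count is given; otherwise
    # request the whole split instead of the invalid 'test[:None]'.
    split = f'test[:{samples}]' if samples is not None else 'test'
    return load_dataset(path, split=split)
```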
Ikko Eltociear Ashimine
9c79224b39
[Docs] Update README.md (#1110)
requiresments -> requirements
2024-04-30 00:45:33 +08:00
bittersweet1999
3de48e9b35
[Bug] Fix CMB dataset (#1106) 2024-04-30 00:33:43 +08:00
Songyang Zhang
063f5f5f49
[Update] Update performance of common benchmarks (#1109)
* [Update] Update performance of common benchmarks

* [Update] Update performance of common benchmarks

* [Update] Update performance of common benchmarks
2024-04-30 00:09:08 +08:00
liushz
a6f67e1a65
[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README (#1103)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-04-28 21:58:58 +08:00
bittersweet1999
0b7de67c4a
fix prompt template (#1104) 2024-04-28 21:54:30 +08:00
Lyu Han
1013dce60c
adapt to lmdeploy v0.4.0 (#1073)
* adapt to lmdeploy v0.4.0

* compatible
2024-04-28 19:57:40 +08:00
Yggdrasill7D6
58a57a4c45
[Feature] add support for Flames datasets (#1093)
* add flames datasets

* fix lint

* rm quota

* add judgemodel info and fix os path

* support flames dataset

* support flames dataset

---------

Co-authored-by: bittersweet1999 <1487910649@qq.com>
2024-04-28 18:56:24 +08:00
Mo Li
76dd814c4d
[Doc] Update NeedleInAHaystack Docs (#1102)
* update NeedleInAHaystack Test Docs

* update docs
2024-04-28 18:51:47 +08:00
dmitrysarov
cce5b6fbb6
fix output typing, change mutable list to immutable tuple (#989)
* fix output typing, change mutable list to immutable tuple

* import missed type

* format

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 23:07:34 +08:00
binary-husky
701ecbb292
[Fix] python path bug (#1063)
* fix relative path bug

* format

---------

Co-authored-by: hmp <505030475@qq.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 21:58:45 +08:00
Wang Xingjin
048d41a1c4
add vllm get_ppl (#1003)
* add vllm get_ppl

* add vllm get_ppl

* format

---------

Co-authored-by: xingjin.wang <xingjin.wang@mihoyo.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 21:31:56 +08:00
Haodong Duan
3a232db471
[Deprecate] Remove multi-modal related stuff (#1072)
* Remove MultiModal

* update index.rst

* update README

* remove mmbench codes

* update news

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 21:20:14 +08:00
Francis-llgg
f1ee11de14
[Feature] Add gpqa prompt from simple_evals, openai (#1080)
* add gpqa_openai_simple_eval

* Trigger CI build

* reorg

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 20:13:00 +08:00
klein
e4830a6926
Update CIBench (#1089)
* modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4

* update cibench: dataset and evaluation

* cibench summarizer bug

* update cibench

* move extract_code import

---------

Co-authored-by: zhangchuyu@pjlab.org.cn <zhangchuyu@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 18:46:02 +08:00
bittersweet1999
e404b72c52
[Feature] support arenahard evaluation (#1096)
* support arenahard

* support arenahard

* support arenahard
2024-04-26 15:42:00 +08:00
bittersweet1999
6ba1c4937d
[Feature] Support Math evaluation via judgemodel (#1094)
* support openai math evaluation

* support openai math evaluation

* support openai math evaluation

* support math llm judge

* support math llm judge
2024-04-26 14:56:23 +08:00
Jingming Zhuo
41196c48ae
Add humaneval prompt from simple_evals, openai (#1076)
* [Feature] Add IFEval

* add humaneval prompt from simple_evals, openai
2024-04-24 17:40:50 +08:00
liushz
17735f0c13
Fix Llama-3 meta template (#1079)
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-04-24 16:46:25 +08:00
Ke Bao
81d0e4d793
[Feature] Add lmdeploy tis python backend model (#1014)
* add lmdeploy tis python backend model

* fix pr check

* update
2024-04-23 14:27:11 +08:00
Fengzhe Zhou
8fe7b271cc
[Fix] Fix sequential runner (#1070) 2024-04-23 11:31:10 +08:00
Fengzhe Zhou
004ed79593
[Feature] Add TheoremQA with 5-shot (#1048)
* add TheoremQA with 5-shot

* cherry pick from add-huggingface-above-v4.33, good TheoremQA results
2024-04-22 15:22:04 +08:00
Fengzhe Zhou
a256753221
[Feature] Add LLaMA-3 Series Configs (#1065)
* add LLaMA-3 Series configs

* update readme
2024-04-22 14:39:31 +08:00
bittersweet1999
6f98c8d9ab
[Fix] Fix MultiRound Subjective Evaluation (#1043)
* fix multiround

* fix
2024-04-22 12:06:03 +08:00
Fengzhe Zhou
8c85edd1cd
[Sync] deprecate old mbpps (#1064) 2024-04-19 20:49:46 +08:00
Robin Chen
c172401323
[Fix] Fixed repeated loading of VLLM (#1051)
* [fix] Fixed the issue caused by repeated loading of the VLLM model during task segmentation.

* [fix] avoid TypeError: VLLM.__init__() got an unexpected keyword argument 'tokenizer_only'

* restore .pre-commit-config.yaml

* restore opencompass/tasks/openicl_infer.py

---------

Co-authored-by: IcyFeather <mengzhuo.happy@gmail.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-17 20:36:08 +08:00
Songyang Zhang
629836146a
[Doc] Update README (#1053)
* [Update] Update readme

* [Update] Update readme

* [Update] Update readme
2024-04-16 19:54:12 +08:00
Fengzhe Zhou
881bdbf6bd
[Sync] Bump version to 0.2.4 (#1052)
(cherry picked from commit 16ac6306c72fa202173289b55eaefe85e0fcb73c)

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-04-16 18:09:46 +08:00
Fengzhe Zhou
7a41951dda
[Fix] logger.error -> logger.debug in OpenAI wrapper (#1050)
* logger.error -> logger.info in OpenAI

* logger.info -> logger.debug in OpenAI
2024-04-15 21:08:13 +08:00
liuwei130
a00e57296f
[Feature] Add ChemBench (#1032)
* add ChemBench

* update results

* molbench -> ChemBench

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-12 08:46:26 +08:00
Fengzhe Zhou
bd7c11bb89
[Fix] Update setup.py install_requires (#1036) 2024-04-11 11:11:34 +08:00
Fengzhe Zhou
b39f501563
[Sync] update taco (#1030) 2024-04-09 17:50:23 +08:00
Mo Li
16f29b25f1
[Fix] Simplify needlebench summarizer (#1024)
* Conflicts:
	configs/summarizers/needlebench.py

* fix lint problems
2024-04-07 17:51:13 +08:00
Mo Li
f2af49337d
[Feature] Add ATC Choice Version (#1019)
* Squashed commit of the following:

commit c48ad194c3976dc63d1b60d8c8ab2d5ff9e1cbfe
Author: DseidLi <2568818204@qq.com>
Date:   Tue Apr 2 16:57:43 2024 +0800

    add atc_choice

commit 3ac6efea29619573e6fac8fa3cce464853dcead0
Merge: 2d4e559 8e3a9c3
Author: DseidLi <2568818204@qq.com>
Date:   Tue Apr 2 16:41:38 2024 +0800

    Merge branch 'atc_choice' into atc_add_choice

commit 8e3a9c396a3e5546d3faf584183f6fd60b974d5e
Merge: 150a036 0a6a03f
Author: DseidLi <2568818204@qq.com>
Date:   Tue Mar 26 04:47:07 2024 +0800

    Merge branch 'main' into atc_choice

    Conflicts:
    	configs/summarizers/needlebench.py
    	opencompass/datasets/needlebench/multi.py
    	opencompass/datasets/needlebench/origin.py
    	opencompass/datasets/needlebench/parallel.py

commit 150a036d6d990f26a57c974d1af83d88c31a0f9d
Merge: 8d6ac9a 940dd18
Author: DseidLi <2568818204@qq.com>
Date:   Wed Mar 20 03:49:08 2024 +0800

    Merge branch 'needlebench_fix' into atc_choice

commit 8d6ac9a1a43b1c9d0f0ea27e7d58968a203ea898
Author: DseidLi <2568818204@qq.com>
Date:   Wed Mar 20 03:41:49 2024 +0800

    optimize needlebench code

commit 940dd18a4270f24bc69edd2a780182c68918e1a9
Author: DseidLi <2568818204@qq.com>
Date:   Wed Mar 20 03:39:46 2024 +0800

    fix vllm

commit d8be6877bc41051f3edcc0421c462c834c0f1c9a
Merge: ecad78a 2527fda
Author: DseidLi <2568818204@qq.com>
Date:   Tue Mar 19 21:07:08 2024 +0800

    Merge remote-tracking branch 'origin/add_1M_dataset' into atc_choice

commit 2527fda8a5
Author: DseidLi <2568818204@qq.com>
Date:   Tue Mar 19 16:03:40 2024 +0800

    add model configs

commit 75425acdf8
Author: DseidLi <2568818204@qq.com>
Date:   Tue Mar 19 16:02:15 2024 +0800

    add prompt position args

commit 367ba1ba61
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 21:40:00 2024 +0800

    add Needlebench-1000K configs

commit ecad78af14c4bb00fe325779114b384c57ab30bf
Author: DseidLi <2568818204@qq.com>
Date:   Thu Mar 14 22:08:32 2024 +0800

    fix atc

commit 08772c0787b18872abadc9ffec3223941a5ee0c2
Merge: 9f3f8cf caf1cf8
Author: DseidLi <2568818204@qq.com>
Date:   Thu Mar 14 22:07:28 2024 +0800

    Merge branch 'main' into atc_choice

    Conflicts:
    	configs/datasets/needlebench/readme.md
    	configs/datasets/needlebench/readme_zh-CN.md
    	configs/summarizers/needlebench.py
    	opencompass/datasets/needlebench/atc.py
    	opencompass/summarizers/needlebench.py

commit 9f3f8cfb4452722734d334114ac1d14110e57406
Author: DseidLi <2568818204@qq.com>
Date:   Thu Mar 14 21:35:53 2024 +0800

    add atc-choice test

commit 52be7c1202376b4e09821188b826f1a805328129
Author: DseidLi <2568818204@qq.com>
Date:   Wed Mar 6 02:54:15 2024 +0800

    update needlebench randomseed and add vllm qwen14b

commit fc1effce596ae2e5ece4933e8cd34aef8e64a6f9
Merge: 4e747ed caf1cf8
Author: DseidLi <2568818204@qq.com>
Date:   Wed Mar 6 02:51:14 2024 +0800

    Merge branch 'main' into add_model_configs

commit 31834f9b23af3354ac3581ec86d693d0f05cdd1c
Merge: 7dabc82 120bf8b
Author: DseidLi <2568818204@qq.com>
Date:   Sun Mar 3 23:29:42 2024 +0800

    Merge branch 'main' of https://github.com/open-compass/opencompass into atc_choice

commit 4e747ed1988ddbcfcc7fff334601259ade72d363
Author: DseidLi <2568818204@qq.com>
Date:   Sun Mar 3 22:15:25 2024 +0800

    add internlm2-lmdeploy model and gemma configs

commit 7dabc828123d711c8cf834d6aab4137bb55e85ed
Author: DseidLi <2568818204@qq.com>
Date:   Sat Mar 2 17:26:15 2024 +0800

    add atc choice version -ZH

commit 996f8ae43d
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 16:58:56 2024 +0800

    update readme for needlebench

commit f7266e873c
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 16:44:53 2024 +0800

    move readme.md

commit 1c7375681d
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 16:38:31 2024 +0800

    fix linting error

commit b6524f3ebf
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 16:33:51 2024 +0800

    lint summarizer

commit c0d1190e39
Author: DseidLi <2568818204@qq.com>
Date:   Wed Feb 28 16:29:03 2024 +0800

    add needlebench intro, fix summarizer

commit 0965baf785
Author: DseidLi <2568818204@qq.com>
Date:   Mon Feb 26 13:31:26 2024 +0800

    fix bug in needlebench summarizer

commit 5d32b31eb8
Author: DseidLi <2568818204@qq.com>
Date:   Sat Feb 24 03:19:08 2024 +0800

    update act prompt

commit af82a7f085
Merge: 32bf9fe 53fe788
Author: DseidLi <2568818204@qq.com>
Date:   Fri Feb 23 17:50:32 2024 +0800

    Merge remote-tracking branch 'upstream/main' into needlebench

commit 32bf9fe802
Author: DseidLi <2568818204@qq.com>
Date:   Fri Feb 23 17:31:32 2024 +0800

    simplify needlebench 32k, 128k, 200k for eval

commit a7cb025e05
Author: DseidLi <2568818204@qq.com>
Date:   Fri Feb 23 14:48:58 2024 +0800

    add needlebench

* fix summarizer

* remove repeated code

* remove chinese comments
2024-04-07 15:46:20 +08:00
Mo Li
b50d163265
[Fix] Refactor Needlebench Configs for CLI Testing Support (#1020)
* add needlebench datasets suffix

* fix import

* update run.py args for summarizer key and dataset suffix

* update utils/run.py
2024-04-07 15:12:56 +08:00
bittersweet1999
2d4e559763
[Feature] Add multi-model judge and fix some problems (#1016)
* support multi-model judge and moe judge

* test_moe

* test_moe

* test

* add moe judge

* support multi-judge-model
2024-04-02 11:52:06 +08:00
Y0oMu
c220550fb9
updates docs (#1015)
Co-authored-by: youmuspc <yejiayi2004@outlook.com>
2024-04-02 10:30:04 +08:00
bittersweet1999
02e7eec911
[Feature] Support AlpacaEval_V2 (#1006)
* support alpacaeval_v2

* support alpacaeval

* update docs

* update docs
2024-03-28 16:49:04 +08:00
Mo Li
0a6a03fe1a
[Feature] update needlebench and configs (#986)
* add Needlebench-1000K configs

* add prompt position args

* add model configs

* Update parallel.py

* fix lint
2024-03-25 18:05:01 +08:00
bittersweet1999
0665bb91a8
[Fix] Quick fix (#995) 2024-03-22 19:54:19 +08:00
Chaseldot
1d3198554b
[Fix] base.py change status into list (#994) 2024-03-22 17:06:34 +08:00
Ke Bao
e415ddf96a
[Fix] Fix turbomind_tis (#992) 2024-03-22 15:50:12 +08:00
bittersweet1999
054e9fa7e5
[Feature] add one script for subjective (#993)
* add one script for subjective

* add one script for subjective

* add one script for subjective

* add one script for subjective

---------

Co-authored-by: thebestannie <1290646445@qq.com>
2024-03-20 23:20:41 +08:00
Connor-Shen
0221d30877
[Fix] Update APPS/TACO (#988)
* [Feature] update apps/taco

* [Feature] update apps/taco
2024-03-19 20:21:39 +08:00
Connor-Shen
8a3c6e51ed
[Feature] Update APPS (#985)
* update post process

* update post process
2024-03-19 15:47:05 +08:00
Connor-Shen
d92595b671
[Feat] Support TACO (#966)
* [Feat] Support TACO

* update README

* update README
2024-03-19 15:39:16 +08:00
bittersweet1999
c78a4df923
add support for set prediction path (#984) 2024-03-19 14:32:15 +08:00
klein
4d2591acb2
modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4 (#983)
Co-authored-by: zhangchuyu@pjlab.org.cn <zhangchuyu@pjlab.org.cn>
2024-03-18 20:25:55 +08:00
Jingming
89a8a8917b
[Feature] Add the implementation of QuALITY datasets (#976)
#976
2024-03-15 21:22:38 +08:00
Jingming
c2d4717be2
[Fix] Fix a bug in internlm2 series configs (#977) 2024-03-15 15:21:35 +08:00
seanzhang-zhichen
7baa711fc7
[Fix] Fix doc problem (#975)
Co-authored-by: zhangzc <2608882093@qq.com>
2024-03-15 13:44:46 +08:00
Connor-Shen
3098d78845
[Bench] Support APPS (#963)
* [Feat] support apps

* [Feat] support apps

* [Feat] support apps

* update README
2024-03-13 16:09:23 +08:00
Fengzhe Zhou
2a741477fe
update links and checkers (#890) 2024-03-13 11:01:35 +08:00
Jingming
4c1533e59e
[Fix] fix the config's name of deepseek-coder (#964) 2024-03-12 19:36:52 +08:00
Fengzhe Zhou
ab6cdb2be8
[Sync] Bump version 0.2.3 (#957) 2024-03-12 11:51:56 +08:00
Fengzhe Zhou
64fde73b15
[Fix] Use logger.error on failure (#960) 2024-03-12 11:51:39 +08:00
Fengzhe Zhou
ed663ca17b
[Misc] Update owners (#961) 2024-03-12 11:51:25 +08:00
Songyang Zhang
47cb75a3f7
[Docs] Update README (#956)
* [Docs] Update README

* Update README.md

* [Docs] Update README
2024-03-12 11:40:34 +08:00
Fengzhe Zhou
bdd85358cc
[Sync] update 20240308 (#953) 2024-03-11 22:34:19 +08:00
bittersweet1999
848e7c8a76
[fix] add different temps for different questions in mtbench (#954)
* add temp for mtbench

* add document for mtbench

* add document for mtbench
2024-03-11 17:24:39 +08:00
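The entry above gives each MT-Bench question its own sampling temperature. A minimal sketch of a per-category mapping in the spirit of the common MT-Bench convention; the exact values here are assumptions, not read from the commit:

```python
# Creative categories sample more freely; math/reasoning/extraction stay
# (near-)greedy so answers remain deterministic.
CATEGORY_TEMPERATURE = {
    'writing': 0.7,
    'roleplay': 0.7,
    'extraction': 0.0,
    'math': 0.0,
    'coding': 0.0,
    'reasoning': 0.0,
    'stem': 0.1,
    'humanities': 0.1,
}

def temperature_for(category: str, default: float = 0.7) -> float:
    return CATEGORY_TEMPERATURE.get(category, default)

print(temperature_for('math'))  # 0.0
```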
Songyang Zhang
7c1a819bb4
[Fix] Chinese version of ReadTheDoc (#947)
* [Fix] Chinese version of ReadTheDoc

* rename

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-03-08 18:10:05 +08:00
Yang Yong
3829be87b1
Fix LightllmApi ppl test (#951) 2024-03-08 12:04:44 +08:00
Yang Yong
107e022cf4
Support prompt template for LightllmApi. Update LightllmApi token bucket. (#945) 2024-03-06 15:33:53 +08:00
RunningLeon
c54a5d3b0f
Support get_ppl for TurbomindModel (#878)
* update ppl for turbomindmodel

* update api_server

* rename config and set thread_safe for pytorch engine if possible
2024-03-06 11:44:19 +08:00
fanqiNO1
caf1cf8a17
[Docs] Update rank link (#911) 2024-03-05 20:33:44 +08:00
Xu Song
2e993989a6
[Fix] FinanceIQ_datasets import error (#939)
* [Fix] Fix KeyError: 'FinanceIQ_datasets'

* [Fix] Fix KeyError: 'FinanceIQ_datasets'
2024-03-05 20:32:24 +08:00
Jingming
66d3aa4c01
[Feature] Add configs of deepseek-coder (#943) 2024-03-05 11:38:28 +08:00
Jingming
d0550268f3
[Fix] fix a bug of humanevalplus config (#944) 2024-03-05 11:37:17 +08:00
Fengzhe Zhou
b03d5dc531
[Sync] Sync Internal (#941) 2024-03-04 14:42:36 +08:00
yuantao2108
bbec7d8733
[Feature] add lveval benchmark (#914)
* add lveval benchmark

* add LVEval readme file

* update LVEval readme file

* Update configs/eval_bluelm_32k_lveval.py

* Update configs/eval_llama2_7b_lveval.py

---------

Co-authored-by: yuantao <yuantao@infini-ai.com>
Co-authored-by: Mo Li <82895469+DseidLi@users.noreply.github.com>
2024-03-04 11:22:03 +08:00
Mo Li
8142f399a8
[Feature] Upgrade the needle-in-a-haystack experiment to Needlebench (#913)
* add needlebench

* simplify needlebench 32k, 128k, 200k for eval

* update act prompt

* fix bug in needlebench summarizer

* add needlebench intro, fix summarizer

* lint summarizer

* fix linting error

* move readme.md

* update readme for needlebench

* update docs of needlebench

* simplify needlebench summarizers
2024-03-04 11:10:52 +08:00
Mo Li
120bf8b399
add vllm model configs (#938) 2024-03-01 17:31:51 +08:00
Kdump
3e9844ed33
[Fix] Fixed the problem of never entering task.run() mode in local scheduling mode. (#930)
* Fixed the problem of never entering task.run() mode in local scheduling mode.

The get_command_template method adds CUDA_VISIBLE_DEVICES= or set CUDA_VISIBLE_DEVICES= as a command-line prefix, which causes the task.run() branch to fail.

* [Fix] Fixed the problem of never entering task.run() mode in local scheduling mode.

The get_command_template method adds CUDA_VISIBLE_DEVICES= or set CUDA_VISIBLE_DEVICES= as a command-line prefix, which causes the task.run() branch to fail.

* [Fix] Fixed the problem of never entering task.run() mode in local scheduling mode.

The get_command_template method adds CUDA_VISIBLE_DEVICES= or set CUDA_VISIBLE_DEVICES= as a command-line prefix, which causes the task.run() branch to fail.
2024-02-29 14:35:45 +08:00
Skyfall-xzz
4c45a71bbc
[Feature] Support OpenFinData (#896)
* [Feature] Support OpenFinData

* add README for OpenFinData

* update README
2024-02-29 12:55:07 +08:00
bittersweet1999
001e77fea2
[Feature] add support for gemini (#931)
* add gemini

* add gemini

* add gemini
2024-02-28 19:38:34 +08:00
Fengzhe Zhou
9afbfa3639
[Sync] Fix TEvalEvaluator (#929) 2024-02-28 16:05:30 +08:00
Fengzhe Zhou
ba7cd58da3
[Update] Rename dataset pack (#922) 2024-02-28 10:54:04 +08:00
Fengzhe Zhou
5ce8e0450e
[Fix] Fix type hint in IFEval (#915) 2024-02-28 10:53:40 +08:00
Jingming
53fe788d27
[Fix] fix ifeval (#909) 2024-02-23 16:52:03 +08:00
bittersweet1999
45c606bcd0
[Fix] Fix IFEval (#906)
* fix ifeval

* fix ifeval

* fix ifeval

* fix ifeval
2024-02-22 16:51:34 +08:00
RunningLeon
32ba0b074e
Support lmdeploy pytorch engine (#875)
* add lmdeploy pytorch model

* fix

* speed up encoding and decoding

* fix

* change tokenizer
2024-02-22 03:46:07 -03:00
Xu Song
6d04decab4
[Fix] Fix moss template config (#897) 2024-02-21 11:19:24 +08:00
Fengzhe Zhou
2b7d376e3d
[Fix] Fix chatglm2 config (#893) 2024-02-19 14:55:53 +08:00
Fengzhe Zhou
9119e2ac39
[Fix] rename qwen2-beta -> qwen1.5 (#894) 2024-02-19 14:55:35 +08:00
Yang Yong
b6e21ece38
Support LightllmApi input_format (#888) 2024-02-19 10:02:59 +08:00
Fengzhe Zhou
08133e060a
[Sync] Bump version to 0.2.2 (#880) 2024-02-07 10:45:48 +08:00
hailsham
e257254b00
[Feature] add global retriever config (#842)
* add global retriever config

* give zero shot overwrite example

* give zero shot overwrite example

---------

Co-authored-by: Lei Fei <SENSETIME\leifei1@cn3114002087l.domain.sensetime.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-07 00:30:20 +08:00
hailsham
dd444685bb
fix bug of gsm8k_postprocess (#863)
* fix bug of gsm8k_postprocess

* update postprocess

---------

Co-authored-by: Lei Fei <SENSETIME\leifei1@cn3114002087l.domain.sensetime.com>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-06 23:52:47 +08:00
Connor-Shen
444d8d9507
[feat] support multipl-e (#846)
* [feat] support humaneval_multipl-e

* format

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-06 23:30:28 +08:00
Yggdrasill7D6
a6c49f15ce
fix lawbench 2-1 f0.5 score calculation bug (#795)
* fix lawbench 2-1 f0.5 score calculation bug

* use path in overall datasets folder

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-06 22:20:11 +08:00
bittersweet1999
1c8e193de8
[Fix] hotfix for mtbench (#877)
* hotfix for mtbench

* hotfix
2024-02-06 21:26:47 +08:00
Fengzhe Zhou
d34ba11106
[Sync] Merge branch 'dev' into zfz/update-keyset-demo (#876) 2024-02-05 23:29:10 +08:00
bittersweet1999
32b5948f4e
[Fix] add do sample demo for subjective dataset (#873)
* add do sample demo for subjective dataset

* fix strings

* format

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-05 15:55:58 +08:00
Skyfall-xzz
7ad1168062
Support NPHardEval (#835)
* support NPHardEval

* add .md file and fix minor bugs

* refactor and minor fix

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-05 15:52:28 +08:00
zhulinJulia24
b4a9acd7be
Update daily test (#871)
* add daily test case

* Update pr-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update pr-run-test.yml

* Update daily-run-test.yml

* Update oc_score_assert.py

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* update testcase baseline

* fix test case name

* add more models into daily test

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-02-05 15:52:00 +08:00
Fengzhe Zhou
fc84aff963
[CI] Update github workflow cuda image (#874)
* update workflow

* another trial

* another trial

* another trial
2024-02-05 15:22:59 +08:00
Yuchen Yan
fed7d800c6
[Fix] Fix error in gsm8k evaluator (#782)
Co-authored-by: jiangjin1999 <1261842974@qq.com>
2024-02-04 22:55:11 +08:00
bittersweet1999
7806cd0f64
[Feature] support alpacaeval (#809)
* support alpacaeval_v1

* Update opencompass/summarizers/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/summarizers/subjective/alpacaeval_v1.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* fix conflict

* support alpacaeval v2

* support alpacav2

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-02-04 14:18:36 +08:00
zhulinJulia24
0919b08ec8
[Feature] Add daily test case (#864)
* add daily test case

* Update pr-run-test.yml

* Update daily-run-test.yml

* Update daily-run-test.yml

* Update pr-run-test.yml

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-02-02 12:03:05 +08:00
RunningLeon
4c87e777d8
[Feature] Add end_str for turbomind (#859)
* fix

* update

* fix internlm1

* fix docs

* remove sys
2024-02-01 22:31:14 +08:00
bittersweet1999
5c6dc908cd
fix compass arena (#854) 2024-01-30 16:34:38 +08:00
Guo Qipeng
4f78388c71
Update runtime.txt to fix rouge_chinese bugs. (#803)
* Update runtime.txt to fix rouge_chinese bugs.

The wheel file of rouge_chinese overwrites the rouge package, causing bugs. It is replaced with the GitHub source, which is the correct version.

* fix PEP format issues

* fix PEP format issues

* enable pip install

---------

Co-authored-by: 郭琦鹏 <guoqipeng@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-01-29 19:18:22 +08:00
del-zhenwu
e8067ac456
Create link-check.yml (#853)
* Create link-check.yml

* Update link-check.yml
2024-01-29 19:16:52 +08:00
Songyang Zhang
cdca59ff49
[Fix] Update Zhipu API and Fix issue min_out_len issue of API models (#847)
* Update zhipu api and fix min_out_len issue of API class

* Update example

* Update example
2024-01-28 14:52:43 +08:00
Jingming
2801883351
[Fix] Fix acc of IFEval (#849)
* [Feature] Add IFEval

* [Fix] Changing the Score Rule.
2024-01-27 22:27:07 +08:00
Xiaoming Shi
35aace776a
[Fix] Update MedBench (#845) 2024-01-26 17:56:13 +08:00
Songyang Zhang
8ed022b4c4
Update Sensetime API (#844) 2024-01-26 16:40:49 +08:00
Hubert
4aa74565e2
[Feat] minor update agent related (#839)
* [Feat] update cibench

* [Feat] Support CIBench

* [Feat] Support CIBench

* [Feat] Support CIBench

* [Feat] Support CIBench
2024-01-26 14:15:51 +08:00
bittersweet1999
77be07dbb5
[Fix] fix corev2 (#838)
* fix corev2

* fix corev2
2024-01-24 18:15:29 +08:00
Fengzhe Zhou
0991dd33a0
[Sync] Update dataset cfg for internMath (#837)
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-01-24 16:30:32 +08:00
zhulinJulia24
f7d7837ac0
add fail notify (#836) 2024-01-24 14:26:30 +08:00
Fengzhe Zhou
f367551668
update doc (#830) 2024-01-24 13:39:28 +08:00
Songyang Zhang
793e32c9cc
[Feature] Update API implementation (#834) 2024-01-24 13:35:21 +08:00
bittersweet1999
2ee8e8a1a1
[Feature] add mtbench (#829)
* add mtbench

* add mtbench

* Update configs/datasets/subjective/multiround/mtbench_judgeby_gpt4.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/datasets/subjective/multiround/mtbench_judgeby_gpt4.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/mtbench.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* fix mtbench

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-01-24 12:11:47 +08:00
Jingming
e059a5c2bf
[Feature] Add IFEval (#813)
* [Feature] Add IFEval

* [Doc] add introduction of IFEval
2024-01-23 20:07:49 +08:00
bittersweet1999
3d9bb4aed7
[Fix] fix strings (#833)
* add compass arena

* add compass_arena

* add compass arena

* Update opencompass/summarizers/subjective/compass_arena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/summarizers/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/compass_arena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/eval_subjective_compassarena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/datasets/subjective/compassarena/compassarena_compare.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/eval_subjective_compassarena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/datasets/subjective/compassarena/compassarena_compare.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* fix check position bias

* fix string

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-01-23 10:57:26 +00:00
bittersweet1999
2d4da8dd02
[Feature] Add CompassArena (#828)
* add compass arena

* add compass_arena

* add compass arena

* Update opencompass/summarizers/subjective/compass_arena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/summarizers/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/compass_arena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update opencompass/datasets/subjective/__init__.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/eval_subjective_compassarena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/datasets/subjective/compassarena/compassarena_compare.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/eval_subjective_compassarena.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* Update configs/datasets/subjective/compassarena/compassarena_compare.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* fix check position bias

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-01-23 15:12:46 +08:00
RangiLyu
40a2441deb
Update hf_internlm2_chat template (#823)
* Update hf_internlm2_chat template

* Update 20B
2024-01-19 18:21:47 +08:00
Guo Qipeng
e975a96fa1
Update cdme config and evaluator (#812)
* update cdme config and evaluator

* fix cdme prompt

* move CDME trim post-processor as a separate evaluator

---------

Co-authored-by: 郭琦鹏 <guoqipeng@pjlab.org.cn>
2024-01-19 11:29:27 +08:00
Yang Yong
f09a2ff418
Add LightllmApi KeyError log & Update doc (#816)
* Add LightllmApi KeyError log

* Update LightllmApi doc
2024-01-18 22:23:38 +08:00
zhulinJulia24
8b5c467cc5
Test runner update - split step, change schedule time and disable hf cache (#814)
* Update pr-run-test.yml

* Update pr-run-test.yml

* Update pr-run-test.yml

* split step and change order, change schedule time and disable hf cache
2024-01-18 21:04:41 +08:00
Mo Li
dcc32ed856
[Fix] Update yi 200k config (#815) 2024-01-18 20:54:24 +08:00
RunningLeon
61fe873c89
[Fix] Fix turbomind and update docs (#808)
* update

* update docs

* add engine_config and gen_config in eval_config

* update

* fix

* fix

* fix

* fix docstr

* fix url
2024-01-18 14:41:35 +08:00
Fengzhe Zhou
9e5746d3d8
[Doc] Update News (#810) 2024-01-17 18:22:12 +08:00
Fengzhe Zhou
b4afe3e7c1
[Sync] Add InternLM2 Keyset Evaluation Demo (#807)
Co-authored-by: zhangyifan1 <zhangyifan1@pjlab.org.cn>
2024-01-17 13:48:12 +08:00
Mo Li
acae560911
Added support for multi-needle testing in needle-in-a-haystack test (#802)
* Add NeedleInAHaystack Test

* Apply pre-commit formatting

* Update configs/eval_hf_internlm_chat_20b_cdme.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* add needle in haystack test

* update needle in haystack test

* update plot function in tools_needleinahaystack.py

* optimizing needleinahaystack dataset generation strategy

* modify minor formatting issues

* add English version support

* change NeedleInAHaystackDataset to dynamic loading

* change NeedleInAHaystackDataset to dynamic loading

* fix needleinahaystack test eval bug

* fix needleinahaystack config bug

* Added support for multi-needle testing in needle-in-a-haystack test

* Optimize the code for plotting in the needle-in-a-haystack test.

* Correct the typo in the dataset parameters.

* update needleinahaystack test docs

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-01-17 13:47:34 +08:00
RunningLeon
0836aec67b
[Feature] Update evaluate turbomind (#804)
* update

* fix

* fix

* fix
2024-01-17 11:09:50 +08:00
bittersweet1999
814b3f73bd
reorganize subject files (#801) 2024-01-16 18:03:11 +08:00
zhulinJulia24
2cd091647c
Add test runner, one case, daily and pr trigger (#751)
* init test yaml

* add simple pr

* update

* update

* change name

* Update pr-run-test.yml

* Update pr-run-test.yml

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-01-16 11:59:22 +08:00
bittersweet1999
83d6c48378
[Feature] Add configs for creationbench (#791)
* add creationv2_zh

* add creationv2_zh

* add eng config for creationbench

* add eng config for creationbench

* add eng config for creationbench
2024-01-12 14:20:21 +08:00
Hubert
d0dc3534e5
[Fix] hot fix for requirements (#789) 2024-01-11 15:48:32 +08:00
Songyang Zhang
467ad0ac21
Update gsm8k agent prompt (#788) 2024-01-11 14:07:36 +08:00
notoschord
d3a0ddc3ef
[Feature] Add support for Nanbeige API (#786)
Co-authored-by: notoschord <wangzekai@kanzhun.com>
2024-01-11 13:54:27 +08:00
bittersweet1999
5679edb490
add temperature in alles (#787) 2024-01-11 03:57:24 +00:00
Xiaoming Shi
ad872a5dc2
[Feature] Update MedBench (#779)
* update medbench

* medbench update

* format medbench

* format

* Update

* update

* update

* update suffix

---------

Co-authored-by: 施晓明 <PJLAB\shixiaoming@pjnl104220118l.pjlab.org>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-01-09 11:42:44 +08:00
Fengzhe Zhou
a74e4c1a8d
[Sync] Bump version to 0.2.1 (#778) 2024-01-08 14:56:28 +00:00
Fengzhe Zhou
32f40a8f83
[Sync] Sync with internal codes 2023.01.08 (#777) 2024-01-08 14:07:24 +00:00
jiangjin1999
8194199d79
[Feature] *_batch_generate* function, add the MultiTokenEOSCriteria (#772)
* jiangjin1999: in the _batch_generate function, add the MultiTokenEOSCriteria feature to speed up inference.

* jiangjin1999: in the _batch_generate function, add the MultiTokenEOSCriteria feature to speed up inference.

---------

Co-authored-by: jiangjin08 <jiangjin08@MBP-2F32S5MD6P-0029.local>
Co-authored-by: jiangjin08 <jiangjin08@a.sh.vip.dianping.com>
2024-01-08 16:40:02 +08:00
Fengzhe Zhou
f78fcf6eeb
[Docs] Update contamination docs (#775) 2024-01-08 16:37:28 +08:00
liyucheng09
0b2863039e
[Feature] Contamination analysis for MMLU, Hellaswag, and ARC_c (#699)
* Contamination analysis for ARC_c, mmlu, and Hellaswag

* update `eval_contamination.py`

* update `contamination.py` summarizer

* fix `eval_contamination.py`

* add mmlu groups for contamination analysis
2024-01-08 15:51:48 +08:00
tpoisonooo
ba1b684fec
typo(installation.md): fix unzip commands (#774)
* Update installation.md

* Update installation.md
2024-01-08 14:23:35 +08:00
Yuchen Yan
11f3b91e78
[Fix] fix typos in drop prompt (#773)
Co-authored-by: yanyuchen04 <yanyuchen04@meituan.com>
2024-01-08 14:22:35 +08:00
Connor-Shen
30a90d8dd8
Support Mbpp_plus dataset (#770)
* support mbpp+

* support mbpp+

* minor fix

* [Feat] minor fix

---------

Co-authored-by: yingfhu <yingfhu@gmail.com>
2024-01-05 22:01:57 +08:00
bittersweet1999
3c606cb712
quick fix for postprocess pred extraction (#771) 2024-01-05 21:10:18 +08:00
Songyang Zhang
0c75f0f95a
[Update] Update introduction of CompassBench-2024-Q1 (#769)
* [Doc] Update Example of CompassBench

* [Doc] Update Example of CompassBench

* [Doc] Update Example of CompassBench

* update

* Update docs/zh_cn/advanced_guides/compassbench_intro.md

Co-authored-by: Fengzhe Zhou <zfz-960727@163.com>

---------

Co-authored-by: Fengzhe Zhou <zfz-960727@163.com>
2024-01-05 20:39:36 +08:00
bittersweet1999
2163f9398f
[Feature] add subject ir dataset (#755)
* add subject ir

* Add ir dataset

* Add ir dataset
2024-01-05 12:00:57 +00:00
bittersweet1999
be369c3e06
[Feature] Add multi_round dataset evaluation (#766)
* multi_round dataset

* add multi_round evaluation
2024-01-04 10:37:52 +00:00
bittersweet1999
7cd65d49d8
[Fix] Fix small bug in alignbench (#764)
* fix small bugs

* fix small bugs
2024-01-03 07:44:53 +00:00
Chris Liu
3eb225a5e6
[Feature] Support LLaMA2-Accessory (#732)
* Support LLaMA2-Accessory

* remove strip

* clear imports

* reformat

* fix lint

* fix lint

* update readme

* update readme

* update readme

* update readme
2024-01-02 20:48:51 +08:00
HUANG Fei
ba027eeeac
[Feature] Add support of qwen api (#735) 2024-01-02 20:47:12 +08:00
Mo Li
33f8df1ca3
[Update] Change NeedleInAHaystackDataset to dynamic dataset loading (#754)
* Add NeedleInAHaystack Test

* Apply pre-commit formatting

* Update configs/eval_hf_internlm_chat_20b_cdme.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* add needle in haystack test

* update needle in haystack test

* update plot function in tools_needleinahaystack.py

* optimizing needleinahaystack dataset generation strategy

* modify minor formatting issues

* add English version support

* change NeedleInAHaystackDataset to dynamic loading

* change NeedleInAHaystackDataset to dynamic loading

* fix needleinahaystack test eval bug

* fix needleinahaystack config bug

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-01-02 17:22:56 +08:00
Francis-llgg
b69fe2343b
[Feature] Add GPQA Dataset (#729)
* check

* message

* add

* change prompt

* change a parameter name

* modify name of the file

* delete a useless file
2024-01-01 15:54:40 +08:00
Francis-llgg
ef3ae63539
[Feature] Add new dataset mastermath2024v1 (#744)
* add new dataset mastermath2024v1

* change it to simplified chinese prompt

* change file name
2024-01-01 15:53:24 +08:00
Mo Li
17b8e929dd
[Feature] Update plot function in tools_needleinahaystack.py (#747)
* Add NeedleInAHaystack Test

* Apply pre-commit formatting

* Update configs/eval_hf_internlm_chat_20b_cdme.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* add needle in haystack test

* update needle in haystack test

* update plot function in tools_needleinahaystack.py

* optimizing needleinahaystack dataset generation strategy

* modify minor formatting issues

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2023-12-29 18:51:09 +08:00
Hubert
327951087f
[Feat] update code config (#749)
* [Feat] update code dataset

* [Feat] update code dataset

* [Feat] update code dataset
2023-12-29 18:46:34 +08:00
bittersweet1999
fe0b717033
add creationbench (#753) 2023-12-29 10:03:44 +00:00
bittersweet1999
8728287a55
fix error in configs (#750) 2023-12-28 11:53:07 +00:00
Connor-Shen
81098722d2
add chinese version of humaneval, mbpp (#743)
* add chinese_version of humaneval,mbpp

* add humaneval&mbpp gen.py

* minor fix

* minor add

---------

Co-authored-by: yingfhu <yingfhu@gmail.com>
2023-12-28 14:47:56 +08:00
bittersweet1999
db919f0191
[Fix] SubSizePartition fix (#746)
* fix subjective_eval

* subject_eval partition situation fixed

* subject_eval partition situation fixed
2023-12-28 11:46:46 +08:00
Hubert
0a525985e8
[Feature] Support sanitized MBPP dataset (#745) 2023-12-27 22:17:23 +08:00
bittersweet1999
dfd9ac0fd9
[Feature] Add other judgelm prompts for Alignbench (#731)
* add judgellm prompts

* add judgelm prompts

* update import info

* fix situation that no abbr in config

* fix situation that no abbr in config

* add summarizer for other judgellm

* change config name

* add maxlen

* add maxlen

* dict assert

* dict assert

* fix strings

* fix strings
2023-12-27 17:54:53 +08:00
Yang Yong
54345c56b7
Update LightllmApi and Fix mmlu bug (#738)
* Update LightllmApi and Fix mmlu bug

* checkout mmlu_gen_a484b3.py

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2023-12-27 13:49:08 +08:00
philipwangOvO
34561ececb
[Feature] Add InfiniteBench (#739)
* add InfiniteBench

* add InfiniteBench

---------

Co-authored-by: wangchonghua <wangchonghua@pjlab.org.cn>
2023-12-26 15:36:27 +08:00
Fengzhe Zhou
3a68083ecc
[Sync] update configs (#734) 2023-12-25 21:59:16 +08:00
Songyang Zhang
ad96f2156f
Update merge script (#733) 2023-12-25 16:45:22 +08:00
AllentDan
336d8d76ff
add turbomind restful api support (#693)
* add turbomind restful api support

* config

* top_p 0.8

* top_k = 1
2023-12-24 01:40:00 +08:00
bittersweet1999
e985100cd1
[Fix] Fix subjective alignbench (#730) 2023-12-23 20:06:53 +08:00
Mo Li
0e24f4213e
[Feature] Add NeedleInAHaystack Test Support (#714)
* Add NeedleInAHaystack Test

* Apply pre-commit formatting

* Update configs/eval_hf_internlm_chat_20b_cdme.py

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>

* add needle in haystack test

* update needle in haystack test

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2023-12-23 12:00:51 +08:00
loveSnowBest
4a2d1926a2
[News] add news for T-Eval (#727)
* add news for teval

* update

* update doc for cz&en
2023-12-22 19:58:24 +08:00
RunningLeon
e34c552282
[Feature] Update configs for evaluating chat models like qwen, baichuan, llama2 using turbomind backend (#721)
* add llama2 test

* fix

* test qwen chat-7b

* test w4

* add baichuan2

* update

* update

* update configs and docs

* update
2023-12-21 18:22:17 +08:00
bittersweet1999
fbb912ddf3
[Feature] Add abbr for judgemodel in subjective evaluation (#724)
* add_judgemodel_abbr

* add judgemodel abbr
2023-12-21 15:58:20 +08:00
Skyfall-xzz
b35d991786
[Feature] Add ReasonBench(Internal) dataset (#577)
* [Feature] Add reasonbench dataset

* add configs for supporting generative inference & merge datasets in the same category

* modify config filename to prompt version

* fix codes to meet pre-commit requirements

* lint the code to meet pre-commit requirements

* Align Load_data Sourcecode Briefly

* fix bugs

* reduce code redundancy
2023-12-20 17:57:42 +08:00
Jingming
76a95e9e81
[Feature] Support the use of humaneval_plus. (#720)
* [Feature] Support the use of humaneval_plus.

* [Feature] Add humaneval_plus_gen.py

* minor check

* [Fix] Fix bug

---------

Co-authored-by: yingfhu <yingfhu@gmail.com>
2023-12-20 17:25:17 +08:00
bittersweet1999
47e745d748
quick fix for maxoutlen (#719) 2023-12-20 00:00:28 +08:00
Hubert
fdf18a3238
[Docs] Update Docker docs (#718)
* [Docs] update docker docs

* [Docs] update docker docs
2023-12-19 23:29:43 +08:00
Hubert
5e8b838f51
[Feat] Update math/agent (#716)
* minor add

* minor add

* minor fix
2023-12-19 21:20:42 +08:00
bittersweet1999
97c2068bd9
[Feature] Add JudgeLLMs (#710)
* add judgellms

* add judgellms

* add sub_size_partition

* add docs

* add ref
2023-12-19 18:40:25 +08:00
Hubert
eda72e756e
[Fix] minor fix openai (#711) 2023-12-18 15:45:31 +08:00
Songyang Zhang
637628a70f
[Doc] Update Doc for Alignbench (#707)
* update alignmentbench

* update alignmentbench

* update doc

* update

* update
2023-12-15 15:07:25 +08:00
Jingming
d7e7a637a5
[Fix] fix a bug on configs/eval_mixtral_8x7b.py (#706) 2023-12-15 14:15:32 +08:00
DseidLi
db2920326a
[Fix] remove redundant in gsm8k.py (#700)
Removed redundant code in GSM8KDataset.load method.
2023-12-14 19:55:58 +08:00
Songyang Zhang
bfe4aa2af5
[Fix] Update alignmentbench (#704)
* update alignmentbench

* update alignmentbench

* update alignmentbench
2023-12-14 18:24:21 +08:00
bittersweet1999
1fe152b3e8
[Feature] Support AlignmentBench infer and judge (#697)
* alignmentbench infer and judge

* alignmentbench

* alignmentbench done

* alignment all done

* alignment all done
2023-12-13 19:59:30 +08:00
Fengzhe Zhou
cadab9474f
[Doc] Update contamination docs (#698)
* update contamination docs

* add citation

* Update contamination_eval.md

* Update contamination_eval.md

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2023-12-13 18:03:39 +08:00
Hubert
a94598d921
[Feat] update python action and slurm (#694) 2023-12-13 10:41:10 +08:00
bittersweet1999
6130394165
[Feature] Add double order of subjective evaluation and removing duplicated response among two models (#692)
* add features

* add doc string

* add doc string
2023-12-12 20:58:17 +08:00
Xiaoyu Zhang
82a533a690
add rwkv-5-3b model (#666)
* support rwkv5-3b learnboard

* update rwkv-5-3b config

* update config

* refine

* fix bug

* update config

* refine

* reduce batch size

* refine

* reduce batch size to avoid oom in special datasets

* Update huggingface.py

* Update huggingface.py
2023-12-12 18:15:19 +08:00
Hubert
4780b39eda
[Sync] format (#690)
Co-authored-by: Leymore <zfz-960727@163.com>
2023-12-12 14:03:45 +08:00
bittersweet1999
3e77175720
[Fix] Hotfix for Subjective Evaluation (#686) 2023-12-12 09:22:08 +08:00
bittersweet1999
465308e430
[Feature] Add Subjective Evaluation (#680)
* new version of subject

* fixed draw

* fixed draw

* fixed draw

* done

* done

* done

* done

* fixed lint
2023-12-11 22:22:11 +08:00
Hubert
4f0b373a0a
[Fix] fix docstring (#684) 2023-12-11 19:12:01 +08:00
Hubert
e78857ac36
[Sync] minor test (#683) 2023-12-11 17:42:53 +08:00
Jingming
dd4318f6ab
[Feature] enhance the ability of humaneval_postprocess (#676)
* [Feature] enhance the ability of humaneval_postprocess

* refactor

* [Feature] Keep the old version of the function and realize the new function in humaneval_postprocess_v2.

* Update opencompass/datasets/humaneval.py

---------

Co-authored-by: Leymore <zfz-960727@163.com>
Co-authored-by: Hubert <42952108+yingfhu@users.noreply.github.com>
2023-12-11 14:39:56 +08:00
Hubert
1029119e39
[Feat] support pr merge test ci (#669)
* [Feat] support ci

* [Feat] support ci

* [Feat] support ci

* [Feat] support ci

* init docs

* init docs

* init docs
2023-12-11 14:12:04 +08:00
Haodong Duan
6a928b996a
[Doc] Update README (#682) 2023-12-10 21:27:46 +08:00
Songyang Zhang
e25c5f9525
[Enhancement] Update API Interface and Mixtral (#681)
* [Enhancement] Update API interface

* [Enhancement] Update API interface

* Update mixtral

* Update readme
2023-12-10 13:29:26 +08:00
Xiaoming Shi
1bf85949ef
[Feature] Add medbench (#678)
* update medbench

* medbench update

* format medbench

* format

---------

Co-authored-by: 施晓明 <PJLAB\shixiaoming@pjnl104220118l.pjlab.org>
Co-authored-by: Leymore <zfz-960727@163.com>
2023-12-09 16:05:46 +08:00
Jingming
7cb53a95fa
[Fix] fix bug on standart_deviation summarizer (#675) 2023-12-08 13:38:07 +08:00
liyucheng09
05bbce8b08
[Feature] Add Data Contamination Analysis (#639)
* add contamination analysis to ceval

* fix bugs

* add contamination docs

* to pass CI check

* update

---------

Co-authored-by: zhangyifan1 <zhangyifan1@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2023-12-08 10:00:11 +08:00
Fengzhe Zhou
3a354bd1da
add qwen and deepseek configs (#672) 2023-12-07 20:29:00 +08:00
bittersweet1999
1c95790fdd
New subjective judgement (#660)
* TabMWP

* TabMWP

* fixed

* fixed

* fixed

* done

* done

* done

* add new subjective judgement

* add new subjective judgement

* add new subjective judgement

* add new subjective judgement

* add new subjective judgement

* modified to a more general way

* modified to a more general way

* final

* final

* add summarizer

* add new summarize

* fixed

* fixed

* fixed

---------

Co-authored-by: caomaosong <caomaosong@pjlab.org.cn>
2023-12-06 13:28:33 +08:00
rolellm
e10f1c9139
added rolebench dataset. (#633)
* added rolebench

* Renamed variables that had unreasonable names

* Renamed the variables pointed out in review comments
2023-12-01 22:54:42 +08:00
liushz
f4bbff6537
[Feature] Update MathBench CodeInterpreter & fix MathBench Bug (#657)
* Update MathBench CodeInterpreter & fix MathBench Bug

* Fix errors

* update

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
Co-authored-by: Fengzhe Zhou <zfz-960727@163.com>
2023-12-01 22:27:24 +08:00
Hubert
9eb5cadcac
[Feat] update gsm8k and math agent config (#652)
* [Feat] update gsm8k and math agent config

* minor fix
2023-12-01 15:08:38 +08:00
liushz
a331c9abfd
[Feature] Add wikibench dataset (#655)
* Add WikiBench

* Add WikiBench

* format

---------

Co-authored-by: Leymore <zfz-960727@163.com>
2023-12-01 14:56:54 +08:00
liushz
e019c831fe
[Feature] Add Chinese version: commonsenseqa, crowspairs and nq (#144)
* add Chinese version: csqa crowspairs nq

* Update cn_data

* Update cn_data

* update format

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2023-11-30 15:33:02 +08:00
Ma Zerun
6aaf3b91ec
[Feature] Support chat style inferencer. (#643)
* [Feature] Support chat style inferencer.

* [Fix] use new prompt

* [Fix] use new prompt

---------

Co-authored-by: yingfhu <yingfhu@gmail.com>
2023-11-30 14:00:06 +08:00
Fengzhe Zhou
5933c04fda
fix hellaswag_ppl_47bff9 (#648) 2023-11-29 16:51:44 +08:00
Hubert
e9e75fb4eb
[Fix] remove colossalai dependency (#645) 2023-11-28 14:09:44 +08:00
3045 changed files with 156062 additions and 21856 deletions

.github/scripts/eval_regression_api.py

@@ -0,0 +1,42 @@
from mmengine.config import read_base

from opencompass.models.openai_api import OpenAISDK

with read_base():
    # choose a list of datasets
    from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
        gsm8k_datasets  # noqa: F401, E501
    from opencompass.configs.datasets.race.race_gen import \
        race_datasets  # noqa: F401, E501

datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
    reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)

models = [
    dict(
        abbr='lmdeploy-api-test',
        type=OpenAISDK,
        key='EMPTY',
        openai_api_base='http://localhost:23333/v1',
        path='internlm3',
        tokenizer_path='internlm/internlm3-8b-instruct',
        rpm_verbose=True,
        meta_template=api_meta_template,
        query_per_second=128,
        max_out_len=1024,
        max_seq_len=4096,
        temperature=0.01,
        batch_size=128,
        retry=20,
    )
]

for d in datasets:
    d['reader_cfg']['test_range'] = '[0:16]'
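The config above talks to an OpenAI-compatible endpoint (here assumed to be an LMDeploy api_server) at http://localhost:23333/v1 and evaluates only the first 16 samples of each dataset. As a rough illustration, a minimal smoke test of the kind of request such an endpoint is expected to serve, assuming the openai Python SDK is installed and the server is already running, might look like this:

from openai import OpenAI

# Same base URL, dummy key and model name as in the config above.
client = OpenAI(base_url='http://localhost:23333/v1', api_key='EMPTY')
response = client.chat.completions.create(
    model='internlm3',
    messages=[{'role': 'user', 'content': 'What is 12 * 7?'}],
    temperature=0.01,
    max_tokens=64,
)
print(response.choices[0].message.content)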


@@ -0,0 +1,210 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.ARC_c.ARC_c_few_shot_ppl import \
ARC_c_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_gen_98fba6 import \
bbh_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_ppl_041cbf import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.dingo.dingo_gen import \
datasets as dingo_datasets # noqa: F401, E501
from opencompass.configs.datasets.drop.drop_gen_a2697c import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_d21e37 import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_4b5a83 import \
gpqa_datasets # noqa: F401, E501
# Corebench v1.7
from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_ppl_59c85e import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.internal_humaneval_gen_ce6b06 import \
humaneval_datasets as humaneval_v2_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.internal_humaneval_gen_d2537e import \
humaneval_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_4shot_base_gen_43d5b6 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_few_shot_mixed_4a3fd4 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import \
mmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.nq.nq_open_1shot_gen_20a989 import \
nq_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_few_shot_ppl import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_few_shot_ppl import \
BoolQ_datasets # noqa: F401, E501
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
TheoremQA_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_20a989 import \
triviaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.wikibench.wikibench_few_shot_ppl_c23d79 import \
wikibench_datasets # noqa: F401, E501
from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
winogrande_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b import \
models as hf_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b import \
models as lmdeploy_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
# Summary Groups
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
race_datasets = [race_datasets[1]] # Only take RACE-High
humaneval_v2_datasets[0]['abbr'] = 'openai_humaneval_v2'
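# The filters below keep only a subset of each benchmark's sub-tasks
# (BBH, CMMLU, MMLU, MMLU-Pro, MathBench, GaokaoBench) to keep this run small.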
bbh_datasets = [
x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
'ancient_chinese', 'chinese_civil_service_exam',
'chinese_driving_rule', 'chinese_food_culture',
'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
'chinese_teacher_qualification', 'construction_project_management',
'elementary_chinese', 'elementary_commonsense', 'ethnology',
'high_school_politics', 'modern_chinese',
'traditional_chinese_medicine'
]
]
mmlu_datasets = [
x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
'business_ethics', 'clinical_knowledge', 'college_medicine',
'global_facts', 'human_aging', 'management', 'marketing',
'medical_genetics', 'miscellaneous', 'nutrition',
'professional_accounting', 'professional_medicine', 'virology'
]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
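# The summarizer controls which metrics appear in the final report and in what order.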
summarizer = dict(
dataset_abbrs=[
'Language',
['race-high', 'accuracy'],
['ARC-c', 'accuracy'],
['BoolQ', 'accuracy'],
['triviaqa_wiki_1shot', 'score'],
['nq_open_1shot', 'score'],
'',
'General Reasoning',
['drop', 'accuracy'],
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['hellaswag', 'accuracy'],
['TheoremQA', 'score'],
['winogrande', 'accuracy'],
'',
'Math Calculation',
['gsm8k', 'accuracy'],
['GaokaoBench', 'weighted_average'],
'GaokaoBench_2010-2022_Math_II_MCQs',
'GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank',
['math', 'accuracy'],
['Mathbench', 'naive_average'],
'',
'Knowledge',
['wikibench-wiki-single_choice_cncircular', 'perf_4'],
['cmmlu', 'naive_average'],
['mmlu', 'naive_average'],
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['openai_humaneval_v2', 'humaneval_pass@1'],
['sanitized_mbpp', 'score'],
'',
['dingo_en_192', 'score'],
['dingo_zh_170', 'score'],
'',
'mmlu',
'mmlu-stem',
'mmlu-social-science',
'mmlu-humanities',
['mmlu-other', 'accuracy'],
'',
'cmmlu',
'cmmlu-stem',
'cmmlu-social-science',
'cmmlu-humanities',
'cmmlu-other',
['cmmlu-china-specific', 'accuracy'],
'',
'mmlu_pro',
'mmlu_pro_biology',
'mmlu_pro_business',
'mmlu_pro_chemistry',
'mmlu_pro_computer_science',
'mmlu_pro_economics',
'mmlu_pro_engineering',
'mmlu_pro_health',
'mmlu_pro_history',
'mmlu_pro_law',
'mmlu_pro_math',
'mmlu_pro_philosophy',
'mmlu_pro_physics',
'mmlu_pro_psychology',
'mmlu_pro_other',
'',
'bbh-logical_deduction_seven_objects',
'bbh-multistep_arithmetic_two',
'###### MathBench-A: Application Part ######',
'college',
'high',
'middle',
'primary',
'arithmetic',
'mathbench-a (average)',
'###### MathBench-T: Theory Part ######',
'college_knowledge',
'high_knowledge',
'middle_knowledge',
'primary_knowledge',
'mathbench-t (average)',
],
summary_groups=summary_groups,
)
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
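# Limit every dataset to its first 16 samples, append a '_fullbench' suffix to each
# model abbr, and force batch size 1 for LMDeploy/TurboMind backends.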
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:16]'
for m in models:
    m['abbr'] = m['abbr'] + '_fullbench'
    if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
        m['engine_config']['max_batch_size'] = 1
        m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])


@@ -0,0 +1,129 @@
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_ppl import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
winogrande_datasets # noqa: F401, E501
# read hf models - chat models
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b import \
models as lmdeploy_glm4_9b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_7b_base import \
models as hf_deepseek_7b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_7b_base import \
models as lmdeploy_deepseek_7b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_67b_base import \
models as lmdeploy_deepseek_67b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2 import \
lmdeploy_deepseek_v2_model # noqa: F401, E501
from opencompass.configs.models.deepseek.vllm_deepseek_moe_16b_base import \
models as vllm_deepseek_moe_16b_base_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_2b import \
models as hf_gemma2_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_9b import \
models as hf_gemma2_9b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_2b import \
models as hf_gemma_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_7b import \
models as hf_gemma_7b_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_9b import \
models as lmdeploy_gemma_9b_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_2b import \
models as vllm_gemma_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_7b import \
models as vllm_gemma_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b import \
models as hf_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_7b import \
models as hf_internlm2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_1_8b import \
models as lmdeploy_internlm2_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b import \
models as lmdeploy_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_7b import \
models as lmdeploy_internlm2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_20b import \
models as lmdeploy_internlm2_20b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_base_7b import \
models as lmdeploy_internlm2_base_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_base_20b import \
models as lmdeploy_internlm2_base_20b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama2_7b import \
models as hf_llama2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_1_8b import \
models as hf_llama3_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_8b import \
models as hf_llama3_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b import \
models as lmdeploy_llama3_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b import \
models as lmdeploy_llama3_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_70b import \
models as lmdeploy_llama3_70b_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_v0_3 import \
models as hf_mistral_7b_v0_3_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_7b import \
models as hf_qwen_2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_14b import \
models as hf_qwen_2_5_14b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_32b import \
models as hf_qwen_2_5_32b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b import \
models as lmdeploy_qwen2_5_1_5b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b import \
models as lmdeploy_qwen2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b import \
models as lmdeploy_qwen2_5_32b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b import \
models as lmdeploy_qwen2_5_72b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen1_5_moe_a2_7b import \
models as hf_qwen1_5_moe_a2_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_0_5b import \
models as hf_qwen2_0_5b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_1_5b import \
models as hf_qwen2_1_5b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_7b import \
models as hf_qwen2_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b import \
models as lmdeploy_qwen2_1_5b_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b import \
models as lmdeploy_qwen2_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen.vllm_qwen1_5_0_5b import \
models as vllm_qwen1_5_0_5b_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_6b import \
models as hf_yi_1_5_6b_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_9b import \
models as hf_yi_1_5_9b_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b import \
models as lmdeploy_yi_1_5_9b_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
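# Keep only RACE-High (race_datasets[1]), consistent with the fullbench config above.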
race_datasets = [race_datasets[1]]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:32]'
for m in models:
    if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
        m['engine_config']['max_batch_size'] = 1
        m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
summarizer = dict(
dataset_abbrs=[
['gsm8k', 'accuracy'],
['GPQA_diamond', 'accuracy'],
['race-high', 'accuracy'],
['winogrande', 'accuracy'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)


@@ -0,0 +1,193 @@
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_gen import \
race_datasets # noqa: F401, E501
# read hf models - chat models
from opencompass.configs.models.chatglm.hf_glm4_9b_chat import \
models as hf_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
models as lmdeploy_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.vllm_glm4_9b_chat import \
models as vllm_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_7b_chat import \
models as hf_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_67b_chat import \
models as lmdeploy_deepseek_67b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_llama_8b import \
models as \
lmdeploy_deepseek_r1_distill_llama_8b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_llama_70b import \
models as \
lmdeploy_deepseek_r1_distill_llama_70b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_qwen_1_5b import \
models as \
lmdeploy_deepseek_r1_distill_qwen_1_5b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_qwen_32b import \
models as \
lmdeploy_deepseek_r1_distill_qwen_32b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_5_1210 import \
models as lmdeploy_deepseek_v2_5_1210_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_lite import \
models as lmdeploy_deepseek_v2_lite_model # noqa: F401, E501
from opencompass.configs.models.deepseek.vllm_deepseek_7b_chat import \
models as vllm_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_2b_it import \
models as hf_gemma2_2b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_9b_it import \
models as hf_gemma2_9b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_27b_it import \
models as hf_gemma2_27b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_2b_it import \
models as hf_gemma_2b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_7b_it import \
models as hf_gemma_7b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import \
models as lmdeploy_gemma_9b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import \
models as lmdeploy_gemma_27b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_7b_it import \
models as vllm_gemma_7b_it_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_20b_chat import \
models as hf_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm3_8b_instruct import \
models as hf_internlm3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
models as lmdeploy_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import \
models as lmdeploy_internlm2_chat_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b_sft import \
models as lmdeploy_internlm2_chat_1_8b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
models as lmdeploy_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b_sft import \
models as lmdeploy_internlm2_chat_7b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm3_8b_instruct import \
models as lmdeploy_internlm3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.vllm_internlm2_chat_7b import \
models as vllm_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_1_8b_instruct import \
models as hf_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_2_3b_instruct import \
models as hf_llama3_2_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
models as hf_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama2_7b_chat import \
models as lmdeploy_llama2_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
models as lmdeploy_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_2_3b_instruct import \
models as lmdeploy_llama3_2_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_3_70b_instruct import \
models as lmdeploy_llama3_3_70b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
models as lmdeploy_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_2 import \
models as hf_mistral_7b_instruct_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_3 import \
models as hf_mistral_7b_instruct_v0_3_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_nemo_instruct_2407 import \
models as hf_mistral_nemo_instruct_2407_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_small_instruct_2409 import \
models as hf_mistral_small_instruct_2409_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_large_instruct_2411 import \
models as \
lmdeploy_mistral_large_instruct_2411_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_nemo_instruct_2407 import \
models as lmdeploy_mistral_nemo_instruct_2407_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_small_instruct_2409 import \
models as \
lmdeploy_mistral_small_instruct_2409_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mixtral_8x22b_instruct_v0_1 import \
models as \
lmdeploy_mixtral_8x22b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_instruct_v0_1 import \
models as vllm_mistral_7b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_instruct_v0_2 import \
models as vllm_mistral_7b_instruct_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mixtral_8x22b_instruct_v0_1 import \
models as vllm_mixtral_8x22b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.nvidia.lmdeploy_nemotron_70b_instruct_hf import \
models as lmdeploy_nemotron_70b_instruct_hf_model # noqa: F401, E501
from opencompass.configs.models.phi.hf_phi_4 import \
models as hf_phi_4_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_0_5b_instruct import \
models as hf_qwen2_5_0_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_3b_instruct import \
models as hf_qwen2_5_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_14b_instruct import \
models as hf_qwen2_5_14b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import \
models as lmdeploy_qwen2_5_0_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import \
models as lmdeploy_qwen2_5_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import \
models as lmdeploy_qwen2_5_14b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import \
models as lmdeploy_qwen2_5_72b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen1_5_0_5b_chat import \
models as hf_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_1_5b_instruct import \
models as hf_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_7b_instruct import \
models as hf_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import \
models as lmdeploy_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
models as lmdeploy_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.vllm_qwen1_5_0_5b_chat import \
models as vllm_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import \
models as hf_yi_1_5_6b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_9b_chat import \
models as hf_yi_1_5_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_6b_chat import \
models as lmdeploy_yi_1_5_6b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import \
models as lmdeploy_yi_1_5_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_34b_chat import \
models as lmdeploy_yi_1_5_34b_chat_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
hf_glm4_9b_chat_model[0]['path'] = 'THUDM/glm-4-9b-chat-hf'
race_datasets = [race_datasets[1]]
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
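# Map OpenCompass conversation roles (HUMAN/BOT/SYSTEM) onto the chat roles
# expected by OpenAI-style API backends.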
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
for d in datasets:
    d['reader_cfg']['test_range'] = '[0:32]'
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
    if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
        m['engine_config']['max_batch_size'] = 1
        m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
summarizer = dict(
dataset_abbrs=[
'gsm8k',
'race-middle',
'race-high',
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)


@@ -0,0 +1,317 @@
from mmengine.config import read_base
with read_base():
# read hf models - chat models
# Dataset
from opencompass.configs.datasets.aime2024.aime2024_gen_6e39a4 import \
aime2024_datasets # noqa: F401, E501
from opencompass.configs.datasets.ARC_c.ARC_c_cot_gen_926652 import \
ARC_c_datasets # noqa: F401, E501
# remove because of oom
# from opencompass.configs.datasets.ARC_Prize_Public_Evaluation.arc_prize_public_evaluation_gen_872059 import arc_prize_public_evaluation_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import \
bbh_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_complete_gen_faf748 import \
bigcodebench_hard_complete_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_instruct_gen_8815eb import \
bigcodebench_hard_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmo_fib.cmo_fib_gen_ace24b import \
cmo_fib_datasets # noqa: F401, E501
from opencompass.configs.datasets.drop.drop_openai_simple_evals_gen_3857b0 import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.ds1000.ds1000_service_eval_gen_cbc84f import \
ds1000_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets # noqa: F401, E501
# new datasets in Fullbench v1.1
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_6e39a4 import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import \
humaneval_datasets # noqa: F401, E501
from opencompass.configs.datasets.humanevalx.humanevalx_gen_3d84a3 import \
humanevalx_datasets # noqa: F401, E501
from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
ifeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.korbench.korbench_single_0_shot_gen import \
korbench_0shot_single_datasets # noqa: F401, E501
from opencompass.configs.datasets.livecodebench.livecodebench_gen_b2b0fd import \
LCB_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_0shot_gen_11c4b5 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_gen_50a320 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import \
mmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmmlu_lite.mmmlu_lite_gen_c51a84 import \
mmmlu_lite_datasets # noqa: F401, E501
from opencompass.configs.datasets.musr.musr_gen_3622bb import \
musr_datasets # noqa: F401, E501
from opencompass.configs.datasets.nq.nq_open_1shot_gen_2e45e5 import \
nq_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_cot_gen_d95929 import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.scicode.scicode_gen_085b98 import \
SciCode_datasets # noqa: F401, E501
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_cot_gen_1d56df import \
BoolQ_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_en_gen_1ac254 import \
teval_datasets as teval_en_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_zh_gen_1ac254 import \
teval_datasets as teval_zh_datasets # noqa: F401, E501
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
TheoremQA_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_bc5f21 import \
triviaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.wikibench.wikibench_gen_0978ad import \
wikibench_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
    # Summary Groups
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.ds1000 import \
ds1000_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.humanevalx import \
humanevalx_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.korbench import \
korbench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.musr_average import \
summarizer as musr_summarizer # noqa: F401, E501
from opencompass.configs.summarizers.groups.scicode import \
scicode_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.teval import \
teval_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.mmmlu_lite import \
mmmlu_summary_groups # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
# For HumanEval-X Evaluation
# Apply the evaluator ip_address and port
race_datasets = [race_datasets[1]]
for item in humanevalx_datasets:
item['eval_cfg']['evaluator'][
'ip_address'] = 'codeeval.opencompass.org.cn/humanevalx'
item['eval_cfg']['evaluator']['port'] = ''
# For DS-1000 Evaluation
# Apply the evaluator ip_address and port
for item in ds1000_datasets:
item['eval_cfg']['evaluator'][
'ip_address'] = 'codeeval.opencompass.org.cn/ds1000'
item['eval_cfg']['evaluator']['port'] = ''
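# The filters below keep only a handful of sub-tasks per benchmark,
# presumably to keep this fullbench regression config small and fast.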
bbh_datasets = [
x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
'ancient_chinese', 'chinese_civil_service_exam',
'chinese_driving_rule', 'chinese_food_culture',
'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
'chinese_teacher_qualification', 'construction_project_management',
'elementary_chinese', 'elementary_commonsense', 'ethnology',
'high_school_politics', 'modern_chinese',
'traditional_chinese_medicine'
]
]
mmlu_datasets = [
x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
'business_ethics', 'clinical_knowledge', 'college_medicine',
'global_facts', 'human_aging', 'management', 'marketing',
'medical_genetics', 'miscellaneous', 'nutrition',
'professional_accounting', 'professional_medicine', 'virology'
]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mmmlu_lite_datasets = [
x for x in mmmlu_lite_datasets if 'mmlu_lite_AR-XY' in x['abbr']
]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]
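# Gather every `*_datasets` variable except SciCode and T-Eval; T-Eval is
# appended explicitly afterwards, while SciCode stays disabled (see the
# commented-out line below).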
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')
and 'scicode' not in k.lower() and 'teval' not in k),
[],
)
datasets += teval_en_datasets
datasets += teval_zh_datasets
# datasets += SciCode_datasets
musr_summary_groups = musr_summarizer['summary_groups']
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
# Summarizer
summarizer = dict(
dataset_abbrs=[
'Language',
['race-high', 'accuracy'],
['ARC-c', 'accuracy'],
['BoolQ', 'accuracy'],
['triviaqa_wiki_1shot', 'score'],
['nq_open_1shot', 'score'],
['mmmlu_lite', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['drop', 'accuracy'],
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['hellaswag', 'accuracy'],
['TheoremQA', 'score'],
['musr_average', 'naive_average'],
['korbench_single', 'naive_average'],
['ARC_Prize_Public_Evaluation', 'accuracy'],
'',
'Math Calculation',
['gsm8k', 'accuracy'],
['GaokaoBench', 'weighted_average'],
['math', 'accuracy'],
['cmo_fib', 'accuracy'],
['aime2024', 'accuracy'],
['Mathbench', 'naive_average'],
'',
'Knowledge',
['wikibench-wiki-single_choice_cncircular', 'perf_4'],
['cmmlu', 'naive_average'],
['mmlu', 'naive_average'],
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['sanitized_mbpp', 'score'],
['humanevalx', 'naive_average'],
['ds1000', 'naive_average'],
['lcb_code_generation', 'pass@1'],
['lcb_code_execution', 'pass@1'],
['lcb_test_output', 'pass@1'],
['bigcodebench_hard_instruct', 'pass@1'],
['bigcodebench_hard_complete', 'pass@1'],
'',
'Agent',
['teval', 'naive_average'],
['SciCode', 'accuracy'],
['SciCode', 'sub_accuracy'],
'',
'bbh-logical_deduction_seven_objects',
'bbh-multistep_arithmetic_two',
'',
'mmlu',
'mmlu-stem',
'mmlu-social-science',
'mmlu-humanities',
'mmlu-other',
'',
'cmmlu',
'cmmlu-stem',
'cmmlu-social-science',
'cmmlu-humanities',
'cmmlu-other',
'cmmlu-china-specific',
'',
'mmlu_pro',
'mmlu_pro_biology',
'mmlu_pro_business',
'mmlu_pro_chemistry',
'mmlu_pro_computer_science',
'mmlu_pro_economics',
'mmlu_pro_engineering',
'mmlu_pro_health',
'mmlu_pro_history',
'mmlu_pro_law',
'mmlu_pro_math',
'mmlu_pro_philosophy',
'mmlu_pro_physics',
'mmlu_pro_psychology',
'mmlu_pro_other',
'',
'ds1000_Pandas',
'ds1000_Numpy',
'ds1000_Tensorflow',
'ds1000_Scipy',
'ds1000_Sklearn',
'ds1000_Pytorch',
'ds1000_Matplotlib',
'',
'mmmlu_lite',
'openai_mmmlu_lite_AR-XY',
'openai_mmmlu_lite_BN-BD',
'openai_mmmlu_lite_DE-DE',
'openai_mmmlu_lite_ES-LA',
'openai_mmmlu_lite_FR-FR',
'openai_mmmlu_lite_HI-IN',
'openai_mmmlu_lite_ID-ID',
'openai_mmmlu_lite_IT-IT',
'openai_mmmlu_lite_JA-JP',
'openai_mmmlu_lite_KO-KR',
'openai_mmmlu_lite_PT-BR',
'openai_mmmlu_lite_SW-KE',
'openai_mmmlu_lite_YO-NG',
'openai_mmmlu_lite_ZH-CN',
'',
'###### MathBench-A: Application Part ######',
'college',
'high',
'middle',
'primary',
'arithmetic',
'mathbench-a (average)',
'###### MathBench-T: Theory Part ######',
'college_knowledge',
'high_knowledge',
'middle_knowledge',
'primary_knowledge',
'mathbench-t (average)',
],
summary_groups=summary_groups,
)
for d in datasets:
d['reader_cfg']['test_range'] = '[0:16]'
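# Suffix every model abbr with '_fullbench' so results match the keys in
# oc_score_baseline_fullbench.yaml, and (presumably for reproducibility)
# force lmdeploy/turbomind engines down to batch size 1.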
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
m['abbr'] = m['abbr'] + '_fullbench'
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])


@@ -0,0 +1,182 @@
from copy import deepcopy
from mmengine.config import read_base
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
with read_base():
# read hf models - chat models
# Dataset
from opencompass.configs.datasets.chinese_simpleqa.chinese_simpleqa_gen import \
csimpleqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.SimpleQA.simpleqa_gen_0283c3 import \
        simpleqa_datasets  # noqa: F401, E501
from opencompass.configs.datasets.subjective.alignbench.alignbench_v1_1_judgeby_critiquellm_new import \
alignbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4_new import \
alpacav2_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare_new import \
arenahard_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.compassarena.compassarena_compare_new import \
compassarena_datasets # noqa: F401, E501
# from opencompass.configs.datasets.subjective.fofo.fofo_bilingual_judge_new import fofo_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.followbench.followbench_llmeval_new import \
followbench_llmeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.multiround.mtbench101_judge_new import \
mtbench101_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge_new import \
wildbench_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
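# Gather every subjective `*_datasets` variable except MT-Bench-101 and
# WildBench, then append those two explicitly at the end.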
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')
and 'mtbench101' not in k and 'wildbench' not in k), [])
datasets += mtbench101_datasets # noqa: F401, E501
datasets += wildbench_datasets # noqa: F401, E501
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
m['abbr'] = m['abbr'] + '_fullbench'
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
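# A copy of the second model (after sorting by GPU count) doubles as the
# judge for the subjective evaluation, tagged with a '-judge' suffix.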
judge_models = deepcopy([models[1]])
judge_models[0]['abbr'] = judge_models[0]['abbr'] + '-judge'
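# Subjective evaluation setup: the naive partitioner pairs each model with
# the judge model(s), and tasks run locally with up to 16 workers.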
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summary_groups = []
summary_groups.append({
'name': 'compassarena_language',
'subsets': [
['compassarena_language', '内容总结'],
],
})
summary_groups.append({
'name': 'compassarena_knowledge',
'subsets': [
['compassarena_knowledge', '生活常识_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_reason_v2',
'subsets': [
['compassarena_reason_v2', 'reasoning'],
],
})
summary_groups.append({
'name': 'compassarena_math_v2',
'subsets': [
['compassarena_math_v2', '高等数学_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_creationv2_zh',
'subsets': [
['compassarena_creationv2_zh', '内容扩写_ZH'],
],
})
summary_groups.append({
'name':
'CompassArena',
'subsets': [
'compassarena_language',
'compassarena_knowledge',
'compassarena_reason_v2',
'compassarena_math_v2',
'compassarena_creationv2_zh',
],
})
summary_groups.append({
'name':
'FoFo',
'subsets': [['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall']],
})
summary_groups.append({
'name':
'Followbench',
'subsets': [
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
],
})
# Summarizer
summarizer = dict(
dataset_abbrs=[
['alignment_bench_v1_1', '总分'],
['alpaca_eval', 'total'],
['arenahard', 'score'],
['Followbench', 'naive_average'],
['CompassArena', 'naive_average'],
['FoFo', 'naive_average'],
['mtbench101', 'avg'],
['wildbench', 'average'],
['simpleqa', 'accuracy_given_attempted'],
['chinese_simpleqa', 'given_attempted_accuracy'],
'',
['alignment_bench_v1_1', '专业能力'],
['alignment_bench_v1_1', '数学计算'],
['alignment_bench_v1_1', '基本任务'],
['alignment_bench_v1_1', '逻辑推理'],
['alignment_bench_v1_1', '中文理解'],
['alignment_bench_v1_1', '文本写作'],
['alignment_bench_v1_1', '角色扮演'],
['alignment_bench_v1_1', '综合问答'],
['alpaca_eval', 'helpful_base'],
['alpaca_eval', 'koala'],
['alpaca_eval', 'oasst'],
['alpaca_eval', 'selfinstruct'],
['alpaca_eval', 'vicuna'],
['compassarena_language', 'naive_average'],
['compassarena_knowledge', 'naive_average'],
['compassarena_reason_v2', 'naive_average'],
['compassarena_math_v2', 'naive_average'],
['compassarena_creationv2_zh', 'naive_average'],
['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall'],
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
['followbench_llmeval_en', 'HSR_L1'],
['followbench_llmeval_en', 'HSR_L2'],
['followbench_llmeval_en', 'HSR_L3'],
['followbench_llmeval_en', 'HSR_L4'],
['followbench_llmeval_en', 'HSR_L5'],
['followbench_llmeval_en', 'SSR_L1'],
['followbench_llmeval_en', 'SSR_L2'],
['followbench_llmeval_en', 'SSR_L3'],
['followbench_llmeval_en', 'SSR_L4'],
['followbench_llmeval_en', 'SSR_L5'],
['simpleqa', 'f1'],
],
type=DefaultSubjectiveSummarizer,
summary_groups=summary_groups,
)

.github/scripts/oc_score_assert.py vendored Normal file

@@ -0,0 +1,383 @@
import csv
import os
import pytest
import yaml
output_path = 'regression_result_daily'
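# Rough flow of this module: the fixtures below load baseline scores from the
# YAML files under .github/scripts/, `result_scores` parses the newest
# summary CSV produced by the daily run, and every test compares measured
# scores against the baseline via `assert_score` (defined at the bottom).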
def model_list(type):
config_path = '.github/scripts/oc_score_baseline_testrange.yaml'
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config.get(type).keys()
def dataset_list(model, type):
config_path = '.github/scripts/oc_score_baseline_fullbench.yaml'
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config.get(model).get(type).keys()
@pytest.fixture()
def baseline_scores_testrange(request):
config_path = os.path.join(
request.config.rootdir,
'.github/scripts/oc_score_baseline_testrange.yaml')
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config
@pytest.fixture()
def baseline_scores(request):
config_path = os.path.join(request.config.rootdir,
'.github/scripts/oc_score_baseline.yaml')
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config
@pytest.fixture()
def baseline_scores_fullbench(request):
config_path = os.path.join(
request.config.rootdir,
'.github/scripts/oc_score_baseline_fullbench.yaml')
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config
@pytest.fixture()
def result_scores():
file = find_csv_files(output_path)
if file is None:
return None
return read_csv_file(file)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_testrange')
@pytest.mark.chat_models
class TestChat:
"""Test cases for chat model."""
@pytest.mark.parametrize(
'model, dataset', [(p1, p2) for p1 in model_list('chat')
for p2 in ['gsm8k_accuracy', 'race-high_accuracy']])
def test_model_dataset_score(self, baseline_scores_testrange,
result_scores, model, dataset):
base_score = baseline_scores_testrange.get('chat').get(model).get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_testrange')
@pytest.mark.base_models
class TestBase:
"""Test cases for base model."""
@pytest.mark.parametrize('model, dataset',
[(p1, p2) for p1 in model_list('base') for p2 in [
'gsm8k_accuracy', 'GPQA_diamond_accuracy',
'race-high_accuracy', 'winogrande_accuracy'
]])
def test_model_dataset_score(self, baseline_scores_testrange,
result_scores, model, dataset):
if model in ['gemma-2b-vllm', 'gemma-7b-vllm'
] and dataset != 'gsm8k_accuracy':
return
base_score = baseline_scores_testrange.get('base').get(model).get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.chat_obj_fullbench
class TestChatObjFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-hf_fullbench',
'internlm2_5-7b-chat-turbomind_fullbench'
] for p2 in dataset_list('internlm2_5-7b-chat-hf_fullbench', 'objective')])
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.chat_sub_fullbench
class TestChatSubFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-hf_fullbench',
'internlm2_5-7b-chat-turbomind_fullbench'
] for p2 in dataset_list('internlm2_5-7b-chat-hf_fullbench', 'subjective')]
)
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'subjective').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.base_fullbench
class TestBaseFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in
['internlm2_5-7b-hf_fullbench', 'internlm2_5-7b-turbomind_fullbench']
for p2 in dataset_list('internlm2_5-7b-hf_fullbench', 'objective')])
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
@pytest.mark.api
class TestApibench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset',
[('lmdeploy-api-test', 'race-middle_accuracy'),
('lmdeploy-api-test', 'race-high_accuracy'),
('lmdeploy-api-test', 'gsm8k_accuracy')])
def test_api(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.volc_fullbench
class TestVolcFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-turbomind', 'qwen2.5-7b-instruct-turbomind',
'internlm2_5-7b-chat-pytorch', 'qwen2.5-7b-instruct-pytorch',
'internlm3-8b-instruct-turbomind', 'internlm3-8b-instruct-pytorch'
] for p2 in dataset_list(p1, 'objective')])
@pytest.mark.chat_objective
def test_chat_objective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize('model, dataset', [
(p1, p2) for p1 in ['internlm2_5-7b-chat-turbomind']
for p2 in dataset_list('internlm2_5-7b-chat-turbomind', 'subjective')
])
@pytest.mark.chat_subjective
def test_chat_subjective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'subjective').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in ['internlm2_5-7b-turbomind']
for p2 in dataset_list('internlm2_5-7b-turbomind', 'objective')])
@pytest.mark.base_objective
def test_base_objective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in ['internlm2_5-7b-turbomind']
for p2 in dataset_list('internlm2_5-7b-turbomind', 'long_context')])
@pytest.mark.base_long_context
def test_base_long_context(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'long_context').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2)
for p1 in ['internlm2_5-7b-chat-1m-turbomind'] for p2 in dataset_list(
'internlm2_5-7b-chat-1m-turbomind', 'long_context')])
@pytest.mark.chat_long_context
def test_chat_long_context(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'long_context').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
class TestCmdCase:
@pytest.mark.case1
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b-hf', 'race-middle_accuracy'),
('internlm2_5-7b-hf', 'race-high_accuracy'),
('internlm2_5-7b-hf', 'demo_gsm8k_accuracy')])
def test_cmd_case1(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.case2
@pytest.mark.parametrize(
'model, dataset',
[('internlm2_5-7b-chat-lmdeploy', 'race-middle_accuracy'),
('internlm2_5-7b-chat-lmdeploy', 'race-high_accuracy'),
('internlm2_5-7b-chat-lmdeploy', 'demo_gsm8k_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'race-middle_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'race-high_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'demo_gsm8k_accuracy')])
def test_cmd_case2(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.case3
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b_hf', 'race-middle_accuracy'),
('internlm2_5-7b_hf', 'race-high_accuracy'),
('internlm2_5-7b_hf', 'demo_gsm8k_accuracy')])
def test_cmd_case3(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.case4
@pytest.mark.parametrize(
'model, dataset',
[('internlm3-8b-instruct_hf-lmdeploy', 'race-middle_accuracy'),
('internlm3-8b-instruct_hf-lmdeploy', 'race-high_accuracy'),
('internlm3-8b-instruct_hf-lmdeploy', 'demo_gsm8k_accuracy')])
def test_cmd_case4(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.case5
@pytest.mark.parametrize(
'model, dataset',
[('internlm3-8b-instruct_hf-vllm', 'race-middle_accuracy'),
('internlm3-8b-instruct_hf-vllm', 'race-high_accuracy'),
('internlm3-8b-instruct_hf-vllm', 'demo_gsm8k_accuracy')])
def test_cmd_case5(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
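# Tolerance rules, as implemented below: models without '_batch' in the abbr
# must match the baseline within +/-0.01; batched runs allow +/-3 by default,
# +/-5 for noisier datasets (dingo*, GPQA*, high*, mmlu_pro_*, alpaca_eval*,
# compassarena_*) and +/-10 for humanevalx*.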
def assert_score(model_type, score, baseline, dataset: str = ''):
if score is None or score == '-':
assert False, 'value is none'
if 'batch' not in model_type:
if float(score) <= (baseline + 0.01) and float(score) >= (baseline -
0.01):
print(' '.join([score, 'is equal', str(baseline)]))
assert True
else:
print(' '.join([score, 'is not equal', str(baseline)]))
assert False, ' '.join([score, 'is not equal', str(baseline)])
else:
if dataset.startswith('dingo') or dataset.startswith(
'GPQA') or dataset.startswith('high') or dataset.startswith(
'mmlu_pro_') or dataset.startswith(
'alpaca_eval') or dataset.startswith('compassarena_'):
threshold = 5
elif dataset.startswith('humanevalx') or dataset == 'large_threshold':
threshold = 10
else:
threshold = 3
if float(score) <= (baseline + threshold) and float(score) >= (
baseline - threshold):
print(' '.join([
score, 'is between',
str(baseline - threshold), 'and',
str(baseline + threshold)
]))
assert True
else:
print(' '.join([
score, 'is not between',
str(baseline - threshold), 'and',
str(baseline + threshold)
]))
assert False, ' '.join([
score, 'is not between',
str(baseline - threshold), 'and',
str(baseline + threshold)
])
def find_csv_files(directory):
csv_files = []
for root, dirs, files in os.walk(directory):
for file in files:
if file.endswith('.csv') and file.startswith('summary'):
csv_files.append(os.path.join(root, file))
    if not csv_files:
        return None
    csv_files_with_time = {f: os.path.getctime(f) for f in csv_files}
    sorted_csv_files = sorted(csv_files_with_time.items(), key=lambda x: x[1])
    latest_csv_file = sorted_csv_files[-1][0]
    return latest_csv_file
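# Pivot the summary CSV into {model: {'<dataset>_<metric>': score}}, skipping
# rows with no metric, a bare '_' metric, or any bpb metric.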
def read_csv_file(file_path):
with open(file_path, 'r') as csvfile:
reader = csv.DictReader(csvfile)
filtered_data = []
for row in reader:
if row['metric'] is not None and 'bpb' not in row[
'metric'] and '_' != row['metric']:
filtered_row = row
filtered_row['dataset'] = row['dataset'] + '_' + row['metric']
del filtered_row['version']
del filtered_row['metric']
del filtered_row['mode']
filtered_data.append(filtered_row)
result = {}
for data in filtered_data:
dataset = data.get('dataset')
for key in data.keys():
if key == 'dataset':
continue
else:
if key in result.keys():
result.get(key)[dataset] = data.get(key)
else:
result[key] = {dataset: data.get(key)}
return result

.github/scripts/oc_score_baseline.yaml vendored Normal file

@@ -0,0 +1,39 @@
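# Baseline scores for the command-line and API test cases: top-level keys are
# model abbrs, leaf keys use the '<dataset>_<metric>' naming produced by the
# summary CSV (read by the `baseline_scores` fixture in oc_score_assert.py).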
internlm2_5-7b-hf:
demo_gsm8k_accuracy: 42.19
race-middle_accuracy: 91.78
race-high_accuracy: 90.02
internlm2_5-7b_hf:
demo_gsm8k_accuracy: 42.19
race-middle_accuracy: 91.78
race-high_accuracy: 90.02
internlm2_5-7b-chat-lmdeploy:
demo_gsm8k_accuracy: 84.38
race-middle_accuracy: 92.76
race-high_accuracy: 90.54
internlm3-8b-instruct-lmdeploy:
demo_gsm8k_accuracy: 73.44
race-middle_accuracy: 93.38
race-high_accuracy: 90.34
internlm3-8b-instruct_hf-lmdeploy:
demo_gsm8k_accuracy: 73.44
race-middle_accuracy: 93.38
race-high_accuracy: 90.34
internlm3-8b-instruct_hf-vllm:
demo_gsm8k_accuracy: 78.12
race-middle_accuracy: 92.20
race-high_accuracy: 89.88
internlm2_5-7b-chat_hf:
demo_gsm8k_accuracy: 87.50
race-middle_accuracy: 92.76
race-high_accuracy: 90.48
lmdeploy-api-test:
gsm8k_accuracy: 68.75
race-middle_accuracy: 93.75
race-high_accuracy: 93.75


@@ -0,0 +1,983 @@
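# Fullbench baselines: model abbr -> category (objective / subjective /
# long_context) -> '<dataset>_<metric>' score, read by the
# `baseline_scores_fullbench` fixture in oc_score_assert.py.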
internlm2_5-7b-chat-hf_fullbench:
objective:
race-high_accuracy: 93.75
ARC-c_accuracy: 93.75
BoolQ_accuracy: 81.25
triviaqa_wiki_1shot_score: 50
nq_open_1shot_score: 25
IFEval_Prompt-level-strict-accuracy: 50
drop_accuracy: 81.25
GPQA_diamond_accuracy: 25
hellaswag_accuracy: 87.5
TheoremQA_score: 12.50
musr_average_naive_average: 39.58
korbench_single_naive_average: 40
gsm8k_accuracy: 62.50
math_accuracy: 75
cmo_fib_accuracy: 6.25
aime2024_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 50
sanitized_mbpp_score: 68.75
ds1000_naive_average: 16.96
lcb_code_generation_pass@1: 12.5
lcb_code_execution_pass@1: 43.75
lcb_test_output_pass@1: 18.75
bbh-logical_deduction_seven_objects_score: 50
bbh-multistep_arithmetic_two_score: 68.75
mmlu-other_accuracy: 72.6
cmmlu-china-specific_accuracy: 76.25
mmlu_pro_math_accuracy: 25
ds1000_Pandas_accuracy: 12.5
ds1000_Numpy_accuracy: 0
ds1000_Tensorflow_accuracy: 12.5
ds1000_Scipy_accuracy: 18.75
ds1000_Sklearn_accuracy: 18.75
ds1000_Pytorch_accuracy: 12.5
ds1000_Matplotlib_accuracy: 43.75
openai_mmmlu_lite_AR-XY_accuracy: 37.5
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
subjective:
alignment_bench_v1_1_总分: 0.66
alpaca_eval_total: 20.00
arenahard_score: 56.82
Followbench_naive_average: 1
CompassArena_naive_average: 43
mtbench101_avg: 7.60
wildbench_average: -14.58
simpleqa_accuracy_given_attempted: 1.00
chinese_simpleqa_given_attempted_accuracy: 0.90
alignment_bench_v1_1_专业能力: 7.90
alignment_bench_v1_1_数学计算: 0
alignment_bench_v1_1_基本任务: 0
alignment_bench_v1_1_逻辑推理: 0
alignment_bench_v1_1_中文理解: 0
alignment_bench_v1_1_文本写作: 0
alignment_bench_v1_1_角色扮演: 0
alignment_bench_v1_1_综合问答: 0
alpaca_eval_helpful_base: 20.00
compassarena_language_naive_average: 35
compassarena_knowledge_naive_average: 60.00
compassarena_reason_v2_naive_average: 40
compassarena_math_v2_naive_average: 50.00
compassarena_creationv2_zh_naive_average: 30
followbench_llmeval_en_HSR_AVG: 1
followbench_llmeval_en_SSR_AVG: 1
followbench_llmeval_en_HSR_L1: 1
followbench_llmeval_en_HSR_L2: 1
followbench_llmeval_en_HSR_L3: 1
followbench_llmeval_en_HSR_L4: 1
followbench_llmeval_en_HSR_L5: 1
followbench_llmeval_en_SSR_L1: 1
followbench_llmeval_en_SSR_L2: 1
followbench_llmeval_en_SSR_L3: 1
followbench_llmeval_en_SSR_L4: 1
followbench_llmeval_en_SSR_L5: 1
simpleqa_f1: 0.12
internlm2_5-7b-chat-turbomind_fullbench:
objective:
race-high_accuracy: 93.75
ARC-c_accuracy: 93.75
BoolQ_accuracy: 75.00
triviaqa_wiki_1shot_score: 50
nq_open_1shot_score: 25
IFEval_Prompt-level-strict-accuracy: 56.25
drop_accuracy: 75
GPQA_diamond_accuracy: 37.50
hellaswag_accuracy: 81.25
TheoremQA_score: 12.5
musr_average_naive_average: 39.58
korbench_single_naive_average: 40
gsm8k_accuracy: 68.75
math_accuracy: 68.75
cmo_fib_accuracy: 6.25
aime2024_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 25
sanitized_mbpp_score: 68.75
ds1000_naive_average: 15.18
lcb_code_generation_pass@1: 12.5
lcb_code_execution_pass@1: 43.75
lcb_test_output_pass@1: 0.00
bbh-logical_deduction_seven_objects_score: 62.50
bbh-multistep_arithmetic_two_score: 62.50
mmlu-other_accuracy: 73.08
cmmlu-china-specific_accuracy: 75.42
mmlu_pro_math_accuracy: 25.00
ds1000_Pandas_accuracy: 0.00
ds1000_Numpy_accuracy: 0
ds1000_Tensorflow_accuracy: 12.5
ds1000_Scipy_accuracy: 18.75
ds1000_Sklearn_accuracy: 18.75
ds1000_Pytorch_accuracy: 12.50
ds1000_Matplotlib_accuracy: 43.75
openai_mmmlu_lite_AR-XY_accuracy: 37.5
college_naive_average: 12.50
college_knowledge_naive_average: 87.5
subjective:
alignment_bench_v1_1_总分: 0.72
alpaca_eval_total: 20.00
arenahard_score: 55.77
Followbench_naive_average: 1
CompassArena_naive_average: 39.00
mtbench101_avg: 7.90
wildbench_average: 0.00
simpleqa_accuracy_given_attempted: 1.00
chinese_simpleqa_given_attempted_accuracy: 1
alignment_bench_v1_1_专业能力: 8.70
alignment_bench_v1_1_数学计算: 0
alignment_bench_v1_1_基本任务: 0
alignment_bench_v1_1_逻辑推理: 0
alignment_bench_v1_1_中文理解: 0
alignment_bench_v1_1_文本写作: 0
alignment_bench_v1_1_角色扮演: 0
alignment_bench_v1_1_综合问答: 0
alpaca_eval_helpful_base: 20.00
compassarena_language_naive_average: 25.00
compassarena_knowledge_naive_average: 55.00
compassarena_reason_v2_naive_average: 35.00
compassarena_math_v2_naive_average: 55.00
compassarena_creationv2_zh_naive_average: 25.00
followbench_llmeval_en_HSR_AVG: 1
followbench_llmeval_en_SSR_AVG: 1
followbench_llmeval_en_HSR_L1: 1
followbench_llmeval_en_HSR_L2: 1
followbench_llmeval_en_HSR_L3: 1
followbench_llmeval_en_HSR_L4: 1
followbench_llmeval_en_HSR_L5: 1
followbench_llmeval_en_SSR_L1: 1
followbench_llmeval_en_SSR_L2: 1
followbench_llmeval_en_SSR_L3: 1
followbench_llmeval_en_SSR_L4: 1
followbench_llmeval_en_SSR_L5: 1
simpleqa_f1: 0.12
internlm2_5-7b-hf_fullbench:
objective:
race-high_accuracy: 100
ARC-c_accuracy: 68.75
BoolQ_accuracy: 87.5
triviaqa_wiki_1shot_score: 43.75
nq_open_1shot_score: 43.75
drop_accuracy: 62.5
GPQA_diamond_accuracy: 62.5
hellaswag_accuracy: 93.75
TheoremQA_score: 18.75
winogrande_accuracy: 75
gsm8k_accuracy: 37.5
GaokaoBench_2010-2022_Math_II_MCQs_score: 62.5
GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank_score: 0
math_accuracy: 12.5
wikibench-wiki-single_choice_cncircular_perf_4: 25
sanitized_mbpp_score: 56.25
dingo_en_192_score: 37.5
dingo_zh_170_score: 100
mmlu-other_accuracy: 76.92
cmmlu-china-specific_accuracy: 84.17
mmlu_pro_math_accuracy: 18.75
bbh-logical_deduction_seven_objects_score: 43.75
bbh-multistep_arithmetic_two_score: 56.25
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
internlm2_5-7b-turbomind_fullbench:
objective:
race-high_accuracy: 100
ARC-c_accuracy: 68.75
BoolQ_accuracy: 87.5
triviaqa_wiki_1shot_score: 43.75
nq_open_1shot_score: 43.75
drop_accuracy: 62.5
GPQA_diamond_accuracy: 68.75
hellaswag_accuracy: 93.75
TheoremQA_score: 18.75
winogrande_accuracy: 87.5
gsm8k_accuracy: 62.50
GaokaoBench_2010-2022_Math_II_MCQs_score: 93.75
GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank_score: 0
math_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 0.00
sanitized_mbpp_score: 62.50
dingo_en_192_score: 37.50
dingo_zh_170_score: 100.00
mmlu-other_accuracy: 78.37
cmmlu-china-specific_accuracy: 83.33
mmlu_pro_math_accuracy: 18.75
bbh-logical_deduction_seven_objects_score: 62.50
bbh-multistep_arithmetic_two_score: 50.00
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
internlm2_5-7b-turbomind:
objective:
race-high_accuracy: 89.28
ARC-c_accuracy: 52.2
BoolQ_accuracy: 89.72
triviaqa_wiki_1shot_score: 65.88
nq_open_1shot_score: 34.82
drop_accuracy: 68.1
bbh_naive_average: 72.15
GPQA_diamond_accuracy: 32.83
hellaswag_accuracy: 88.36
TheoremQA_score: 25
winogrande_accuracy: 81.29
gsm8k_accuracy: 74.68
GaokaoBench_weighted_average: 58.19
math_accuracy: 33.98
Mathbench_naive_average: 48.38
wikibench-wiki-single_choice_cncircular_perf_4: 29.1
cmmlu_naive_average: 78.94
mmlu_naive_average: 71.44
mmlu_pro_naive_average: 38.18
openai_humaneval_humaneval_pass@1: 59.76
openai_humaneval_v2_humaneval_pass@1: 57.93
sanitized_mbpp_score: 55.25
dingo_en_192_score: 60.94
dingo_zh_170_score: 67.65
mmlu-stem_accuracy: 63.72
mmlu-social-science_accuracy: 80.15
mmlu-humanities_accuracy: 74.27
mmlu-other_accuracy: 71.85
cmmlu-stem_accuracy: 67.07
cmmlu-social-science_accuracy: 81.49
cmmlu-humanities_accuracy: 85.84
cmmlu-other_accuracy: 82.69
cmmlu-china-specific_accuracy: 79.88
mmlu_pro_biology_accuracy: 58.58
mmlu_pro_business_accuracy: 28.01
mmlu_pro_chemistry_accuracy: 22.79
mmlu_pro_computer_science_accuracy: 39.02
mmlu_pro_economics_accuracy: 53.08
mmlu_pro_engineering_accuracy: 25.7
mmlu_pro_health_accuracy: 46.94
mmlu_pro_history_accuracy: 43.04
mmlu_pro_law_accuracy: 29.7
mmlu_pro_math_accuracy: 24.2
mmlu_pro_philosophy_accuracy: 42.48
mmlu_pro_physics_accuracy: 26.02
mmlu_pro_psychology_accuracy: 52.76
mmlu_pro_other_accuracy: 42.21
college_naive_average: 7.00
high_naive_average: 6.67
middle_naive_average: 26.67
primary_naive_average: 64.00
arithmetic_naive_average: 55
mathbench-a (average)_naive_average: 31.8
college_knowledge_naive_average: 58.23
high_knowledge_naive_average: 52.51
middle_knowledge_naive_average: 71.15
primary_knowledge_naive_average: 60.48
mathbench-t (average)_naive_average: 60.19
long_context:
Single-Needle-Retrieval(S-RT)-32000_naive_average: 100
Single-Needle-Retrieval-EN-32000_naive_average: 100
Single-Needle-Retrieval-ZH-32000_naive_average: 100
Single-Needle-Retrieval(S-RT)-100000_naive_average: 100
Single-Needle-Retrieval-EN-100000_naive_average: 100
Single-Needle-Retrieval-ZH-100000_naive_average: 100
Single-Needle-Retrieval(S-RT)-200000_naive_average: 100
Single-Needle-Retrieval-EN-200000_naive_average: 100
Single-Needle-Retrieval-ZH-200000_naive_average: 100
longbench_naive_average: 46.19
longbench_zh_naive_average: 49.3
longbench_en_naive_average: 43.97
longbench_single-document-qa_score: 42.84
longbench_multi-document-qa_score: 41.25
longbench_summarization_score: 23.21
longbench_few-shot-learning_score: 61.67
longbench_synthetic-tasks_score: 60.05
longbench_code-completion_score: 52.09
internlm2_5-7b-chat-turbomind:
objective:
race-high_accuracy: 86.16
ARC-c_accuracy: 90.17
BoolQ_accuracy: 87.89
triviaqa_wiki_1shot_score: 64.91
nq_open_1shot_score: 22.69
mmmlu_lite_naive_average: 44.96
IFEval_Prompt-level-strict-accuracy: 58.04
drop_accuracy: 77.68
bbh_naive_average: 73.14
GPQA_diamond_accuracy: 31.06
hellaswag_accuracy: 94.79
TheoremQA_score: 22.25
musr_average_naive_average: 50.89
korbench_single_naive_average: 32.16
ARC_Prize_Public_Evaluation_accuracy: 0.02
gsm8k_accuracy: 86.73
GaokaoBench_weighted_average: 78.6
math_accuracy: 61
cmo_fib_accuracy: 11
aime2024_accuracy: 3.33
Mathbench_naive_average: 64.23
wikibench-wiki-single_choice_cncircular_perf_4: 31.32
cmmlu_naive_average: 74.3
mmlu_naive_average: 70.84
mmlu_pro_naive_average: 44.98
openai_humaneval_humaneval_pass@1: 69.8
sanitized_mbpp_score: 64.4
humanevalx_naive_average: 33.35
ds1000_naive_average: 14.15
lcb_code_generation_pass@1: 17.75
lcb_code_execution_pass@1: 32.57
lcb_test_output_pass@1: 26.13
bigcodebench_hard_instruct_pass@1: 3.38
bigcodebench_hard_complete_pass@1: 5.06
teval_naive_average: 80
SciCode_sub_accuracy: 5.56
qa_dingo_cn_score: 99.01
mmlu-stem_accuracy: 68.2
mmlu-social-science_accuracy: 75.8
mmlu-humanities_accuracy: 69.3
mmlu-other_accuracy: 71.3
cmmlu-stem_accuracy: 66.64
cmmlu-social-science_accuracy: 76
cmmlu-humanities_accuracy: 77.9
cmmlu-other_accuracy: 77.25
cmmlu-china-specific_accuracy: 73.6
mmlu_pro_biology_accuracy: 66.67
mmlu_pro_business_accuracy: 47.91
mmlu_pro_chemistry_accuracy: 35
mmlu_pro_computer_science_accuracy: 48.9
mmlu_pro_economics_accuracy: 55.87
mmlu_pro_engineering_accuracy: 29.62
mmlu_pro_health_accuracy: 45
mmlu_pro_history_accuracy: 40.8
mmlu_pro_law_accuracy: 25.79
mmlu_pro_math_accuracy: 53.48
mmlu_pro_philosophy_accuracy: 38.38
mmlu_pro_physics_accuracy: 37.79
mmlu_pro_psychology_accuracy: 58.39
mmlu_pro_other_accuracy: 46.27
humanevalx-python_pass@1: 53.66
humanevalx-cpp_pass@1: 22.56
humanevalx-go_pass@1: 0
humanevalx-js_pass@1: 54.88
ds1000_Pandas_accuracy: 10.65
ds1000_Numpy_accuracy: 3.63
ds1000_Tensorflow_accuracy: 13.33
ds1000_Scipy_accuracy: 8.96
ds1000_Sklearn_accuracy: 6.96
ds1000_Pytorch_accuracy: 6.62
ds1000_Matplotlib_accuracy: 49.35
openai_mmmlu_lite_AR-XY_accuracy: 17.19
openai_mmmlu_lite_BN-BD_accuracy: 26.78
openai_mmmlu_lite_DE-DE_accuracy: 51.27
openai_mmmlu_lite_ES-LA_accuracy: 56.94
openai_mmmlu_lite_FR-FR_accuracy: 58.22
openai_mmmlu_lite_HI-IN_accuracy: 30.75
openai_mmmlu_lite_ID-ID_accuracy: 50.6
openai_mmmlu_lite_IT-IT_accuracy: 50.6
openai_mmmlu_lite_JA-JP_accuracy: 51.13
openai_mmmlu_lite_KO-KR_accuracy: 45
openai_mmmlu_lite_PT-BR_accuracy: 57.68
openai_mmmlu_lite_SW-KE_accuracy: 32.56
openai_mmmlu_lite_YO-NG_accuracy: 32.42
openai_mmmlu_lite_ZH-CN_accuracy: 65.4
college_naive_average: 19.17
high_naive_average: 46.5
middle_naive_average: 61.34
primary_naive_average: 73.34
arithmetic_naive_average: 61.67
mathbench-a (average)_naive_average: 52.58
college_knowledge_naive_average: 67.1
high_knowledge_naive_average: 70
middle_knowledge_naive_average: 80
primary_knowledge_naive_average: 90.12
mathbench-t (average)_naive_average: 76
subjective:
alignment_bench_v1_1_总分: 5.68
alpaca_eval_total: 25.96
arenahard_score: 17.15
Followbench_naive_average: 0.81
CompassArena_naive_average: 39.49
FoFo_naive_average: 0.38
mtbench101_avg: 8.01
wildbench_average: -10.49
simpleqa_accuracy_given_attempted: 0.04
chinese_simpleqa_given_attempted_accuracy: 0.34
alignment_bench_v1_1_专业能力: 6.05
alignment_bench_v1_1_数学计算: 5.87
alignment_bench_v1_1_基本任务: 6.01
alignment_bench_v1_1_逻辑推理: 4.48
alignment_bench_v1_1_中文理解: 6.17
alignment_bench_v1_1_文本写作: 6.06
alignment_bench_v1_1_角色扮演: 6.3
alignment_bench_v1_1_综合问答: 6.45
alpaca_eval_helpful_base: 17.83
alpaca_eval_koala: 28.21
alpaca_eval_oasst: 23.4
alpaca_eval_selfinstruct: 30.95
alpaca_eval_vicuna: 25.00
compassarena_language_naive_average: 53.00
compassarena_knowledge_naive_average: 36
compassarena_reason_v2_naive_average: 35
compassarena_math_v2_naive_average: 16.07
compassarena_creationv2_zh_naive_average: 43.64
fofo_test_prompts_overall: 0.35
fofo_test_prompts_cn_overall: 0.41
followbench_llmeval_en_HSR_AVG: 0.73
followbench_llmeval_en_SSR_AVG: 0.88
followbench_llmeval_en_HSR_L1: 0.94
followbench_llmeval_en_HSR_L2: 0.77
followbench_llmeval_en_HSR_L3: 0.73
followbench_llmeval_en_HSR_L4: 0.68
followbench_llmeval_en_HSR_L5: 0.54
followbench_llmeval_en_SSR_L1: 0.94
followbench_llmeval_en_SSR_L2: 0.88
followbench_llmeval_en_SSR_L3: 0.87
followbench_llmeval_en_SSR_L4: 0.87
followbench_llmeval_en_SSR_L5: 0.85
simpleqa_f1: 0.04
internlm2_5-7b-chat-1m-turbomind:
long_context:
ruler_8k_naive_average: 88.53
ruler_32k_naive_average: 83.84
ruler_128k_naive_average: 70.94
NeedleBench-Overall-Score-8K_weighted_average: 91.89
NeedleBench-Overall-Score-32K_weighted_average: 91.42
NeedleBench-Overall-Score-128K_weighted_average: 88.57
longbench_naive_average: 46.44
longbench_zh_naive_average: 45.19
longbench_en_naive_average: 45.71
babilong_0k_naive_average: 79.3
babilong_4k_naive_average: 67
babilong_16k_naive_average: 52.7
babilong_32k_naive_average: 48.9
babilong_128k_naive_average: 40.8
babilong_256k_naive_average: 23.5
longbench_single-document-qa_score: 43.56
longbench_multi-document-qa_score: 46.24
longbench_summarization_score: 24.32
longbench_few-shot-learning_score: 51.67
longbench_synthetic-tasks_score: 66.83
longbench_code-completion_score: 45.99
qwen2.5-7b-instruct-turbomind:
objective:
race-high_accuracy: 84.99
ARC-c_accuracy: 92.2
BoolQ_accuracy: 86.7
triviaqa_wiki_1shot_score: 53.06
nq_open_1shot_score: 17.51
mmmlu_lite_naive_average: 54.96
IFEval_Prompt-level-strict-accuracy: 71.53
drop_accuracy: 80.07
bbh_naive_average: 68.81
GPQA_diamond_accuracy: 34.34
hellaswag_accuracy: 85.42
TheoremQA_score: 18.38
musr_average_naive_average: 43.44
korbench_single_naive_average: 39.44
ARC_Prize_Public_Evaluation_accuracy: 0
gsm8k_accuracy: 92.57
GaokaoBench_weighted_average: 80.14
math_accuracy: 73.58
cmo_fib_accuracy: 25
aime2024_accuracy: 16.67
Mathbench_naive_average: 77.33
wikibench-wiki-single_choice_cncircular_perf_4: 34.9
cmmlu_naive_average: 75.97
mmlu_naive_average: 76.01
mmlu_pro_naive_average: 56.12
openai_humaneval_humaneval_pass@1: 83.54
sanitized_mbpp_score: 74.71
humanevalx_naive_average: 48.29
ds1000_naive_average: 18.66
lcb_code_generation_pass@1: 39.5
lcb_code_execution_pass@1: 42.38
lcb_test_output_pass@1: 50.68
bigcodebench_hard_instruct_pass@1: 16.22
bigcodebench_hard_complete_pass@1: 11.49
teval_naive_average: 79.72
SciCode_sub_accuracy: 10.76
qa_dingo_cn_score: 99.01
mmlu_accuracy: 76.01
mmlu-stem_accuracy: 77.59
mmlu-social-science_accuracy: 79.02
mmlu-humanities_accuracy: 72.07
mmlu-other_accuracy: 74.86
cmmlu_accuracy: 75.97
cmmlu-stem_accuracy: 73.09
cmmlu-social-science_accuracy: 75.95
cmmlu-humanities_accuracy: 76.53
cmmlu-other_accuracy: 78.79
cmmlu-china-specific_accuracy: 73.17
mmlu_pro_accuracy: 56.12
mmlu_pro_biology_accuracy: 71.41
mmlu_pro_business_accuracy: 67.68
mmlu_pro_chemistry_accuracy: 54.59
mmlu_pro_computer_science_accuracy: 58.29
mmlu_pro_economics_accuracy: 66.82
mmlu_pro_engineering_accuracy: 42.41
mmlu_pro_health_accuracy: 55.87
mmlu_pro_history_accuracy: 46.46
mmlu_pro_law_accuracy: 28.97
mmlu_pro_math_accuracy: 73.13
mmlu_pro_philosophy_accuracy: 44.89
mmlu_pro_physics_accuracy: 58.43
mmlu_pro_psychology_accuracy: 63.16
mmlu_pro_other_accuracy: 53.57
humanevalx-python_pass@1: 50
humanevalx-cpp_pass@1: 42.07
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 53.05
humanevalx-js_pass@1: 75
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 8.18
ds1000_Tensorflow_accuracy: 17.78
ds1000_Scipy_accuracy: 15.09
ds1000_Sklearn_accuracy: 10.43
ds1000_Pytorch_accuracy: 4.41
ds1000_Matplotlib_accuracy: 60.65
mmmlu_lite_accuracy: 54.96
openai_mmmlu_lite_AR-XY_accuracy: 42.32
openai_mmmlu_lite_BN-BD_accuracy: 42.25
openai_mmmlu_lite_DE-DE_accuracy: 59.93
openai_mmmlu_lite_ES-LA_accuracy: 66.53
openai_mmmlu_lite_FR-FR_accuracy: 66.88
openai_mmmlu_lite_HI-IN_accuracy: 49.26
openai_mmmlu_lite_ID-ID_accuracy: 61.26
openai_mmmlu_lite_IT-IT_accuracy: 65.47
openai_mmmlu_lite_JA-JP_accuracy: 61.54
openai_mmmlu_lite_KO-KR_accuracy: 60.28
openai_mmmlu_lite_PT-BR_accuracy: 55.51
openai_mmmlu_lite_SW-KE_accuracy: 36.42
openai_mmmlu_lite_YO-NG_accuracy: 32.14
openai_mmmlu_lite_ZH-CN_accuracy: 69.61
college_naive_average: 44.33
high_naive_average: 59
middle_naive_average: 78
primary_naive_average: 85.67
arithmetic_naive_average: 75.67
mathbench-a (average)_naive_average: 69.27
college_knowledge_naive_average: 83.86
high_knowledge_naive_average: 80.29
middle_knowledge_naive_average: 84.26
primary_knowledge_naive_average: 93.16
mathbench-t (average)_naive_average: 85.39
internlm2_5-7b-chat-pytorch:
objective:
race-high_accuracy: 86.39
ARC-c_accuracy: 90.51
BoolQ_accuracy: 88.01
triviaqa_wiki_1shot_score: 64.77
nq_open_1shot_score: 22.71
mmmlu_lite_naive_average: 45.02
IFEval_Prompt-level-strict-accuracy: 56.56
drop_accuracy: 75.46
bbh_naive_average: 73.34
GPQA_diamond_accuracy: 32.83
hellaswag_accuracy: 94.81
TheoremQA_score: 23.88
musr_average_naive_average: 51.31
korbench_single_naive_average: 32
ARC_Prize_Public_Evaluation_accuracy: 0.01
gsm8k_accuracy: 86.96
GaokaoBench_weighted_average: 78.05
math_accuracy: 60.34
cmo_fib_accuracy: 12.98
aime2024_accuracy: 3.33
Mathbench_naive_average: 64.82
wikibench-wiki-single_choice_cncircular_perf_4: 31.7
cmmlu_naive_average: 74.24
mmlu_naive_average: 70.2
mmlu_pro_naive_average: 45.39
openai_humaneval_humaneval_pass@1: 70.12
sanitized_mbpp_score: 64.59
humanevalx_naive_average: 38.78
ds1000_naive_average: 14.19
lcb_code_generation_pass@1: 16.5
lcb_code_execution_pass@1: 33.82
lcb_test_output_pass@1: 22.62
bigcodebench_hard_instruct_pass@1: 6.08
bigcodebench_hard_complete_pass@1: 6.76
teval_naive_average: 79.73
SciCode_sub_accuracy: 3.47
qa_dingo_cn_score: 100
mmlu_accuracy: 70.2
mmlu-stem_accuracy: 67.73
mmlu-social-science_accuracy: 75.49
mmlu-humanities_accuracy: 68.56
mmlu-other_accuracy: 70.58
cmmlu_accuracy: 74.24
cmmlu-stem_accuracy: 66.7
cmmlu-social-science_accuracy: 75.88
cmmlu-humanities_accuracy: 77.56
cmmlu-other_accuracy: 77.52
cmmlu-china-specific_accuracy: 73.46
mmlu_pro_accuracy: 45.39
mmlu_pro_biology_accuracy: 65.83
mmlu_pro_business_accuracy: 51.96
mmlu_pro_chemistry_accuracy: 36.84
mmlu_pro_computer_science_accuracy: 48.29
mmlu_pro_economics_accuracy: 56.16
mmlu_pro_engineering_accuracy: 29.1
mmlu_pro_health_accuracy: 44.5
mmlu_pro_history_accuracy: 42.26
mmlu_pro_law_accuracy: 24.98
mmlu_pro_math_accuracy: 54.85
mmlu_pro_philosophy_accuracy: 39.28
mmlu_pro_physics_accuracy: 37.41
mmlu_pro_psychology_accuracy: 58.27
mmlu_pro_other_accuracy: 45.78
humanevalx-python_pass@1: 56.1
humanevalx-cpp_pass@1: 20.73
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 59.15
humanevalx-js_pass@1: 57.93
ds1000_Pandas_accuracy: 8.93
ds1000_Numpy_accuracy: 4.09
ds1000_Tensorflow_accuracy: 11.11
ds1000_Scipy_accuracy: 7.55
ds1000_Sklearn_accuracy: 7.83
ds1000_Pytorch_accuracy: 8.82
ds1000_Matplotlib_accuracy: 50.97
mmmlu_lite_accuracy: 45.02
openai_mmmlu_lite_AR-XY_accuracy: 18.6
openai_mmmlu_lite_BN-BD_accuracy: 27.58
openai_mmmlu_lite_DE-DE_accuracy: 51.23
openai_mmmlu_lite_ES-LA_accuracy: 56.63
openai_mmmlu_lite_FR-FR_accuracy: 58.11
openai_mmmlu_lite_HI-IN_accuracy: 33.82
openai_mmmlu_lite_ID-ID_accuracy: 50.39
openai_mmmlu_lite_IT-IT_accuracy: 50.39
openai_mmmlu_lite_JA-JP_accuracy: 50.95
openai_mmmlu_lite_KO-KR_accuracy: 45.05
openai_mmmlu_lite_PT-BR_accuracy: 57.89
openai_mmmlu_lite_SW-KE_accuracy: 32.14
openai_mmmlu_lite_YO-NG_accuracy: 32.14
openai_mmmlu_lite_ZH-CN_accuracy: 65.33
college_naive_average: 21
high_naive_average: 47
middle_naive_average: 59.67
primary_naive_average: 72.33
arithmetic_naive_average: 62
mathbench-a (average)_naive_average: 53.13
college_knowledge_naive_average: 68.99
high_knowledge_naive_average: 70.06
middle_knowledge_naive_average: 78.53
primary_knowledge_naive_average: 88.49
mathbench-t (average)_naive_average: 76.51
qwen2.5-7b-instruct-pytorch:
objective:
race-high_accuracy: 85.16
ARC-c_accuracy: 90.85
BoolQ_accuracy: 86.61
triviaqa_wiki_1shot_score: 52.96
nq_open_1shot_score: 17.62
mmmlu_lite_naive_average: 54.7
IFEval_Prompt-level-strict-accuracy: 71.35
drop_accuracy: 80.23
bbh_naive_average: 68.88
GPQA_diamond_accuracy: 36.36
hellaswag_accuracy: 85.49
TheoremQA_score: 18.38
musr_average_naive_average: 43.3
korbench_single_naive_average: 39.44
ARC_Prize_Public_Evaluation_accuracy: 0
gsm8k_accuracy: 91.66
GaokaoBench_weighted_average: 80.02
math_accuracy: 73.74
cmo_fib_accuracy: 22.60
aime2024_accuracy: 13.33
Mathbench_naive_average: 77.08
wikibench-wiki-single_choice_cncircular_perf_4: 34
cmmlu_naive_average: 75.9
mmlu_naive_average: 76.27
mmlu_pro_naive_average: 56.14
openai_humaneval_humaneval_pass@1: 84.76
sanitized_mbpp_score: 74.71
humanevalx_naive_average: 48.17
ds1000_naive_average: 18.57
lcb_code_generation_pass@1: 38.75
lcb_code_execution_pass@1: 42.38
lcb_test_output_pass@1: 50.45
bigcodebench_hard_instruct_pass@1: 16.89
bigcodebench_hard_complete_pass@1: 12.16
teval_naive_average: 79.46
SciCode_sub_accuracy: 10.42
qa_dingo_cn_score: 100
mmlu_accuracy: 76.27
mmlu-stem_accuracy: 77.75
mmlu-social-science_accuracy: 78.65
mmlu-humanities_accuracy: 73.12
mmlu-other_accuracy: 75.05
cmmlu_accuracy: 75.9
cmmlu-stem_accuracy: 73.41
cmmlu-social-science_accuracy: 75.97
cmmlu-humanities_accuracy: 76.42
cmmlu-other_accuracy: 78.15
cmmlu-china-specific_accuracy: 73.27
mmlu_pro_accuracy: 56.14
mmlu_pro_biology_accuracy: 72.25
mmlu_pro_business_accuracy: 66.16
mmlu_pro_chemistry_accuracy: 55.65
mmlu_pro_computer_science_accuracy: 60.24
mmlu_pro_economics_accuracy: 66.82
mmlu_pro_engineering_accuracy: 41.38
mmlu_pro_health_accuracy: 54.89
mmlu_pro_history_accuracy: 46.46
mmlu_pro_law_accuracy: 29.06
mmlu_pro_math_accuracy: 73.58
mmlu_pro_philosophy_accuracy: 44.89
mmlu_pro_physics_accuracy: 60.05
mmlu_pro_psychology_accuracy: 61.9
mmlu_pro_other_accuracy: 52.6
humanevalx-python_pass@1: 51.83
humanevalx-cpp_pass@1: 42.68
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 73.78
humanevalx-js_pass@1: 72.56
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 8.64
ds1000_Tensorflow_accuracy: 17.78
ds1000_Scipy_accuracy: 15.09
ds1000_Sklearn_accuracy: 8.7
ds1000_Pytorch_accuracy: 4.41
ds1000_Matplotlib_accuracy: 61.29
mmmlu_lite_accuracy: 54.7
openai_mmmlu_lite_AR-XY_accuracy: 42.32
openai_mmmlu_lite_BN-BD_accuracy: 42.18
openai_mmmlu_lite_DE-DE_accuracy: 60
openai_mmmlu_lite_ES-LA_accuracy: 66.18
openai_mmmlu_lite_FR-FR_accuracy: 66.88
openai_mmmlu_lite_HI-IN_accuracy: 48.63
openai_mmmlu_lite_ID-ID_accuracy: 61.26
openai_mmmlu_lite_IT-IT_accuracy: 65.26
openai_mmmlu_lite_JA-JP_accuracy: 60.7
openai_mmmlu_lite_KO-KR_accuracy: 60.63
openai_mmmlu_lite_PT-BR_accuracy: 54.46
openai_mmmlu_lite_SW-KE_accuracy: 36
openai_mmmlu_lite_YO-NG_accuracy: 31.86
openai_mmmlu_lite_ZH-CN_accuracy: 69.4
college_naive_average: 48.33
high_naive_average: 59.33
middle_naive_average: 76.67
primary_naive_average: 86.67
arithmetic_naive_average: 74.33
mathbench-a (average)_naive_average: 69.07
college_knowledge_naive_average: 83.54
high_knowledge_naive_average: 80.82
middle_knowledge_naive_average: 83.79
primary_knowledge_naive_average: 92.22
mathbench-t (average)_naive_average: 85.1
internlm3-8b-instruct-turbomind:
objective:
race-high_accuracy: 89.22
ARC-c_accuracy: 92.54
BoolQ_accuracy: 86.45
triviaqa_wiki_1shot_score: 60.72
nq_open_1shot_score: 20.25
mmmlu_lite_naive_average: 41.82
IFEval_Prompt-level-strict-accuracy: 77.45
drop_accuracy: 83.27
bbh_naive_average: 55.22
GPQA_diamond_accuracy: 37.88
hellaswag_accuracy: 91.28
TheoremQA_score: 20.12
musr_average_naive_average: 36.86
korbench_single_naive_average: 41.2
ARC_Prize_Public_Evaluation_accuracy: 0.06
gsm8k_accuracy: 91.28
GaokaoBench_weighted_average: 86.59
math_accuracy: 76.96
cmo_fib_accuracy: 38.46
aime2024_accuracy: 13.33
Mathbench_naive_average: 78.96
wikibench-wiki-single_choice_cncircular_perf_4: 37.45
cmmlu_naive_average: 83.33
mmlu_naive_average: 76.21
mmlu_pro_naive_average: 57.96
openai_humaneval_humaneval_pass@1: 81.71
sanitized_mbpp_score: 69.65
humanevalx_naive_average: 40.73
ds1000_naive_average: 27.23
lcb_code_generation_pass@1: 34.75
lcb_code_execution_pass@1: 49.9
lcb_test_output_pass@1: 48.19
bigcodebench_hard_instruct_pass@1: 13.51
bigcodebench_hard_complete_pass@1: 15.54
teval_naive_average: 82.86
SciCode_sub_accuracy: 11.11
qa_dingo_cn_score: 100
mmlu_accuracy: 76.21
mmlu-stem_accuracy: 77.7
mmlu-social-science_accuracy: 80.98
mmlu-humanities_accuracy: 70.83
mmlu-other_accuracy: 75.01
cmmlu_accuracy: 83.33
cmmlu-stem_accuracy: 79.66
cmmlu-social-science_accuracy: 83.39
cmmlu-humanities_accuracy: 84.73
cmmlu-other_accuracy: 86.2
cmmlu-china-specific_accuracy: 81.77
mmlu_pro_accuracy: 57.96
mmlu_pro_biology_accuracy: 75.45
mmlu_pro_business_accuracy: 64.64
mmlu_pro_chemistry_accuracy: 59.81
mmlu_pro_computer_science_accuracy: 60.24
mmlu_pro_economics_accuracy: 68.6
mmlu_pro_engineering_accuracy: 44.79
mmlu_pro_health_accuracy: 58.31
mmlu_pro_history_accuracy: 49.87
mmlu_pro_law_accuracy: 32.43
mmlu_pro_math_accuracy: 70.17
mmlu_pro_philosophy_accuracy: 46.89
mmlu_pro_physics_accuracy: 59.58
mmlu_pro_psychology_accuracy: 66.29
mmlu_pro_other_accuracy: 54.33
humanevalx-python_pass@1: 43.9
humanevalx-cpp_pass@1: 20.12
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 40.85
humanevalx-js_pass@1: 65.24
ds1000_Pandas_accuracy: 16.49
ds1000_Numpy_accuracy: 34.09
ds1000_Tensorflow_accuracy: 26.67
ds1000_Scipy_accuracy: 17.92
ds1000_Sklearn_accuracy: 20.87
ds1000_Pytorch_accuracy: 19.12
ds1000_Matplotlib_accuracy: 55.48
mmmlu_lite_accuracy: 41.82
openai_mmmlu_lite_AR-XY_accuracy: 32.56
openai_mmmlu_lite_BN-BD_accuracy: 4.56
openai_mmmlu_lite_DE-DE_accuracy: 24.91
openai_mmmlu_lite_ES-LA_accuracy: 51.09
openai_mmmlu_lite_FR-FR_accuracy: 61.68
openai_mmmlu_lite_HI-IN_accuracy: 24.98
openai_mmmlu_lite_ID-ID_accuracy: 44.56
openai_mmmlu_lite_IT-IT_accuracy: 52.35
openai_mmmlu_lite_JA-JP_accuracy: 51.02
openai_mmmlu_lite_KO-KR_accuracy: 47.93
openai_mmmlu_lite_PT-BR_accuracy: 53.89
openai_mmmlu_lite_SW-KE_accuracy: 33.47
openai_mmmlu_lite_YO-NG_accuracy: 33.47
openai_mmmlu_lite_ZH-CN_accuracy: 69.05
college_naive_average: 45.67
high_naive_average: 64.67
middle_naive_average: 82.33
primary_naive_average: 90.33
arithmetic_naive_average: 74
mathbench-a (average)_naive_average: 71.4
college_knowledge_naive_average: 85.28
high_knowledge_naive_average: 79.43
middle_knowledge_naive_average: 87.9
primary_knowledge_naive_average: 93.42
mathbench-t (average)_naive_average: 86.51
internlm3-8b-instruct-pytorch:
objective:
race-high_accuracy: 89.02
ARC-c_accuracy: 93.56
BoolQ_accuracy: 86.67
triviaqa_wiki_1shot_score: 60.54
nq_open_1shot_score: 20.3
mmmlu_lite_naive_average: 42.6
IFEval_Prompt-level-strict-accuracy: 79.11
drop_accuracy: 83.32
bbh_naive_average: 54.76
GPQA_diamond_accuracy: 33.84
hellaswag_accuracy: 91.31
TheoremQA_score: 18
musr_average_naive_average: 36.62
korbench_single_naive_average: 41.84
ARC_Prize_Public_Evaluation_accuracy: 0.06
gsm8k_accuracy: 90.67
GaokaoBench_weighted_average: 86.27
math_accuracy: 76.68
cmo_fib_accuracy: 33.65
aime2024_accuracy: 10
Mathbench_naive_average: 78.92
wikibench-wiki-single_choice_cncircular_perf_4: 37.35
cmmlu_naive_average: 83.11
mmlu_naive_average: 76.23
mmlu_pro_naive_average: 58.16
openai_humaneval_humaneval_pass@1: 82.32
sanitized_mbpp_score: 70.04
humanevalx_naive_average: 25.49
ds1000_naive_average: 27.84
lcb_code_generation_pass@1: 34.5
lcb_code_execution_pass@1: 48.02
lcb_test_output_pass@1: 47.74
bigcodebench_hard_instruct_pass@1: 12.84
bigcodebench_hard_complete_pass@1: 15.54
teval_naive_average: 82.86
SciCode_sub_accuracy: 9.38
qa_dingo_cn_score: 100
mmlu_accuracy: 76.23
mmlu-stem_accuracy: 78.08
mmlu-social-science_accuracy: 80.31
mmlu-humanities_accuracy: 71.38
mmlu-other_accuracy: 74.63
cmmlu_accuracy: 83.11
cmmlu-stem_accuracy: 79.42
cmmlu-social-science_accuracy: 83.34
cmmlu-humanities_accuracy: 83.95
cmmlu-other_accuracy: 86.22
cmmlu-china-specific_accuracy: 81.5
mmlu_pro_accuracy: 58.16
mmlu_pro_biology_accuracy: 74.62
mmlu_pro_business_accuracy: 65.02
mmlu_pro_chemistry_accuracy: 60.69
mmlu_pro_computer_science_accuracy: 61.46
mmlu_pro_economics_accuracy: 68.25
mmlu_pro_engineering_accuracy: 45.3
mmlu_pro_health_accuracy: 60.15
mmlu_pro_history_accuracy: 50.66
mmlu_pro_law_accuracy: 31.7
mmlu_pro_math_accuracy: 70.32
mmlu_pro_philosophy_accuracy: 47.7
mmlu_pro_physics_accuracy: 59.51
mmlu_pro_psychology_accuracy: 65.41
mmlu_pro_other_accuracy: 53.46
humanevalx-python_pass@1: 42.68
humanevalx-cpp_pass@1: 19.51
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 0.00
humanevalx-js_pass@1: 64.02
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 35
ds1000_Tensorflow_accuracy: 24.44
ds1000_Scipy_accuracy: 20.75
ds1000_Sklearn_accuracy: 21.74
ds1000_Pytorch_accuracy: 22.06
ds1000_Matplotlib_accuracy: 56.77
mmmlu_lite_accuracy: 42.6
openai_mmmlu_lite_AR-XY_accuracy: 32.84
openai_mmmlu_lite_BN-BD_accuracy: 10.46
openai_mmmlu_lite_DE-DE_accuracy: 24.56
openai_mmmlu_lite_ES-LA_accuracy: 50.95
openai_mmmlu_lite_FR-FR_accuracy: 61.05
openai_mmmlu_lite_HI-IN_accuracy: 30.6
openai_mmmlu_lite_ID-ID_accuracy: 45.89
openai_mmmlu_lite_IT-IT_accuracy: 51.79
openai_mmmlu_lite_JA-JP_accuracy: 51.65
openai_mmmlu_lite_KO-KR_accuracy: 48.77
openai_mmmlu_lite_PT-BR_accuracy: 52.7
openai_mmmlu_lite_SW-KE_accuracy: 32.91
openai_mmmlu_lite_YO-NG_accuracy: 32.84
openai_mmmlu_lite_ZH-CN_accuracy: 69.33
college_naive_average: 47
high_naive_average: 66.67
middle_naive_average: 81.67
primary_naive_average: 89.33
arithmetic_naive_average: 73.67
mathbench-a (average)_naive_average: 71.67
college_knowledge_naive_average: 82.91
high_knowledge_naive_average: 79.86
middle_knowledge_naive_average: 88.92
primary_knowledge_naive_average: 92.96
mathbench-t (average)_naive_average: 86.16

@ -0,0 +1,432 @@
chat:
glm-4-9b-chat-hf:
gsm8k_accuracy: 56.25
race-high_accuracy: 84.38
glm-4-9b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 90.62
glm-4-9b-chat-vllm:
gsm8k_accuracy: 71.88
race-high_accuracy: 90.62
deepseek-7b-chat-hf:
gsm8k_accuracy: 46.88
race-high_accuracy: 81.25
deepseek-r1-distill-llama-8b-turbomind:
gsm8k_accuracy: 34.38
race-high_accuracy: 81.25
deepseek-r1-distill-qwen-1_5b-turbomind:
gsm8k_accuracy: 28.12
race-high_accuracy: 53.12
deepseek-7b-chat-vllm:
gsm8k_accuracy: 56.25
race-high_accuracy: 78.12
gemma2-2b-it-hf:
gsm8k_accuracy: 50
race-high_accuracy: 75
gemma2-9b-it-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 84.38
gemma-2b-it-hf:
gsm8k_accuracy: 3.12
race-high_accuracy: 40.62
gemma-7b-it-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 68.75
gemma-2-9b-it-turbomind:
gsm8k_accuracy: 68.75
race-high_accuracy: 84.38
gemma-2-27b-it-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 93.75
gemma-7b-it-vllm:
gsm8k_accuracy: 28.12
race-high_accuracy: 68.75
internlm2_5-7b-chat-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
internlm3-8b-instruct-hf:
gsm8k_accuracy: 65.62
race-high_accuracy: 87.5
internlm2_5-7b-chat-turbomind:
gsm8k_accuracy: 81.25
race-high_accuracy: 90.62
internlm2-chat-1.8b-turbomind:
gsm8k_accuracy: 25.00
race-high_accuracy: 84.38
internlm2-chat-1.8b-sft-turbomind:
gsm8k_accuracy: 34.38
race-high_accuracy: 84.38
internlm2-chat-7b-lmdeploy:
gsm8k_accuracy: 59.38
race-high_accuracy: 87.50
internlm2-chat-7b-sft-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 87.50
internlm3-8b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 87.5
internlm2-chat-7b-vllm:
gsm8k_accuracy: 53.12
race-high_accuracy: 87.50
llama-3_1-8b-instruct-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
llama-3_2-3b-instruct-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 81.25
llama-3-8b-instruct-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 87.5
llama-2-7b-chat-turbomind:
gsm8k_accuracy: 18.75
race-high_accuracy: 46.88
llama-3_1-8b-instruct-turbomind:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
llama-3_2-3b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 81.25
llama-3-8b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 84.38
mistral-7b-instruct-v0.2-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 75
mistral-7b-instruct-v0.3-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 75
mistral-nemo-instruct-2407-hf:
gsm8k_accuracy: 75
race-high_accuracy: 81.25
mistral-nemo-instruct-2407-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 75
mistral-7b-instruct-v0.1-vllm:
gsm8k_accuracy: 34.38
race-high_accuracy: 65.62
mistral-7b-instruct-v0.2-vllm:
gsm8k_accuracy: 28.12
race-high_accuracy: 78.12
qwen2.5-0.5b-instruct-hf:
gsm8k_accuracy: 34.38
race-high_accuracy: 46.88
qwen2.5-3b-instruct-hf:
gsm8k_accuracy: 53.12
race-high_accuracy: 90.62
qwen2.5-0.5b-instruct-turbomind:
gsm8k_accuracy: 28.12
race-high_accuracy: 43.75
qwen2.5-3b-instruct-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 90.62
qwen1.5-0.5b-chat-hf:
gsm8k_accuracy: 0
race-high_accuracy: 53.12
qwen2-1.5b-instruct-hf:
gsm8k_accuracy: 62.5
race-high_accuracy: 84.38
qwen2-7b-instruct-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 90.62
qwen2-1.5b-instruct-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 84.38
qwen2-7b-instruct-turbomind:
gsm8k_accuracy: 75.00
race-high_accuracy: 87.50
qwen1.5-0.5b-chat-vllm:
gsm8k_accuracy: 6.25
race-high_accuracy: 53.12
yi-1.5-6b-chat-hf:
gsm8k_accuracy: 65.62
race-high_accuracy: 84.38
yi-1.5-9b-chat-hf:
gsm8k_accuracy: 75
race-high_accuracy: 93.75
yi-1.5-6b-chat-turbomind:
gsm8k_accuracy: 59.38
race-high_accuracy: 84.38
yi-1.5-9b-chat-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 93.75
deepseek-v2_lite-chat-turbomind:
gsm8k_accuracy: 43.75
race-high_accuracy: 71.88
gemma2-27b-it-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 93.75
internlm2_5-20b-chat-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 87.5
internlm2_5-20b-chat-turbomind:
gsm8k_accuracy: 87.50
race-high_accuracy: 87.5
mistral-small-instruct-2409-hf:
gsm8k_accuracy: 81.25
race-high_accuracy: 87.50
mistral-small-instruct-2409-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 87.50
phi-4:
gsm8k_accuracy: 81.25
race-high_accuracy: 87.50
qwen2.5-14b-instruct-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 96.88
qwen2.5-14b-instruct-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 96.88
yi-1.5-34b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 93.75
deepseek-67b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 75.00
deepseek-r1-distill-qwen-32b-turbomind:
gsm8k_accuracy: 31.25
race-high_accuracy: 90.62
llama-3_3-70b-instruct-turbomind:
gsm8k_accuracy: 93.75
race-high_accuracy: 87.5
mixtral-large-instruct-2411-turbomind:
gsm8k_accuracy: 87.50
race-high_accuracy: 93.75
nvidia-3_1-Nemotron-70b-instruct-HF-turbomind:
gsm8k_accuracy: 90.62
race-high_accuracy: 53.12
qwen2.5-72b-instruct-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 90.62
deepseek-r1-distill-llama-70b-turbomind:
gsm8k_accuracy: 50.00
race-high_accuracy: 87.50
deepseek-v2_5-1210-turbomind:
gsm8k_accuracy: 90.62
race-high_accuracy: 84.38
mixtral-8x22b-instruct-v0.1-turbomind:
gsm8k_accuracy: 75.00
race-high_accuracy: 78.12
mixtral-8x22b-instruct-v0.1-vllm:
gsm8k_accuracy: 78.12
race-high_accuracy: 78.12
base:
glm-4-9b-turbomind:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 28.12
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
deepseek-7b-base-hf:
gsm8k_accuracy: 25
GPQA_diamond_accuracy: 0
race-high_accuracy: 46.88
winogrande_accuracy: 71.88
deepseek-7b-base-turbomind:
gsm8k_accuracy: 18.75
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 50.00
winogrande_accuracy: 84.38
deepseek-moe-16b-base-vllm:
gsm8k_accuracy: 25.00
GPQA_diamond_accuracy: 0
race-high_accuracy: 25
winogrande_accuracy: 68.75
gemma2-2b-hf:
gsm8k_accuracy: 31.25
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 56.25
winogrande_accuracy: 75.00
gemma2-9b-hf:
gsm8k_accuracy: 75.00
GPQA_diamond_accuracy: 0
race-high_accuracy: 84.38
winogrande_accuracy: 81.25
gemma-2b-hf:
gsm8k_accuracy: 21.88
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 21.88
winogrande_accuracy: 53.12
gemma-7b-hf:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 65.62
winogrande_accuracy: 71.88
gemma-2-9b-turbomind:
gsm8k_accuracy: 68.75
GPQA_diamond_accuracy: 0
race-high_accuracy: 84.38
winogrande_accuracy: 81.25
gemma-2b-vllm:
gsm8k_accuracy: 15.62
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 28.12
winogrande_accuracy: 68.75
gemma-7b-vllm:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 81.25
winogrande_accuracy: 81.25
internlm2_5-7b-hf:
gsm8k_accuracy: 37.5
GPQA_diamond_accuracy: 25
race-high_accuracy: 93.75
winogrande_accuracy: 71.88
internlm2-7b-hf:
gsm8k_accuracy: 53.12
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 62.5
winogrande_accuracy: 78.12
internlm2-1.8b-turbomind:
gsm8k_accuracy: 12.50
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 71.88
winogrande_accuracy: 75
internlm2_5-7b-turbomind:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 87.5
internlm2-7b-turbomind:
gsm8k_accuracy: 53.12
GPQA_diamond_accuracy: 25.00
race-high_accuracy: 78.12
winogrande_accuracy: 71.88
internlm2-base-7b-turbomind:
gsm8k_accuracy: 25.00
GPQA_diamond_accuracy: 34.38
race-high_accuracy: 71.88
winogrande_accuracy: 62.50
llama-2-7b-hf:
gsm8k_accuracy: 21.88
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 40.62
winogrande_accuracy: 71.88
llama-3_1-8b-hf:
gsm8k_accuracy: 78.12
GPQA_diamond_accuracy: 25
race-high_accuracy: 90.62
winogrande_accuracy: 62.5
llama-3-8b-hf:
gsm8k_accuracy: 46.88
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 65.62
winogrande_accuracy: 65.62
llama-3.1-8b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 78.12
winogrande_accuracy: 78.12
llama-3-8b-turbomind:
gsm8k_accuracy: 46.88
GPQA_diamond_accuracy: 12.50
race-high_accuracy: 65.62
winogrande_accuracy: 81.25
mistral-7b-v0.3-hf:
gsm8k_accuracy: 31.25
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 62.5
winogrande_accuracy: 59.38
qwen2.5-7b-hf:
gsm8k_accuracy: 81.25
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 87.5
winogrande_accuracy: 71.88
qwen2.5-1.5b-turbomind:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 78.12
winogrande_accuracy: 71.88
qwen2.5-7b-turbomind:
gsm8k_accuracy: 78.12
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 87.5
winogrande_accuracy: 75.00
qwen1.5-moe-a2.7b-hf:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 84.38
winogrande_accuracy: 75
qwen2-0.5b-hf:
gsm8k_accuracy: 25
GPQA_diamond_accuracy: 0
race-high_accuracy: 40.62
winogrande_accuracy: 62.5
qwen2-1.5b-hf:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 81.25
winogrande_accuracy: 62.5
qwen2-7b-hf:
gsm8k_accuracy: 68.75
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 87.5
winogrande_accuracy: 68.75
qwen2-1.5b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 12.50
race-high_accuracy: 81.25
winogrande_accuracy: 75
qwen2-7b-turbomind:
gsm8k_accuracy: 65.62
GPQA_diamond_accuracy: 12.5
race-high_accuracy: 87.5
winogrande_accuracy: 75
qwen1.5-0.5b-vllm:
gsm8k_accuracy: 9.38
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 56.25
winogrande_accuracy: 59.38
yi-1.5-6b-hf:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 87.5
winogrande_accuracy: 62.5
yi-1.5-9b-hf:
gsm8k_accuracy: 75
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 87.5
winogrande_accuracy: 59.38
yi-1.5-9b-turbomind:
gsm8k_accuracy: 75.00
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 87.5
winogrande_accuracy: 65.62
internlm2-20b-turbomind:
gsm8k_accuracy: 71.88
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 68.75
winogrande_accuracy: 81.25
qwen2.5-14b-hf:
gsm8k_accuracy: 75
GPQA_diamond_accuracy: 37.5
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
qwen2.5-32b-hf:
gsm8k_accuracy: 87.5
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 78.12
qwen2.5-32b-turbomind:
gsm8k_accuracy: 90.62
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 81.25
deepseek-67b-base-turbomind:
gsm8k_accuracy: 62.50
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 78.12
winogrande_accuracy: 81.25
llama-3-70b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 15.62
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
qwen2.5-72b-turbomind:
gsm8k_accuracy: 84.38
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 93.75
winogrande_accuracy: 87.5
deepseek-v2-turbomind:
gsm8k_accuracy: 65.62
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 93.75
winogrande_accuracy: 81.25

.github/scripts/pr_oc_score_assert.py (new file)
@ -0,0 +1,77 @@
import csv
import os

import pytest

output_path = 'regression_result'
model = 'internlm2-chat-7b-hf'
dataset = 'siqa'


@pytest.fixture()
def result_scores():
    file = find_csv_files(output_path)
    if file is None:
        return None
    return read_csv_file(file)


@pytest.mark.usefixtures('result_scores')
class TestChatScore:
    """Test cases for chat model."""

    def test_model_dataset_score(self, result_scores):
        result_score = result_scores.get(model).get(dataset)
        assert_score(result_score, 79.53)


def assert_score(score, baseline):
    if score is None or score == '-':
        assert False, 'value is none'
    # Pass when the score lies within +/-3% of the recorded baseline.
    if float(score) < (baseline * 1.03) and float(score) > (baseline * 0.97):
        print(score + ' between ' + str(baseline * 0.97) + ' and ' +
              str(baseline * 1.03))
        assert True
    else:
        assert False, score + ' not between ' + str(
            baseline * 0.97) + ' and ' + str(baseline * 1.03)


def find_csv_files(directory):
    csv_files = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.csv'):
                csv_files.append(os.path.join(root, file))
    if len(csv_files) > 1:
        raise RuntimeError(
            'more than 1 result file found, please check the results manually')
    if len(csv_files) == 0:
        return None
    return csv_files[0]


def read_csv_file(file_path):
    with open(file_path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        filtered_data = []
        for row in reader:
            filtered_row = {
                k: v
                for k, v in row.items()
                if k not in ['version', 'metric', 'mode']
            }
            filtered_data.append(filtered_row)

    result = {}
    for data in filtered_data:
        dataset = data.get('dataset')
        for key in data.keys():
            if key == 'dataset':
                continue
            if key in result.keys():
                result.get(key)[dataset] = data.get(key)
            else:
                result[key] = {dataset: data.get(key)}
    return result
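For context, the test above passes a run when its score lands within ±3% of the recorded baseline (here 79.53 for `internlm2-chat-7b-hf` on `siqa`). The snippet below is a minimal, self-contained sketch of that tolerance check with illustrative scores; the helper name `within_tolerance` is an assumption for illustration and is not part of the repository.

```python
# Minimal sketch (helper name is an assumption): the +/-3% tolerance used by assert_score.
def within_tolerance(score: float, baseline: float, rel_tol: float = 0.03) -> bool:
    """Return True when `score` lies strictly within rel_tol of `baseline`."""
    return baseline * (1 - rel_tol) < score < baseline * (1 + rel_tol)

if __name__ == '__main__':
    baseline = 79.53                      # baseline used in the test above
    for score in (80.1, 77.5, 85.0):      # illustrative run scores
        verdict = 'pass' if within_tolerance(score, baseline) else 'fail'
        print(f'{score} vs baseline {baseline}: {verdict}')
```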

.github/workflows/daily-run-test.yml (new file)
@ -0,0 +1,351 @@
name: daily_run_test
on:
workflow_dispatch:
inputs:
repo_org:
required: false
description: 'Tested repository organization name. Default is open-compass/opencompass'
type: string
default: 'open-compass/opencompass'
repo_ref:
required: false
description: 'Set branch or tag or commit id. Default is "main"'
type: string
default: 'main'
build_lmdeploy:
required: false
description: 'whether to build lmdeploy'
type: boolean
default: false
repo_org_lmdeploy:
required: false
description: 'Tested repository organization name. Default is internlm/lmdeploy'
type: string
default: 'InternLM/lmdeploy'
repo_ref_lmdeploy:
required: false
description: 'Set branch or tag or commit id. Default is "main"'
type: string
default: 'main'
regression_func_volc:
required: true
description: 'regression functions'
type: string
default: "['chat_models','base_models', 'chat_obj_fullbench', 'base_fullbench']"
regression_func_local:
required: true
description: 'regression functions'
type: string
default: "['cmd', 'api', 'chat_sub_fullbench']"
fullbench_eval:
required: true
description: 'fullbench volc functions'
type: string
default: "['base_objective','chat_objective','chat_subjective','base_long_context','chat_long_context']"
schedule:
- cron: '15 14 * * 0,3'
env:
HF_DATASETS_OFFLINE: 1
HF_EVALUATE_OFFLINE: 1
TRANSFORMERS_OFFLINE: 1
VLLM_USE_MODELSCOPE: false
LMDEPLOY_USE_MODELSCOPE: false
HF_HUB_OFFLINE: 1
OUTPUT_FOLDER: cuda12.1_dist_${{ github.run_id }}
CONDA_PATH: ${{ secrets.WORKSPACE_PREFIX }}/miniconda3
PIP_CACHE_PATH: ${{ secrets.WORKSPACE_PREFIX }}/.cache/pip
REPORT_ROOT: ${{ secrets.WORKSPACE_PREFIX }}/eval_report/regression
COMPASS_DATA_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/compass_data_cache
HUGGINGFACE_HUB_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/models/opencompass_hf_hub
HF_HUB_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/models/opencompass_hf_hub
HF_DATASETS_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/hf_datasets_cache
HF_ENDPOINT: https://hf-mirror.com
CONDA_ENV: regression_test
VLLM_WORKER_MULTIPROC_METHOD: spawn
jobs:
build-pypi:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Build lagent
run: |
pip install wheel setuptools
python setup.py sdist bdist_wheel
- name: Upload Artifacts
uses: actions/upload-artifact@v4
with:
if-no-files-found: error
path: dist/*
retention-days: 1
name: my-artifact-${{ github.run_id }}
build-pypi-lmdeploy:
if: ${{!cancelled() && (github.event_name == 'schedule' || inputs.build_lmdeploy)}}
strategy:
matrix:
pyver: [py310]
runs-on: ubuntu-latest
env:
PYTHON_VERSION: ${{ matrix.pyver }}
PLAT_NAME: manylinux2014_x86_64
DOCKER_TAG: cuda12.1
steps:
- name: Checkout repository
uses: actions/checkout@v3
with:
repository: ${{ github.event.inputs.repo_org_lmdeploy || 'InternLM/lmdeploy' }}
ref: ${{github.event.inputs.repo_ref_lmdeploy || 'main'}}
- name: Build
run: |
echo ${PYTHON_VERSION}
echo ${PLAT_NAME}
echo ${DOCKER_TAG}
echo ${OUTPUT_FOLDER}
echo ${GITHUB_RUN_ID}
# remove -it
sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
- name: Upload Artifacts
uses: actions/upload-artifact@v4
with:
if-no-files-found: error
path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
retention-days: 1
name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}
prepare_env:
if: ${{!cancelled()}}
needs: ['build-pypi', 'build-pypi-lmdeploy']
runs-on: volc_cu12
timeout-minutes: 120 #2hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: Download Artifacts
uses: actions/download-artifact@v4
with:
name: my-artifact-${{ github.run_id }}
- name: Remove Conda Env
if: always()
run: |
. ${{ secrets.WORKSPACE_PREFIX }}/miniconda3/bin/activate
conda env remove -y --name ${{env.CONDA_ENV}}
conda info --envs
- name: Prepare - create conda env and install torch - cu12
uses: nick-fields/retry@v3
with:
max_attempts: 3
timeout_minutes: 120
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda create -y --name ${{env.CONDA_ENV}} python=3.10
conda activate ${{env.CONDA_ENV}}
pip install -r ${{ secrets.WORKSPACE_PREFIX }}/config/requirements.txt --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass*.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[lmdeploy] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[vllm] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[full] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[api] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --cache-dir ${{env.PIP_CACHE_PATH}}
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install ${{ secrets.WORKSPACE_PREFIX }}/packages/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install xformers --index-url https://download.pytorch.org/whl/cu121 --cache-dir ${{env.PIP_CACHE_PATH}}
cp -r /root/nltk_data ${{env.CONDA_PATH}}/envs/${{env.CONDA_ENV}}/nltk_data
- name: Prepare - reinstall lmdeploy - cu12
if: ${{github.event_name == 'schedule' || inputs.build_lmdeploy}}
uses: actions/download-artifact@v4
with:
name: my-artifact-${{ github.run_id }}-py310
- name: Prepare - reinstall lmdeploy - cu12
if: ${{github.event_name == 'schedule' || inputs.build_lmdeploy}}
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
pip uninstall -y lmdeploy
pip install lmdeploy-*.whl --no-deps
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
daily_run_test_volc:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
regression_func: ${{fromJSON(github.event.inputs.regression_func_volc || '["chat_models","base_models","chat_obj_fullbench","base_fullbench"]')}}
runs-on: volc_cu12_daily
timeout-minutes: 180 #3hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: modify config
if: matrix.regression_func != 'chat_sub_fullbench'
run: |
cp -r ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/configs_cluster/volc.py .
cat ${{ secrets.WORKSPACE_PREFIX }}/config/test_config.txt >> .github/scripts/eval_regression_${{matrix.regression_func}}.py
- name: Run test
uses: nick-fields/retry@v3
with:
max_attempts: 1
timeout_minutes: 180
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
opencompass .github/scripts/eval_regression_${{matrix.regression_func}}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}} --reuse --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}}/*/summary regression_result_daily
python -m pytest -m ${{matrix.regression_func}} -s -v --color=yes .github/scripts/oc_score_assert.py
daily_run_test_local:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
regression_func: ${{fromJSON(github.event.inputs.regression_func_local || '["cmd","api","chat_sub_fullbench"]')}}
runs-on: volc_cu12_local
timeout-minutes: 480 #8hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: modify config
if: matrix.regression_func == 'chat_sub_fullbench'
run: |
cp -r ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/configs_cluster/volc.py .
cat ${{ secrets.WORKSPACE_PREFIX }}/config/test_config_sub.txt >> .github/scripts/eval_regression_${{matrix.regression_func}}.py
- name: Run command testcase
if: matrix.regression_func == 'cmd'
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
python tools/list_configs.py internlm2_5 mmlu
opencompass --models hf_internlm2_5_7b --datasets race_ppl demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd1 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd1/*/summary regression_result_daily
python -m pytest -m case1 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --models hf_internlm2_5_7b_chat hf_internlm3_8b_instruct --datasets race_gen demo_gsm8k_chat_gen -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd2 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd2/*/summary regression_result_daily
python -m pytest -m case2 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_ppl demo_gsm8k_chat_gen --hf-type base --hf-path internlm/internlm2_5-7b --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd3 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd3/*/summary regression_result_daily
python -m pytest -m case3 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_gen demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm3-8b-instruct -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd4 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd4/*/summary regression_result_daily
python -m pytest -m case4 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_gen demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm3-8b-instruct -a vllm --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd5 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd5/*/summary regression_result_daily
python -m pytest -m case5 -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run model test - api
if: matrix.regression_func == 'api'
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
lmdeploy serve api_server internlm/internlm3-8b-instruct --max-batch-size 256 --model-name internlm3 > ${{env.REPORT_ROOT}}/${{ github.run_id }}/restful.log 2>&1 &
echo "restful_pid=$!" >> "$GITHUB_ENV"
sleep 180s
env | grep PROXY
env | grep proxy
unset HTTP_PROXY;unset HTTPS_PROXY;unset http_proxy;unset https_proxy;
opencompass .github/scripts/eval_regression_api.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/api --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/api/*/summary regression_result_daily
python -m pytest -m api -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run model test - api kill
if: always() && matrix.regression_func == 'api'
run: |
kill -15 "$restful_pid"
- name: Run testcase
if: matrix.regression_func == 'chat_sub_fullbench'
env:
COMPASS_DATA_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/compass_data_cache_subset
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
opencompass .github/scripts/eval_regression_${{matrix.regression_func}}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}} --reuse --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}}/*/summary regression_result_daily
python -m pytest -m ${{matrix.regression_func}} -s -v --color=yes .github/scripts/oc_score_assert.py
fullbench_run_test:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
function_type: ${{fromJSON(github.event.inputs.fullbench_eval || '["base_objective","chat_objective","chat_subjective","base_long_context","chat_long_context"]')}}
runs-on: volc_cu12
timeout-minutes: 480 #8hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: Run testcase
uses: nick-fields/retry@v3
with:
max_attempts: 1
timeout_minutes: 480
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
opencompass ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/regression/eval_${{ matrix.function_type }}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{ matrix.function_type }} --reuse
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{ matrix.function_type }}/*/summary regression_result_daily
python -m pytest -m ${{ matrix.function_type }} -s -v --color=yes .github/scripts/oc_score_assert.py
notify_to_feishu:
if: ${{ always() && github.event_name == 'schedule' && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'develop' || github.ref_name == 'main') }}
needs: [daily_run_test_volc, daily_run_test_local, fullbench_run_test]
timeout-minutes: 5
runs-on: self-hosted
steps:
- name: notify
run: |
curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"Opencompass- Daily test failed","content":[[{"tag":"text","text":"branch: ${{github.ref_name}}, run action: ${{github.workflow}} failed. "},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}

.github/workflows/link-check.yml (new file)
@ -0,0 +1,26 @@
name: 'Link check'
on:
schedule:
# check links at 01:30 a.m. every day
- cron: '30 1 * * *'
workflow_dispatch: # allow manual trigger
jobs:
link-check:
runs-on: ubuntu-latest
steps:
# - uses: actions/checkout@v3
- name: Install linkchecker
run: |
pip install linkchecker
- name: Run linkchecker
run: |
linkchecker https://opencompass.readthedocs.io/ --no-robots -t 30 --no-warnings \
--ignore-url "https://opencompass.readthedocs.io/.*/static/images/opencompass_logo.svg" \
--ignore-url "https://opencompass.readthedocs.io/.*/_static/images/icon-menu-dots.svg" \
--ignore-url "https://opencompass.readthedocs.io/policy" \
--ignore-url "https://opencompass.readthedocs.io/(en|zh_CN)/[0-9a-f]{40}/.*"

@ -17,7 +17,7 @@ jobs:
python-version: '3.10'
- name: Install pre-commit hook
run: |
pip install pre-commit mmengine
pip install pre-commit==3.8.0 mmengine==0.10.5
pre-commit install
- name: Linting
run: pre-commit run --all-files

.github/workflows/pr-run-test.yml (new file)
@ -0,0 +1,106 @@
name: pr_run_test
on:
pull_request:
paths-ignore:
- 'README.md'
- 'README_zh-CN.md'
- 'docs/**'
- 'configs/**'
- 'tools/**'
workflow_dispatch:
schedule:
- cron: '56 22 * * *'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
CONDA_ENV: pr_test
HF_DATASETS_OFFLINE: 1
HF_EVALUATE_OFFLINE: 1
TRANSFORMERS_OFFLINE: 1
VLLM_USE_MODELSCOPE: false
LMDEPLOY_USE_MODELSCOPE: false
HF_HUB_OFFLINE: 1
CONDA_PATH: /fs-computility/llm/qa-llm-cicd/miniconda3
PIP_CACHE_PATH: /fs-computility/llm/qa-llm-cicd/.cache/pip
REPORT_ROOT: /fs-computility/llm/qa-llm-cicd/eval_report/prtest
COMPASS_DATA_CACHE: /fs-computility/llm/shared/llmeval/datasets/compass_data_cache
HUGGINGFACE_HUB_CACHE: /fs-computility/llm/shared/llmeval/models/opencompass_hf_hub
HF_HUB_CACHE: /fs-computility/llm/shared/llmeval/models/opencompass_hf_hub
jobs:
pr_run_test:
runs-on: volc_cu12_local
environment: 'prod'
timeout-minutes: 30
steps:
- name: Checkout repository
uses: actions/checkout@v2
- name: Prepare - Install opencompass
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
python3 -m pip uninstall opencompass -y
python3 -m pip install -e ".[full]" --cache-dir ${{env.PIP_CACHE_PATH}}
conda info --envs
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
lmdeploy check_env
- name: Run test
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
rm -rf regression_result
opencompass --models hf_internlm2_5_20b_chat --datasets demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result1 --debug
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result2 --debug --max-num-workers 2
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result3 --debug --max-num-workers 2
- name: Get result
run: |
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result1/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 88 && ${score%.*} <= 89 )); then
echo "score is $score between 88 and 89"
else
echo "score is $score not between 88 and 89"
exit 1
fi
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result2/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 87 && ${score%.*} <= 88 )); then
echo "score is $score between 87 and 88"
else
echo "score is $score not between 87 and 88"
exit 1
fi
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result3/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 87 && ${score%.*} <= 91 )); then
echo "score is $score between 87 and 91"
else
echo "score is $score not between 87 and 91"
exit 1
fi
- name: Uninstall opencompass
if: always()
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
python3 -m pip uninstall opencompass -y
conda info --envs
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'develop' || github.ref_name == 'main') }}
needs: [pr_run_test]
timeout-minutes: 5
runs-on: self-hosted
environment: 'prod'
steps:
- name: notify
run: |
curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"Opencompass- pr test failed","content":[[{"tag":"text","text":"branch: ${{github.ref_name}}, run action: ${{github.workflow}} failed. "},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
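The `Get result` step above takes the last column of the final row of each run's summary CSV and checks that the truncated score falls inside a fixed band. The sketch below reproduces that check in plain Python for local debugging; the file name `summary.csv` and the 88-89 band are illustrative assumptions, not values read from the workflow.

```python
# Minimal sketch (not part of the repo): mirror the workflow's score check.
# Assumes a summary CSV whose final row's last column is the overall score.
import csv

def last_score(summary_csv: str) -> float:
    with open(summary_csv, newline='') as f:
        rows = [row for row in csv.reader(f) if row]
    return float(rows[-1][-1])            # last column of the last row

score = last_score('summary.csv')         # illustrative path
# The workflow compares the integer part, like ${score%.*} in bash.
assert 88 <= int(score) <= 89, f'score {score} not between 88 and 89'
```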

.github/workflows/pr-stage-check.yml (new file)
@ -0,0 +1,121 @@
name: pr_stage_test
on:
pull_request:
paths-ignore:
- 'README.md'
- 'README_zh-CN.md'
- 'docs/**'
- 'configs/**'
- 'tools/**'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
build:
runs-on: ubuntu-22.04
strategy:
matrix:
python-version: ['3.10']
include:
- torch: 2.5.1
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install PyTorch
run: pip install torch==${{matrix.torch}} -f https://download.pytorch.org/whl/cpu/torch_stable.html
- name: Install system dependencies
run: |
sudo sed -i '$ a deb http://th.archive.ubuntu.com/ubuntu jammy main' /etc/apt/sources.list
sudo apt-get update && sudo apt-get install -y libc6 libffi-dev libncursesw6 wget unzip
- name: Upgrade pip
run: python -m pip install pip --upgrade
- name: Install opencompass dependencies
run: |
python -m pip install -r requirements.txt
- name: Build and install
run: python -m pip install -e .
- name: Prepare dataset
run: |
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
- name: Dry run test
run: |
python run.py --models hf_opt_125m --datasets siqa_gen winograd_ppl --dry-run
build_cu117:
runs-on: ubuntu-22.04
container:
image: nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04
strategy:
matrix:
python-version: ['3.10']
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Fetch GPG keys
run: |
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub
- name: Install Python-dev
run: apt-get update && apt-get install -y python${{matrix.python-version}}-dev
if: ${{matrix.python-version != 3.10}}
- name: Install system dependencies
run: |
apt-get update
apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libxrender-dev libc6 libc6-dev
sed -i '$ a deb http://th.archive.ubuntu.com/ubuntu jammy main' /etc/apt/sources.list
apt-get update && apt-get install -y libc6 libffi-dev libncursesw6 wget unzip
- name: Upgrade pip
run: python -m pip install pip --upgrade
- name: Install opencompass dependencies
run: |
python -m pip install -r requirements.txt
- name: Build and install
run: python -m pip install -e .
- name: Prepare dataset
run: |
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
- name: Dry run test
run: |
python run.py --models hf_opt_125m --datasets siqa_gen winograd_ppl --dry-run
build_windows:
runs-on: windows-2022
strategy:
matrix:
python-version: ['3.10']
platform: [cpu]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Upgrade pip
run: python -m pip install pip --upgrade
- name: Install PyTorch
run: pip install torch==2.5.1 -f https://download.pytorch.org/whl/cpu/torch_stable.html
- name: Install opencompass dependencies
run: |
pip install -r requirements.txt
- name: Build and install
run: pip install -e .
- name: Prepare dataset
run: |
Invoke-WebRequest -Uri https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip -OutFile OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
- name: Dry run test
run: |
python run.py --models hf_opt_125m --datasets siqa_gen winograd_ppl --dry-run

@ -1,21 +1,26 @@
name: deploy
on: push
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
on:
push:
workflow_dispatch:
inputs:
confirm_publish:
description: 'Type YES to confirm publishing to PyPI'
required: true
type: string
jobs:
build-n-publish:
runs-on: ubuntu-latest
if: startsWith(github.event.ref, 'refs/tags')
if: |
github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags') ||
(github.event_name == 'workflow_dispatch' && inputs.confirm_publish == 'YES')
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
uses: actions/setup-python@v1
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: 3.7
python-version: '3.10'
- name: Build lagent
run: |
pip install wheel

.gitignore
@ -1,16 +1,22 @@
.DS_Store
output_*/
outputs/
scripts/
icl_inference_output/
.vscode/
tmp/
configs/eval_subjective_alignbench_test.py
configs/openai_key.py
configs/secrets.py
configs/datasets/log.json
configs/eval_debug*.py
configs/viz_*.py
configs/**/*_bkup.py
opencompass/**/*_bkup.py
data
work_dirs
outputs
models/*
configs/internal/
# Byte-compiled / optimized / DLL files
__pycache__/
@ -89,7 +95,41 @@ docs/zh_cn/_build/
# sft config ignore list
configs/sft_cfg/*B_*
configs/sft_cfg/1B/*
configs/sft_cfg/7B/*
configs/sft_cfg/20B/*
configs/sft_cfg/60B/*
configs/sft_cfg/100B/*
configs/cky/
configs/_internal_legacy*
# in case llama clone in the opencompass
llama/
# in case ilagent clone in the opencompass
ilagent/
# ignore the config file for criticbench evaluation
configs/sft_cfg/criticbench_eval/*
# path of turbomind's model after runing `lmdeploy.serve.turbomind.deploy`
turbomind/
# cibench output
*.db
*.pth
*.pt
*.onnx
*.gz
*.gz.*
*.png
*.txt
*.jpg
*.json
*.jsonl
*.csv
*.npy
*.c
# aliyun
core.*

@ -7,8 +7,8 @@ assign:
scedule:
'*/1 * * * *'
assignees:
- Leymore
- bittersweet1999
- yingfhu
- kennymckormick
- liushz
- MaiziXiao
- acylam
- tonysy

@ -1,28 +1,50 @@
exclude: |
(?x)^(
tests/data/|
tests/dataset/|
opencompass/models/internal/|
opencompass/utils/internal/|
opencompass/openicl/icl_evaluator/hf_metrics/|
opencompass/datasets/lawbench/utils|
opencompass/datasets/lawbench/evaluation_functions/
opencompass/datasets/lawbench/evaluation_functions/|
opencompass/datasets/medbench/|
opencompass/datasets/teval/|
opencompass/datasets/NPHardEval/|
opencompass/datasets/TheoremQA|
opencompass/datasets/subjective/mtbench101.py|
docs/zh_cn/advanced_guides/compassbench_intro.md |
docs/zh_cn/advanced_guides/compassbench_v2_0.md |
opencompass/utils/datasets.py |
opencompass/utils/datasets_info.py
)
repos:
- repo: https://gitee.com/openmmlab/mirrors-flake8
rev: 5.0.4
hooks:
- id: flake8
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-isort
rev: 5.11.5
hooks:
- id: isort
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-yapf
rev: v0.32.0
hooks:
- id: yapf
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-codespell
rev: v2.2.1
hooks:
@ -30,7 +52,9 @@ repos:
exclude: |
(?x)^(
.*\.jsonl|
configs/
.*\.md.template|
opencompass/configs/ |
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-pre-commit-hooks
rev: v4.3.0
@ -40,7 +64,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/
)
- id: check-yaml
- id: end-of-file-fixer
@ -48,18 +71,14 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/
)
- id: requirements-txt-fixer
- id: double-quote-string-fixer
exclude: configs/
- id: check-merge-conflict
- id: fix-encoding-pragma
args: ["--remove"]
- id: mixed-line-ending
args: ["--fix=lf"]
- id: mixed-line-ending
args: ["--fix=lf"]
- repo: https://gitee.com/openmmlab/mirrors-mdformat
rev: 0.7.9
hooks:
@ -83,7 +102,25 @@ repos:
language: script
pass_filenames: true
require_serial: true
files: ^configs/datasets
files: ^opencompass/configs/datasets
- repo: local
hooks:
- id: update-dataset-suffix-package
name: dataset suffix updater (package)
entry: ./tools/update_dataset_suffix.py
language: script
pass_filenames: false
# require_serial: true
# files: ^opencompass/configs/datasets
args:
- --root_folder
- opencompass/configs/datasets
- repo: https://gitee.com/mirrors/gitleaks
rev: v8.23.1
hooks:
- id: gitleaks
entry: "gitleaks dir"
args: ["--verbose", "--redact=50"]
# - repo: https://github.com/open-mmlab/pre-commit-hooks
# rev: v0.2.0 # Use the ref you want to point at
# hooks:

@ -1,28 +1,51 @@
exclude: |
(?x)^(
tests/data/|
tests/dataset/|
opencompass/models/internal/|
opencompass/utils/internal/|
opencompass/openicl/icl_evaluator/hf_metrics/|
opencompass/datasets/lawbench/utils|
opencompass/datasets/lawbench/evaluation_functions/
opencompass/datasets/lawbench/evaluation_functions/|
opencompass/datasets/medbench/|
opencompass/datasets/matbench/|
opencompass/datasets/teval/|
opencompass/datasets/NPHardEval/|
opencompass/datasets/TheoremQA|
opencompass/datasets/subjective/mtbench101.py|
docs/zh_cn/advanced_guides/compassbench_intro.md |
docs/zh_cn/advanced_guides/compassbench_v2_0.md |
opencompass/utils/datasets.py |
opencompass/utils/datasets_info.py
)
repos:
- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
- id: flake8
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://github.com/PyCQA/isort
rev: 5.11.5
hooks:
- id: isort
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://github.com/pre-commit/mirrors-yapf
rev: v0.32.0
hooks:
- id: yapf
exclude: configs/
exclude: |
(?x)^(
opencompass/configs/|
examples/
)
- repo: https://github.com/codespell-project/codespell
rev: v2.2.1
hooks:
@ -30,7 +53,9 @@ repos:
exclude: |
(?x)^(
.*\.jsonl|
configs/
.*\.md.template|
opencompass/configs/ |
examples/
)
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
@ -40,7 +65,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/
)
- id: check-yaml
- id: end-of-file-fixer
@ -48,18 +72,14 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/
)
- id: requirements-txt-fixer
- id: double-quote-string-fixer
exclude: configs/
- id: check-merge-conflict
- id: fix-encoding-pragma
args: ["--remove"]
- id: mixed-line-ending
args: ["--fix=lf"]
- id: mixed-line-ending
args: ["--fix=lf"]
- repo: https://github.com/executablebooks/mdformat
rev: 0.7.9
hooks:
@ -83,7 +103,25 @@ repos:
language: script
pass_filenames: true
require_serial: true
files: ^configs/datasets
files: ^opencompass/configs/datasets
- repo: local
hooks:
- id: update-dataset-suffix-package
name: dataset suffix updater (package)
entry: ./tools/update_dataset_suffix.py
language: script
pass_filenames: false
# require_serial: true
# files: ^opencompass/configs/datasets
args:
- --root_folder
- opencompass/configs/datasets
- repo: https://github.com/gitleaks/gitleaks
rev: v8.23.1
hooks:
- id: gitleaks
entry: "gitleaks dir"
args: ["--verbose", "--redact=50"]
# - repo: https://github.com/open-mmlab/pre-commit-hooks
# rev: v0.2.0 # Use the ref you want to point at
# hooks:

MANIFEST.in (new file)
@ -0,0 +1,3 @@
recursive-include opencompass/configs *.py *.yml *.json *.txt *.md
recursive-include opencompass/openicl/icl_evaluator/hf_metrics *.py
recursive-include opencompass/datasets *.py *.yml *.json *.txt *.md *.yaml

README.md
@ -3,35 +3,44 @@
<br />
<br />
[![docs](https://readthedocs.org/projects/opencompass/badge)](https://opencompass.readthedocs.io/en)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](https://github.com/open-compass/opencompass/blob/main/LICENSE)
[![][github-release-shield]][github-release-link]
[![][github-releasedate-shield]][github-releasedate-link]
[![][github-contributors-shield]][github-contributors-link]<br>
[![][github-forks-shield]][github-forks-link]
[![][github-stars-shield]][github-stars-link]
[![][github-issues-shield]][github-issues-link]
[![][github-license-shield]][github-license-link]
<!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
[🌐Website](https://opencompass.org.cn/) |
[📖CompassHub](https://hub.opencompass.org.cn/home) |
[📊CompassRank](https://rank.opencompass.org.cn/home) |
[📘Documentation](https://opencompass.readthedocs.io/en/latest/) |
[🛠Installation](https://opencompass.readthedocs.io/en/latest/get_started/installation.html) |
[🤔Reporting Issues](https://github.com/open-compass/opencompass/issues/new/choose)
English | [简体中文](README_zh-CN.md)
[![][github-trending-shield]][github-trending-url]
</div>
<p align="center">
👋 join us on <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=opencompass" target="_blank">WeChat</a>
</p>
## 📣 OpenCompass 2023 LLM Annual Leaderboard
> \[!IMPORTANT\]
>
> **Star Us**, and you will receive all release notifications from GitHub without any delay ~ ⭐️
We are honored to have witnessed the tremendous progress of artificial general intelligence together with the community in the past year, and we are also very pleased that **OpenCompass** can help numerous developers and users.
We announce the launch of the **OpenCompass 2023 LLM Annual Leaderboard** plan. We expect to release the annual leaderboard of the LLMs in January 2024, systematically evaluating the performance of LLMs in various capabilities such as language, knowledge, reasoning, creation, long-text, and agents.
At that time, we will release rankings for both open-source models and commercial API models, aiming to provide a comprehensive, objective, and neutral reference for the industry and research community.
We sincerely invite various large models to join the OpenCompass to showcase their performance advantages in different fields. At the same time, we also welcome researchers and developers to provide valuable suggestions and contributions to jointly promote the development of the LLMs. If you have any questions or needs, please feel free to [contact us](mailto:opencompass@pjlab.org.cn). In addition, relevant evaluation contents, performance statistics, and evaluation methods will be open-source along with the leaderboard release.
Let's look forward to the release of the OpenCompass 2023 LLM Annual Leaderboard!
<details>
<summary><kbd>Star History</kbd></summary>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
<img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
</picture>
</details>
## 🧭 Welcome
@ -44,24 +53,232 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
🔥🔥🔥 We are delighted to announce that **OpenCompass has been recommended by Meta AI**. Click [Get Started](https://ai.meta.com/llama/get-started/#validation) of Llama for more information.
> **Attention**<br />
> We have launched the OpenCompass Collaboration project; contributions of diverse evaluation benchmarks to OpenCompass are welcome!
> Click [Issue](https://github.com/open-compass/opencompass/issues/248) for more information.
> Let's work together to build a more powerful OpenCompass toolkit!
> Breaking Change Notice: In version 0.4.0, we are consolidating all AMOTIC configuration files (previously located in ./configs/datasets, ./configs/models, and ./configs/summarizers) into the opencompass package. Users are advised to update their configuration references to reflect this structural change.
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.11.22\]** We have supported many API-based models, including **Baidu, ByteDance, Huawei, 360**. Welcome to the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details. 🔥🔥🔥.
- **\[2023.11.20\]** Thanks to [helloyongyang](https://github.com/helloyongyang) for supporting evaluation with [LightLLM](https://github.com/ModelTC/lightllm) as the backend. Welcome to [Evaluation With LightLLM](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lightllm.html) for more details. 🔥🔥🔥.
- **\[2023.11.13\]** We are delighted to announce the release of OpenCompass v0.1.8. This version enables local loading of evaluation benchmarks, thereby eliminating the need for an internet connection. Please note that with this update, **you must re-download all evaluation datasets** to ensure accurate and up-to-date results. 🔥🔥🔥.
- **\[2023.11.06\]** We have supported several API-based models, including **ChatGLM Pro@Zhipu, ABAB-Chat@MiniMax and Xunfei**. Welcome to the [Models](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) section for more details. 🔥🔥🔥.
- **\[2023.10.24\]** We release a new benchmark for evaluating LLMs capabilities of having multi-turn dialogues. Welcome to [BotChat](https://github.com/open-compass/BotChat) for more details.
- **\[2023.09.26\]** We update the leaderboard with [Qwen](https://github.com/QwenLM/Qwen), one of the best-performing open-source models currently available, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.20\]** We update the leaderboard with [InternLM-20B](https://github.com/InternLM/InternLM), welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.19\]** We update the leaderboard with WeMix-LLaMA2-70B/Phi-1.5-1.3B, welcome to our [homepage](https://opencompass.org.cn) for more details.
- **\[2023.09.18\]** We have released [long context evaluation guidance](docs/en/advanced_guides/longeval.md).
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
- **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](examples/eval_babilong.py) and give it a try! 🔥🔥🔥
- **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
- **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) with multiple backend(huggingface/vllm/lmdeploy). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.17\]** We now support OpenAI o1(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.05\]** We now support answer extraction through model post-processing to provide a more accurate representation of the model's capabilities. As part of this update, we have integrated [XFinder](https://github.com/IAAR-Shanghai/xFinder) as our first post-processing model. For more detailed information, please refer to the [documentation](opencompass/utils/postprocessors/xfinder/README.md), and give it a try! 🔥🔥🔥
- **\[2024.08.20\]** OpenCompass now supports the [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists. 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [RULER](https://arxiv.org/pdf/2404.06654). RULER provides an evaluation of long-context including retrieval, multi-hop tracing, aggregation, and question answering through flexible configurations. Check out the [RULER](configs/datasets/ruler/README.md) evaluation config now! 🔥🔥🔥
- **\[2024.08.09\]** We have released the example data and configuration for the CompassBench-202408, welcome to [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) for more details. 🔥🔥🔥
- **\[2024.08.01\]** We supported the [Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315) models. Welcome to try! 🔥🔥🔥
- **\[2024.07.23\]** We supported the [ModelScope](www.modelscope.cn) datasets; you can load them on demand without downloading all the data to your local disk. Welcome to try! 🔥🔥🔥
- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**. You can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
> [More](docs/en/notes/news.md)
## 📊 Leaderboard
We provide [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `opencompass@pjlab.org.cn`.
You can also refer to [CompassAcademic](configs/eval_academic_leaderboard_202412.py) to quickly reproduce the leaderboard results. The currently selected datasets include Knowledge Reasoning (MMLU-Pro/GPQA Diamond), Logical Reasoning (BBH), Mathematical Reasoning (MATH-500, AIME), Code Generation (LiveCodeBench, HumanEval), and Instruction Following (IFEval).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
Below are the steps for quick installation and datasets preparation.
### 💻 Environment Setup
We highly recommend using conda to manage your python environment.
- #### Create your virtual environment
```bash
conda create --name opencompass python=3.10 -y
conda activate opencompass
```
- #### Install OpenCompass via pip
```bash
pip install -U opencompass
## Full installation (with support for more datasets)
# pip install "opencompass[full]"
## Environment with model acceleration frameworks
## Manage different acceleration frameworks using virtual environments
## since they usually have dependency conflicts with each other.
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
## API evaluation (e.g., OpenAI, Qwen)
# pip install "opencompass[api]"
```
- #### Install OpenCompass from source
If you want to use OpenCompass's latest features or develop new features, you can also build it from source:
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
```
### 📂 Data Preparation
You can choose one of the following methods to prepare datasets.
#### Offline Preparation
You can download and extract the datasets with the following commands:
```bash
# Download dataset to data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
#### Automatic Download from OpenCompass
We support automatically downloading datasets from the OpenCompass storage server. You can run the evaluation with the extra `--dry-run` flag to download these datasets.
Currently, the supported datasets are listed [here](https://github.com/open-compass/opencompass/blob/main/opencompass/utils/datasets_info.py#L259). More datasets will be uploaded soon.
#### (Optional) Automatic Download with ModelScope
Alternatively, you can use [ModelScope](www.modelscope.cn) to load the datasets on demand.
Installation:
```bash
pip install modelscope[framework]
export DATASET_SOURCE=ModelScope
```
Then submit the evaluation task without downloading all the data to your local disk. Available datasets include:
```bash
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
Some third-party features, like HumanEval and Llama, may require additional steps to work properly. For detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🏗️ Evaluation
After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can start your first evaluation using OpenCompass!
### Your first evaluation with OpenCompass!
OpenCompass supports setting configurations via the CLI or a Python script. For simple evaluation settings, we recommend the CLI; for more complex evaluations, the script approach is suggested.
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_chat_demo.py
```
You can find more script examples under [examples](./examples) folder.
### API evaluation
OpenCompass, by design, does not really discriminate between open-source models and API models. You can evaluate both model types in the same way, or even within the same configuration.
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass examples/eval_api_demo.py
# You can use o1_mini_2024_09_12/o1_preview_2024_09_12 for o1 models; max_completion_tokens defaults to 8192.
```
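If you prefer the Python-script route for API models, a minimal config might look like the sketch below. The `OpenAI` model class follows the same pattern as the other API-model configs in this repository, while the demo dataset import path and the `key='ENV'` convention are assumptions to verify against your installed version (e.g. with `python tools/list_configs.py`).
```python
# eval_api_sketch.py -- a hypothetical script-style config for an API model.
from mmengine.config import read_base
from opencompass.models import OpenAI

with read_base():
    # Demo GSM8K config shipped with OpenCompass; module path assumed, verify locally.
    from opencompass.configs.datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets

datasets = gsm8k_datasets

models = [
    dict(
        abbr='gpt-4o-2024-05-13',
        type=OpenAI,
        path='gpt-4o-2024-05-13',
        key='ENV',              # expected to read OPENAI_API_KEY from the environment
        query_per_second=1,
        max_out_len=2048,
        batch_size=8,
    ),
]
```
You would then launch it with `opencompass eval_api_sketch.py`, exactly like the demo script above.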
### Accelerated Evaluation
Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
# Python scripts
opencompass examples/eval_lmdeploy_demo.py
```
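Besides the `-a lmdeploy` switch, the backend can also be pinned inside a script config. The sketch below assumes the `TurboMindModelwithChatTemplate` wrapper used by the predefined LMDeploy configs; confirm the class name in `opencompass.models` for your installed version before relying on it.
```python
# Hypothetical script-style model entry backed by LMDeploy/TurboMind.
from opencompass.models import TurboMindModelwithChatTemplate

models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='internlm2_5-1_8b-chat-lmdeploy',
        path='internlm/internlm2_5-1_8b-chat',
        max_out_len=1024,
        batch_size=16,
        run_cfg=dict(num_gpus=1),   # GPUs used for model parallelism
    ),
]
```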
### Supported Models and Datasets
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
#### Supported Models
If a model is not on the list but is supported by the HuggingFace AutoModel class, or is deployed behind an inference engine wrapped with an OpenAI-compatible interface (see [docs](https://opencompass.readthedocs.io/en/latest/advanced_guides/new_model.html) for details), you can also evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
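For a model without a predefined config, you can also declare it directly in a script. The sketch below assumes the `HuggingFacewithChatTemplate` wrapper for chat models; check `opencompass.models` (or the new-model docs linked above) for the exact class available in your version.
```python
# Hypothetical entry for a chat model that only exists on the HuggingFace Hub.
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='my-chat-model',                    # any short name used in result tables
        path='internlm/internlm2_5-1_8b-chat',   # any AutoModel-compatible repo or local path
        max_out_len=1024,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    ),
]
```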
#### Supported Datasets
Currently, OpenCompass provides standard recommended configurations for datasets. Generally, config files ending with `_gen.py` or `_llm_judge_gen.py` point to the recommended config we provide for that dataset. You can refer to the [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for more details.
```bash
# Recommended Evaluation Config based on Rules
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended Evaluation Config based on LLM Judge
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
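The same recommended configs can be pulled into a script via `read_base`, mirroring the CLI names above. The module path and the exported `aime2024_datasets` variable below are assumptions; list the exact names with `python tools/list_configs.py aime2024`.
```python
from mmengine.config import read_base

with read_base():
    # Rule-based recommended config; swap in the *_llmjudge_gen module for the
    # LLM-judge variant. Module path assumed, verify with tools/list_configs.py.
    from opencompass.configs.datasets.aime2024.aime2024_gen import aime2024_datasets

datasets = aime2024_datasets
```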
If you want to evaluate the model on multiple GPUs in a data-parallel manner, you can use `--max-num-worker`.
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
> \[!TIP\]
>
> `--hf-num-gpus` is used for model parallelism (HuggingFace format); `--max-num-worker` is used for data parallelism.
> \[!TIP\]
>
> Configurations with `_ppl` are typically designed for base models.
> Configurations with `_gen` can be used for both base models and chat models.
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
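As a reference for the configuration-file route, a script can also spell out the inference stage explicitly, mirroring the partitioner/runner/task pattern of the API configs shown later in this diff but with a local runner. The config import paths below are assumptions to verify with `python tools/list_configs.py`; treat the whole thing as a sketch and adjust it to your setup.
```python
# Hypothetical end-to-end script config with an explicit inference stage.
from mmengine.config import read_base
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask

with read_base():
    # Import paths assumed; check them with tools/list_configs.py.
    from opencompass.configs.datasets.demo.demo_gsm8k_chat_gen import gsm8k_datasets
    from opencompass.configs.models.hf_internlm.hf_internlm2_5_1_8b_chat import models

datasets = gsm8k_datasets

infer = dict(
    partitioner=dict(type=NaivePartitioner),   # one task per model/dataset pair
    runner=dict(
        type=LocalRunner,                      # run tasks as local subprocesses
        max_num_workers=2,                     # data-parallel workers, cf. --max-num-worker
        task=dict(type=OpenICLInferTask)),
)

work_dir = './outputs/my_eval'                 # where predictions and results are written
```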
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📣 OpenCompass 2.0
We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).
![oc20](https://github.com/tonysy/opencompass/assets/7881589/90dbe1c0-c323-470a-991e-2b37ab5350b2)
**CompassRank** has been significantly enhanced into the leaderboards that now incorporates both open-source benchmarks and proprietary benchmarks. This upgrade allows for a more comprehensive evaluation of models across the industry.
**CompassHub** presents a pioneering benchmark browser interface, designed to simplify and expedite the exploration and utilization of an extensive array of benchmarks for researchers and practitioners alike. To enhance the visibility of your own benchmark within the community, we warmly invite you to contribute it to CompassHub. You may initiate the submission process by clicking [here](https://hub.opencompass.org.cn/dataset-submit).
**CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. You are welcome to try our toolkits in your research and products.
## ✨ Introduction
![image](https://github.com/open-compass/opencompass/assets/22607038/f45fe125-4aed-4f8c-8fe8-df4efb41a8ea)
@ -78,350 +295,15 @@ OpenCompass is a one-stop platform for large model evaluation, aiming to provide
- **Experiment management and reporting mechanism**: Use config files to fully record each experiment, and support real-time reporting of results.
## 📊 Leaderboard
We provide [OpenCompass Leaderboard](https://opencompass.org.cn/rank) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `opencompass@pjlab.org.cn`.
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
Below are the steps for quick installation and datasets preparation.
### 💻 Environment Setup
#### Open-source Models with GPU
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
#### API Models with CPU-only
```bash
conda create -n opencompass python=3.10 pytorch torchvision torchaudio cpuonly -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# also please install the required packages via `pip install -r requirements/api.txt` for API models if needed.
```
### 📂 Data Preparation
```bash
# Download dataset to data/ folder
wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip
unzip OpenCompassData-core-20231110.zip
```
Some third-party features, like HumanEval and Llama, may require additional steps to work properly. For detailed steps, please refer to the [Installation Guide](https://opencompass.readthedocs.io/en/latest/get_started/installation.html).
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🏗️ Evaluation
After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared, you can evaluate the performance of the LLaMA-7b model on the MMLU and C-Eval datasets using the following command:
```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
```bash
python run.py --datasets ceval_ppl mmlu_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace model path
--model-kwargs device_map='auto' \ # Arguments for model construction
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # Arguments for tokenizer construction
--max-out-len 100 \ # Maximum number of tokens generated
--max-seq-len 2048 \ # Maximum sequence length the model can accept
--batch-size 8 \ # Batch size
--no-batch-padding \ # Don't enable batch padding, infer through for loop to avoid performance loss
--num-gpus 1 # Number of minimum required GPUs
```
> **Note**<br />
> To run the command above, you will need to remove the comments starting from `# ` first.
Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Dataset Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>
We have supported a statistical list of all datasets that can be used on this platform in the documentation on the OpenCompass website.
- WiC
- SummEdits
You can quickly find the dataset you need from the list through sorting, filtering, and searching functions.
</details>
In addition, we provide a recommended configuration for each dataset, and some datasets also support LLM Judge-based configurations.
<details open>
<summary><b>Idiom Learning</b></summary>
- CHID
</details>
<details open>
<summary><b>Semantic Similarity</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>Coreference Resolution</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>Translation</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>Multi-language Summary</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>Mathematical Reasoning</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>Theorem Application</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>Comprehensive Reasoning</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>Medical Examinations</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td>
<td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>Content Summary</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>Content Analysis</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>Robustness</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
## OpenCompass Ecosystem
Please refer to the dataset statistics chapter of [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for details.
<p align="right"><a href="#top">🔝Back to top</a></p>
@ -443,23 +325,27 @@ Through the command line or configuration files, OpenCompass also supports evalu
<tr valign="top">
<td>
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [Baichuan](https://github.com/baichuan-inc)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)
- [ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA3](https://github.com/meta-llama/llama3)
- [Qwen](https://github.com/QwenLM/Qwen)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- ...
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [Yi](https://github.com/01-ai/Yi)
- ……
</td>
<td>
- OpenAI
- Gemini
- Claude
- ZhipuAI(ChatGLM)
- Baichuan
@ -482,25 +368,39 @@ Through the command line or configuration files, OpenCompass also supports evalu
## 🔜 Roadmap
- [ ] Subjective Evaluation
- [ ] Release CompassAreana
- [ ] Subjective evaluation dataset.
- [x] Subjective Evaluation
- [x] Release CompassArena.
- [x] Subjective evaluation.
- [x] Long-context
- [ ] Long-context evaluation with extensive datasets.
- [x] Long-context evaluation with extensive datasets.
- [ ] Long-context leaderboard.
- [ ] Coding
- [x] Coding
- [ ] Coding evaluation leaderboard.
- [x] Non-python language evaluation service.
- [ ] Agent
- [ ] Support various agenet framework.
- [ ] Evaluation of tool use of the LLMs.
- [x] Agent
- [ ] Support various agent frameworks.
- [x] Evaluation of tool use of the LLMs.
- [x] Robustness
- [x] Support various attack method
- [x] Support various attack methods.
## 👷‍♂️ Contributing
We appreciate all contributions to improving OpenCompass. Please refer to the [contributing guideline](https://opencompass.readthedocs.io/en/latest/notes/contribution_guide.html) for the best practice.
<!-- Copy-paste in your Readme.md file -->
<!-- Made with [OSS Insight](https://ossinsight.io/) -->
<a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
</th>
</tr>
</table>
</a>
## 🤝 Acknowledgements
Some code in this project is cited and modified from [OpenICL](https://github.com/Shark-NLP/OpenICL).
@ -519,3 +419,20 @@ Some datasets and prompt implementations are modified from [chain-of-thought-hub
```
<p align="right"><a href="#top">🔝Back to top</a></p>
[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
[github-forks-link]: https://github.com/open-compass/opencompass/network/members
[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
[github-issues-link]: https://github.com/open-compass/opencompass/issues
[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
[github-release-link]: https://github.com/open-compass/opencompass/releases
[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
[github-trending-url]: https://trendshift.io/repositories/6630
@ -3,35 +3,44 @@
<br />
<br />
[![docs](https://readthedocs.org/projects/opencompass/badge)](https://opencompass.readthedocs.io/zh_CN)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](https://github.com/open-compass/opencompass/blob/main/LICENSE)
[![][github-release-shield]][github-release-link]
[![][github-releasedate-shield]][github-releasedate-link]
[![][github-contributors-shield]][github-contributors-link]<br>
[![][github-forks-shield]][github-forks-link]
[![][github-stars-shield]][github-stars-link]
[![][github-issues-shield]][github-issues-link]
[![][github-license-shield]][github-license-link]
<!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
[🌐Website](https://opencompass.org.cn/) |
[📘Documentation](https://opencompass.readthedocs.io/zh_CN/latest/index.html) |
[🛠Installation](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html) |
[🤔Reporting Issues](https://github.com/open-compass/opencompass/issues/new/choose)
[🌐官方网站](https://opencompass.org.cn/) |
[📖数据集社区](https://hub.opencompass.org.cn/home) |
[📊性能榜单](https://rank.opencompass.org.cn/home) |
[📘文档教程](https://opencompass.readthedocs.io/zh_CN/latest/index.html) |
[🛠️安装](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html) |
[🤔报告问题](https://github.com/open-compass/opencompass/issues/new/choose)
[English](/README.md) | 简体中文
[![][github-trending-shield]][github-trending-url]
</div>
<p align="center">
👋 加入我们的 <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a><a href="https://r.vansin.top/?r=opencompass" target="_blank">微信社区</a>
</p>
## 📣 2023 年度榜单计划
> \[!IMPORTANT\]
>
> **收藏项目**,你将能第一时间获取 OpenCompass 的最新动态~⭐️
我们有幸与社区共同见证了通用人工智能在过去一年里的巨大进展也非常高兴OpenCompass能够帮助广大大模型开发者和使用者。
我们宣布将启动**OpenCompass 2023年度大模型榜单**发布计划。我们预计将于2024年1月发布大模型年度榜单系统性评估大模型在语言、知识、推理、创作、长文本和智能体等多个能力维度的表现。
届时我们将发布开源模型和商业API模型能力榜单以期为业界提供一份**全面、客观、中立**的参考。
我们诚挚邀请各类大模型接入OpenCompass评测体系以展示其在各个领域的性能优势。同时也欢迎广大研究者、开发者向我们提供宝贵的意见和建议共同推动大模型领域的发展。如有任何问题或需求请随时[联系我们](mailto:opencompass@pjlab.org.cn)。此外,相关评测内容,性能数据,评测方法也将随榜单发布一并开源。
<p>让我们共同期待OpenCompass 2023年度大模型榜单的发布期待各大模型在榜单上的精彩表现</p>
<details>
<summary><kbd>Star History</kbd></summary>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
<img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
</picture>
</details>
## 🧭 欢迎
@ -41,27 +50,228 @@
🚩🚩🚩 欢迎加入 OpenCompass我们目前**招聘全职研究人员/工程师和实习生**。如果您对 LLM 和 OpenCompass 充满热情,请随时通过[电子邮件](mailto:zhangsongyang@pjlab.org.cn)与我们联系。我们非常期待与您交流!
🔥🔥🔥 祝贺 **OpenCompass 作为大模型标准测试工具被Meta AI官方推荐**, 点击 Llama 的 [入门文档](https://ai.meta.com/llama/get-started/#validation) 获取更多信息.
🔥🔥🔥 祝贺 **OpenCompass 作为大模型标准测试工具被Meta AI官方推荐**, 点击 Llama 的 [入门文档](https://ai.meta.com/llama/get-started/#validation) 获取更多信息
> **注意**<br />
> 我们正式启动 OpenCompass 共建计划,诚邀社区用户为 OpenCompass 提供更具代表性和可信度的客观评测数据集!
> 点击 [Issue](https://github.com/open-compass/opencompass/issues/248) 获取更多数据集.
> 让我们携手共进,打造功能强大易用的大模型评测平台!
> 重要通知:从 v0.4.0 版本开始,所有位于 ./configs/datasets、./configs/models 和 ./configs/summarizers 目录下的 AMOTIC 配置文件将迁移至 opencompass 包中。请及时更新您的配置文件路径。
## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2023.11.22\]** 我们已经支持了多个基于API的模型包括**百度、字节跳动、华为、360**。欢迎查阅[模型](https://opencompass.readthedocs.io/en/latest/user_guides/models.html)部分以获取更多详细信息。🔥🔥🔥。
- **\[2023.11.20\]** 感谢[helloyongyang](https://github.com/helloyongyang)支持使用[LightLLM](https://github.com/ModelTC/lightllm)作为后端进行评估。欢迎查阅[使用LightLLM进行评估](https://opencompass.readthedocs.io/en/latest/advanced_guides/evaluation_lightllm.html)以获取更多详细信息。🔥🔥🔥。
- **\[2023.11.13\]** 我们很高兴地宣布发布 OpenCompass v0.1.8 版本。此版本支持本地加载评估基准,从而无需连接互联网。请注意,随着此更新的发布,**您需要重新下载所有评估数据集**,以确保结果准确且最新。🔥🔥🔥。
- **\[2023.11.06\]** 我们已经支持了多个基于 API 的模型包括ChatGLM Pro@智谱清言、ABAB-Chat@MiniMax 和讯飞。欢迎查看 [模型](https://opencompass.readthedocs.io/en/latest/user_guides/models.html) 部分以获取更多详细信息。🔥🔥🔥。
- **\[2023.10.24\]** 我们发布了一个全新的评测集BotChat用于评估大语言模型的多轮对话能力欢迎查看 [BotChat](https://github.com/open-compass/BotChat) 获取更多信息.
- **\[2023.09.26\]** 我们在评测榜单上更新了[Qwen](https://github.com/QwenLM/Qwen), 这是目前表现最好的开源模型之一, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.
- **\[2023.09.20\]** 我们在评测榜单上更新了[InternLM-20B](https://github.com/InternLM/InternLM), 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.
- **\[2023.09.19\]** 我们在评测榜单上更新了WeMix-LLaMA2-70B/Phi-1.5-1.3B, 欢迎访问[官方网站](https://opencompass.org.cn)获取详情.
- **\[2023.09.18\]** 我们发布了[长文本评测指引](docs/zh_cn/advanced_guides/longeval.md).
- **\[2025.04.01\]** OpenCompass 现已支持 `CascadeEvaluator`,允许多个评估器按顺序工作,可以为更复杂的评估场景创建自定义评估流程,查看[文档](docs/zh_cn/advanced_guides/llm_judge.md)了解具体用法!🔥🔥🔥
- **\[2025.03.11\]** 现已支持 `SuperGPQA` 覆盖285 个研究生学科的知识能力评测,欢迎尝试!🔥🔥🔥
- **\[2025.02.28\]** 我们为 `DeepSeek-R1` 系列模型添加了教程,请查看 [评估推理模型](docs/zh_cn/user_guides/deepseek_r1.md) 了解更多详情!🔥🔥🔥
- **\[2025.02.15\]** 我们新增了两个实用的评测工具用于LLM作为评判器的`GenericLLMEvaluator`和用于数学推理评估的`MATHVerifyEvaluator`。查看[LLM评判器](docs/zh_cn/advanced_guides/llm_judge.md)和[数学能力评测](docs/zh_cn/advanced_guides/general_math.md)文档了解更多详情!🔥🔥🔥
- **\[2025.01.16\]** 我们现已支持 [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) 模型,该模型在推理、知识类任务上取得同量级最优性能,欢迎尝试。
- **\[2024.12.17\]** 我们提供了12月CompassAcademic学术榜单评估脚本 [CompassAcademic](configs/eval_academic_leaderboard_202412.py),你可以通过简单地配置复现官方评测结果。
- **\[2024.10.14\]** 现已支持OpenAI多语言问答数据集[MMMLU](https://huggingface.co/datasets/openai/MMMLU),欢迎尝试! 🔥🔥🔥
- **\[2024.09.19\]** 现已支持[Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) ,可以使用多种推理后端(huggingface/vllm/lmdeploy), 欢迎尝试! 🔥🔥🔥
- **\[2024.09.05\]** 现已支持OpenAI o1 模型(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`), 欢迎尝试! 🔥🔥🔥
- **\[2024.09.05\]** OpenCompass 现在支持通过模型后处理来进行答案提取,以更准确地展示模型的能力。作为此次更新的一部分,我们集成了 [XFinder](https://github.com/IAAR-Shanghai/xFinder) 作为首个后处理模型。具体信息请参阅 [文档](opencompass/utils/postprocessors/xfinder/README.md),欢迎尝试! 🔥🔥🔥
- **\[2024.08.20\]** OpenCompass 现已支持 [SciCode](https://github.com/scicode-bench/SciCode): A Research Coding Benchmark Curated by Scientists。 🔥🔥🔥
- **\[2024.08.16\]** OpenCompass 现已支持全新的长上下文语言模型评估基准——[RULER](https://arxiv.org/pdf/2404.06654)。RULER 通过灵活的配置,提供了对长上下文包括检索、多跳追踪、聚合和问答等多种任务类型的评测,欢迎访问[RULER](configs/datasets/ruler/README.md)。🔥🔥🔥
- **\[2024.07.23\]** 我们支持了[Gemma2](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)模型,欢迎试用!🔥🔥🔥
- **\[2024.07.23\]** 我们支持了[ModelScope](www.modelscope.cn)数据集,您可以按需加载,无需事先下载全部数据到本地,欢迎试用!🔥🔥🔥
- **\[2024.07.17\]** 我们发布了CompassBench-202407榜单的示例数据和评测规则敬请访问 [CompassBench](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/compassbench_intro.html) 获取更多信息。 🔥🔥🔥
- **\[2024.07.17\]** 我们正式发布 NeedleBench 的[技术报告](http://arxiv.org/abs/2407.11963)。诚邀您访问我们的[帮助文档](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html)进行评估。🔥🔥🔥
- **\[2024.07.04\]** OpenCompass 现已支持 InternLM2.5 它拥有卓越的推理性能、有效支持百万字超长上下文以及工具调用能力整体升级,欢迎访问[OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) 和 [InternLM](https://github.com/InternLM/InternLM) .🔥🔥🔥.
- **\[2024.06.20\]** OpenCompass 现已支持一键切换推理加速后端助力评测过程更加高效。除了默认的HuggingFace推理后端外还支持了常用的 [LMDeploy](https://github.com/InternLM/lmdeploy) 和 [vLLM](https://github.com/vllm-project/vllm) ,支持命令行一键切换和部署 API 加速服务两种方式,详细使用方法见[文档](docs/zh_cn/advanced_guides/accelerator_intro.md)。欢迎试用!🔥🔥🔥.
> [更多](docs/zh_cn/notes/news.md)
## 📊 性能榜单
我们将陆续提供开源模型和 API 模型的具体性能榜单,请见 [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`.
你也可以参考[CompassAcademic](configs/eval_academic_leaderboard_202412.py),快速地复现榜单的结果,目前选取的数据集包括 综合知识推理 (MMLU-Pro/GPQA Diamond) ,逻辑推理 (BBH) ,数学推理 (MATH-500, AIME) ,代码生成 (LiveCodeBench, HumanEval) ,指令跟随 (IFEval) 。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🛠️ 安装指南
下面提供了快速安装和数据集准备的步骤。
### 💻 环境搭建
我们强烈建议使用 `conda` 来管理您的 Python 环境。
- #### 创建虚拟环境
```bash
conda create --name opencompass python=3.10 -y
conda activate opencompass
```
- #### 通过pip安装OpenCompass
```bash
# 支持绝大多数数据集及模型
pip install -U opencompass
# 完整安装(支持更多数据集)
# pip install "opencompass[full]"
# 模型推理后端,由于这些推理后端通常存在依赖冲突,建议使用不同的虚拟环境来管理它们。
# pip install "opencompass[lmdeploy]"
# pip install "opencompass[vllm]"
# API 测试(例如 OpenAI、Qwen
# pip install "opencompass[api]"
```
- #### 基于源码安装OpenCompass
如果希望使用 OpenCompass 的最新功能,也可以从源代码构建它:
```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# pip install -e ".[full]"
# pip install -e ".[vllm]"
```
### 📂 数据准备
#### 提前离线下载
OpenCompass支持使用本地数据集进行评测数据集的下载和解压可以通过以下命令完成
```bash
# 下载数据集到 data/ 处
wget https://github.com/open-compass/opencompass/releases/download/0.2.2.rc1/OpenCompassData-core-20240207.zip
unzip OpenCompassData-core-20240207.zip
```
#### 从 OpenCompass 自动下载
我们已经支持从OpenCompass存储服务器自动下载数据集。您可以通过额外的 `--dry-run` 参数来运行评估以下载这些数据集。
目前支持的数据集列表在[这里](https://github.com/open-compass/opencompass/blob/main/opencompass/utils/datasets_info.py#L259)。更多数据集将会很快上传。
#### (可选) 使用 ModelScope 自动下载
另外,您还可以使用[ModelScope](www.modelscope.cn)来加载数据集:
环境准备:
```bash
pip install modelscope
export DATASET_SOURCE=ModelScope
```
配置好环境后,无需下载全部数据,直接提交评测任务即可。目前支持的数据集有:
```bash
humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ceval, math, LCSTS, Xsum, winogrande, openbookqa, AGIEval, gsm8k, nq, race, siqa, mbpp, mmlu, hellaswag, ARC, BBH, xstory_cloze, summedits, GAOKAO-BENCH, OCNLI, cmnli
```
有部分第三方功能,如 Humaneval 以及 Llama,可能需要额外步骤才能正常运行,详细步骤请参考[安装指南](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🏗️ ️评测
在确保按照上述步骤正确安装了 OpenCompass 并准备好了数据集之后,现在您可以开始使用 OpenCompass 进行首次评估!
- ### 首次评测
OpenCompass 支持通过命令行界面 (CLI) 或 Python 脚本来设置配置。对于简单的评估设置,我们推荐使用 CLI而对于更复杂的评估则建议使用脚本方式。你可以在examples文件夹下找到更多脚本示例。
```bash
# 命令行界面 (CLI)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass examples/eval_chat_demo.py
```
你可以在[examples](./examples) 文件夹下找到更多的脚本示例。
- ### API评测
OpenCompass 在设计上并不区分开源模型与 API 模型。您可以以相同的方式或甚至在同一设置中评估这两种类型的模型。
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# 命令行界面 (CLI)
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass examples/eval_api_demo.py
# 现已支持 o1_mini_2024_09_12/o1_preview_2024_09_12 模型, 默认情况下 max_completion_tokens=8192.
```
- ### 推理后端
另外,如果您想使用除 HuggingFace 之外的推理后端来进行加速评估,比如 LMDeploy 或 vLLM可以通过以下命令进行。请确保您已经为所选的后端安装了必要的软件包并且您的模型支持该后端的加速推理。更多信息请参阅关于推理加速后端的文档 [这里](docs/zh_cn/advanced_guides/accelerator_intro.md)。以下是使用 LMDeploy 的示例:
```bash
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```
- ### 支持的模型与数据集
OpenCompass 预定义了许多模型和数据集的配置,你可以通过 [工具](./docs/zh_cn/tools.md#ListConfigs) 列出所有可用的模型和数据集配置。
```bash
# 列出所有配置
python tools/list_configs.py
# 列出所有跟 llama 及 mmlu 相关的配置
python tools/list_configs.py llama mmlu
```
#### 支持的模型
如果模型不在列表中,但支持 Huggingface AutoModel 类或支持针对 OpenAI 接口的推理引擎封装(详见[官方文档](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_model.html)),您仍然可以使用 OpenCompass 对其进行评估。欢迎您贡献维护 OpenCompass 支持的模型和数据集列表。
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
#### 支持的数据集
目前OpenCompass针对数据集给出了标准的推荐配置。通常`_gen.py`或`_llm_judge_gen.py`为结尾的配置文件将指向我们为该数据集提供的推荐配置。您可以参阅[官方文档](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) 的数据集统计章节来获取详细信息。
```bash
# 基于规则的推荐配置
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# 基于LLM Judge的推荐配置
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
此外,如果你想在多块 GPU 上使用模型进行推理,您可以使用 `--max-num-worker` 参数。
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
> \[!TIP\]
>
> `--hf-num-gpus` 用于模型并行huggingface 格式),`--max-num-worker` 用于数据并行。
> \[!TIP\]
>
> configuration with `_ppl` is designed for base model typically.
> 带 `_ppl` 的配置通常为基础模型设计。
> 带 `_gen` 的配置可以同时用于基础模型和对话模型。
通过命令行或配置文件OpenCompass 还支持评测 API 或自定义模型,以及更多样化的评测策略。请阅读[快速开始](https://opencompass.readthedocs.io/zh_CN/latest/get_started/quick_start.html)了解如何运行一个评测任务。
更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📣 OpenCompass 2.0
我们很高兴发布 OpenCompass 司南 2.0 大模型评测体系,它主要由三大核心模块构建而成:[CompassKit](https://github.com/open-compass)、[CompassHub](https://hub.opencompass.org.cn/home)以及[CompassRank](https://rank.opencompass.org.cn/home)。
**CompassRank** 系统进行了重大革新与提升,现已成为一个兼容并蓄的排行榜体系,不仅囊括了开源基准测试项目,还包含了私有基准测试。此番升级极大地拓宽了对行业内各类模型进行全面而深入测评的可能性。
**CompassHub** 创新性地推出了一个基准测试资源导航平台其设计初衷旨在简化和加快研究人员及行业从业者在多样化的基准测试库中进行搜索与利用的过程。为了让更多独具特色的基准测试成果得以在业内广泛传播和应用我们热忱欢迎各位将自定义的基准数据贡献至CompassHub平台。只需轻点鼠标通过访问[这里](https://hub.opencompass.org.cn/dataset-submit),即可启动提交流程。
**CompassKit** 是一系列专为大型语言模型和大型视觉-语言模型打造的强大评估工具合集,它所提供的全面评测工具集能够有效地对这些复杂模型的功能性能进行精准测量和科学评估。在此,我们诚挚邀请您在学术研究或产品研发过程中积极尝试运用我们的工具包,以助您取得更加丰硕的研究成果和产品优化效果。
## ✨ 介绍
![image](https://github.com/open-compass/opencompass/assets/22607038/30bcb2e2-3969-4ac5-9f29-ad3f4abb4f3b)
@ -80,350 +290,13 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
- **灵活化拓展**想增加新模型或数据集想要自定义更高级的任务分割策略甚至接入新的集群管理系统OpenCompass 的一切均可轻松扩展!
## 📊 性能榜单
我们将陆续提供开源模型和API模型的具体性能榜单请见 [OpenCompass Leaderboard](https://opencompass.org.cn/rank) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`.
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🛠️ 安装
下面展示了快速安装以及准备数据集的步骤。
### 💻 环境配置
#### 面向开源模型的GPU环境
```bash
conda create --name opencompass python=3.10 pytorch torchvision pytorch-cuda -c nvidia -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
```
#### 面向API模型测试的CPU环境
```bash
conda create -n opencompass python=3.10 pytorch torchvision torchaudio cpuonly -c pytorch -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
# 如果需要使用各个API模型请通过 `pip install -r requirements/api.txt` 安装API模型的相关依赖
```
### 📂 数据准备
```bash
# 下载数据集到 data/ 处
wget https://github.com/open-compass/opencompass/releases/download/0.1.8.rc1/OpenCompassData-core-20231110.zip
unzip OpenCompassData-core-20231110.zip
```
有部分第三方功能,如 Humaneval 以及 Llama,可能需要额外步骤才能正常运行,详细步骤请参考[安装指南](https://opencompass.readthedocs.io/zh_CN/latest/get_started/installation.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🏗️ ️评测
确保按照上述步骤正确安装 OpenCompass 并准备好数据集后,可以通过以下命令评测 LLaMA-7b 模型在 MMLU 和 C-Eval 数据集上的性能:
```bash
python run.py --models hf_llama_7b --datasets mmlu_ppl ceval_ppl
```
OpenCompass 预定义了许多模型和数据集的配置,你可以通过 [工具](./docs/zh_cn/tools.md#ListConfigs) 列出所有可用的模型和数据集配置。
```bash
# 列出所有配置
python tools/list_configs.py
# 列出所有跟 llama 及 mmlu 相关的配置
python tools/list_configs.py llama mmlu
```
你也可以通过命令行去评测其它 HuggingFace 模型。同样以 LLaMA-7b 为例:
```bash
python run.py --datasets ceval_ppl mmlu_ppl \
--hf-path huggyllama/llama-7b \ # HuggingFace 模型地址
--model-kwargs device_map='auto' \ # 构造 model 的参数
--tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \ # 构造 tokenizer 的参数
--max-out-len 100 \ # 最长生成 token 数
--max-seq-len 2048 \ # 模型能接受的最大序列长度
--batch-size 8 \ # 批次大小
--no-batch-padding \ # 不打开 batch padding通过 for loop 推理,避免精度损失
--num-gpus 1 # 运行该模型所需的最少 gpu 数
```
> **注意**<br />
> 若需要运行上述命令,你需要删除所有从 `# ` 开始的注释。
通过命令行或配置文件OpenCompass 还支持评测 API 或自定义模型,以及更多样化的评测策略。请阅读[快速开始](https://opencompass.readthedocs.io/zh_CN/latest/get_started/quick_start.html)了解如何运行一个评测任务。
更多教程请查看我们的[文档](https://opencompass.readthedocs.io/zh_CN/latest/index.html)。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 📖 数据集支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>考试</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>
我们已经在OpenCompass官网的文档中支持了所有可在本平台上使用的数据集的统计列表。
- WiC
- SummEdits
您可以通过排序、筛选和搜索等功能从列表中快速找到您需要的数据集。
</details>
<details open>
<summary><b>成语习语</b></summary>
- CHID
</details>
<details open>
<summary><b>语义相似度</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>指代消解</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>翻译</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>多语种问答</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>多语种总结</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>常识推理</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>数学推理</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>定理应用</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>综合推理</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>医学考试</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>理解</b>
</td>
<td>
<b>长文本</b>
</td>
<td>
<b>安全</b>
</td>
<td>
<b>代码</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>阅读理解</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>内容总结</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>内容分析</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>长文本理解</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>安全</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>健壮性</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>代码</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
详情请参阅 [官方文档](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) 的数据集统计章节。
<p align="right"><a href="#top">🔝返回顶部</a></p>
@ -445,23 +318,27 @@ python run.py --datasets ceval_ppl mmlu_ppl \
<tr valign="top">
<td>
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- [Baichuan](https://github.com/baichuan-inc)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- [ChatGLM2](https://github.com/THUDM/ChatGLM2-6B)
- [ChatGLM3](https://github.com/THUDM/ChatGLM3-6B)
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Gemma](https://huggingface.co/google/gemma-7b)
- [InternLM](https://github.com/InternLM/InternLM)
- [LLaMA](https://github.com/facebookresearch/llama)
- [LLaMA3](https://github.com/meta-llama/llama3)
- [Qwen](https://github.com/QwenLM/Qwen)
- [BlueLM](https://github.com/vivo-ai-lab/BlueLM)
- [TigerBot](https://github.com/TigerResearch/TigerBot)
- [Vicuna](https://github.com/lm-sys/FastChat)
- [WizardLM](https://github.com/nlpxucan/WizardLM)
- [Yi](https://github.com/01-ai/Yi)
- ……
</td>
<td>
- OpenAI
- Gemini
- Claude
- ZhipuAI(ChatGLM)
- Baichuan
@ -484,18 +361,18 @@ python run.py --datasets ceval_ppl mmlu_ppl \
## 🔜 路线图
- [ ] 主观评测
- [ ] 发布主观评测榜单
- [ ] 发布主观评测数据集
- [x] 主观评测
- [x] 发布主观评测榜单
- [x] 发布主观评测数据集
- [x] 长文本
- [ ] 支持广泛的长文本评测集
- [x] 支持广泛的长文本评测集
- [ ] 发布长文本评测榜单
- [ ] 代码能力
- [x] 代码能力
- [ ] 发布代码能力评测榜单
- [x] 提供非Python语言的评测服务
- [ ] 智能体
- [x] 智能体
- [ ] 支持丰富的智能体方案
- [ ] 提供智能体评测榜单
- [x] 提供智能体评测榜单
- [x] 鲁棒性
- [x] 支持各类攻击方法
@ -503,6 +380,16 @@ python run.py --datasets ceval_ppl mmlu_ppl \
我们感谢所有的贡献者为改进和提升 OpenCompass 所作出的努力。请参考[贡献指南](https://opencompass.readthedocs.io/zh_CN/latest/notes/contribution_guide.html)来了解参与项目贡献的相关指引。
<a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
<table>
<tr>
<th colspan="2">
<br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
</th>
</tr>
</table>
</a>
## 🤝 致谢
该项目部分的代码引用并修改自 [OpenICL](https://github.com/Shark-NLP/OpenICL)。
@ -521,3 +408,20 @@ python run.py --datasets ceval_ppl mmlu_ppl \
```
<p align="right"><a href="#top">🔝返回顶部</a></p>
[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
[github-forks-link]: https://github.com/open-compass/opencompass/network/members
[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
[github-issues-link]: https://github.com/open-compass/opencompass/issues
[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
[github-release-link]: https://github.com/open-compass/opencompass/releases
[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
[github-trending-url]: https://trendshift.io/repositories/6630
@ -1,36 +0,0 @@
from mmengine.config import read_base
from opencompass.models import AI360GPT
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='360GPT_S2_V9',
type=AI360GPT,
path='360GPT_S2_V9',
key="xxxxxxxxxxxx",
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "./output/api_360GPT_S2_V9"
@ -1,38 +0,0 @@
from mmengine.config import read_base
from opencompass.models import BaiChuan
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='Baichuan2-53B',
type=BaiChuan,
path='Baichuan2-53B',
api_key='xxxxxx',
secret_key="xxxxx",
url="xxxxx",
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_baichuan53b/"
@ -1,38 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ERNIEBot
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='erniebot',
type=ERNIEBot,
path='erniebot',
key='xxxxxx', # please give you key
secretkey='xxxxxxxxx', # please give your group_id
url='xxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_erniebot/"
@ -1,39 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ByteDance
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from .datasets.collections.chat_medium import datasets
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='skylark-pro-public',
type=ByteDance,
path='skylark-pro-public',
accesskey="xxxxxxx",
secretkey="xxxxxxx",
url='xxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_bytedance/"
@ -1,37 +0,0 @@
from mmengine.config import read_base
from opencompass.models import MiniMax
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='minimax_abab5.5-chat',
type=MiniMax,
path='abab5.5-chat',
key='xxxxxxx', # please give you key
group_id='xxxxxxxx', # please give your group_id
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=4,
concurrent_users=4,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_minimax/"
@ -1,37 +0,0 @@
from mmengine.config import read_base
from opencompass.models import MoonShot
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='moonshot-v1-32k',
type=MoonShot,
path='moonshot-v1-32k',
key='xxxxxxx',
url= 'xxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=4,
concurrent_users=4,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_moonshot/"
@ -1,42 +0,0 @@
from mmengine.config import read_base
from opencompass.models import PanGu
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='pangu',
type=PanGu,
path='pangu',
access_key="xxxxxx",
secret_key="xxxxxx",
url = "xxxxxx",
# url of token sever, used for generate token, like "https://xxxxxx.myhuaweicloud.com/v3/auth/tokens",
token_url = "xxxxxx",
# scope-project-name, used for generate token
project_name = "xxxxxx",
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_pangu/"
@ -1,37 +0,0 @@
from mmengine.config import read_base
from opencompass.models import SenseTime
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='nova-ptc-xl-v1',
type=SenseTime,
path='nova-ptc-xl-v1',
key='xxxxxxxxxxxxxx',
url='xxxxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_sensetime/"
@ -1,51 +0,0 @@
from mmengine.config import read_base
from opencompass.models.xunfei_api import XunFei
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from .datasets.collections.chat_medium import datasets
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='Spark-v1-1',
type=XunFei,
appid="xxxx",
path='ws://spark-api.xf-yun.com/v1.1/chat',
api_secret = "xxxxxxx",
api_key = "xxxxxxx",
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
dict(
abbr='Spark-v3-1',
type=XunFei,
appid="xxxx",
domain='generalv3',
path='ws://spark-api.xf-yun.com/v3.1/chat',
api_secret = "xxxxxxxx",
api_key = "xxxxxxxxx",
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_xunfei/"
@ -1,48 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ZhiPuAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from .datasets.collections.chat_medium import datasets
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
# needs a special postprocessor for all
# except 'gsm8k' and 'strategyqa'
from opencompass.utils import general_eval_wrapper_postprocess
for _dataset in datasets:
if _dataset['abbr'] not in ['gsm8k', 'strategyqa']:
if hasattr(_dataset['eval_cfg'], 'pred_postprocessor'):
_dataset['eval_cfg']['pred_postprocessor']['postprocess'] = _dataset['eval_cfg']['pred_postprocessor']['type']
_dataset['eval_cfg']['pred_postprocessor']['type'] = general_eval_wrapper_postprocess
else:
_dataset['eval_cfg']['pred_postprocessor'] = {'type': general_eval_wrapper_postprocess}
models = [
dict(
abbr='chatglm_pro',
type=ZhiPuAI,
path='chatglm_pro',
key='xxxxxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = "outputs/api_zhipu/"
@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .CIBench_gen_eb42f9 import ci_datasets # noqa: F401, F403
@ -1,58 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MathBenchDataset, mathbench_postprocess
cloze_prompts = {
"cloze_arith_en": [
dict(role='HUMAN', prompt='Q: Calculate (341/11)/(9/(-6)*(-2)/3).'),
dict(role='BOT', prompt='A: First, (9/(-6)*(-2)/3) can be simplified by : 9/(-6) = -1.5, -1.5 * (-2) = 3, 3 / 3 = 1. So, (9/(-6)*(-2)/3) is equal to 1. Now, we have `(341/11)/1` equals `341/11`. Finally, calculate `341/11 = 31`. The answer is 31.\n'),
dict(role='HUMAN', prompt='Q: In base 14, what is 5 - 638d8d?'),
dict(role='BOT', prompt='A: 5 - 638d8d = -638d88. The answer is -638d88.\n'),
dict(role='HUMAN', prompt='Q: What is -491354 times -0.34?'),
dict(role='BOT', prompt='A: The product of -491354 and -0.34 is 167060.36. The answer is 167060.36.\n'),
dict(role='HUMAN', prompt='Q: What is the value of (-55)/(6930/(-382)) + (0 - 3)?.'),
dict(role='BOT', prompt='A: First, (-55)/(6930/(-382)) = (-55)/(-(6930/382)) = 55*382/6930 = 21010/6930 = 2101/693. Then, 2101/693 + (0 - 3) = 2101/693 - 3 = 2101/693 - 3*693/693 = (2101-2079)/693 = 22/693 = 2/63. The answer is 2/63.\n'),
dict(role='HUMAN', prompt='Q: {question}'),
dict(role='BOT', prompt='A: {answer}\n'),
]
}
mathbench_sets = {
'arithmetic': ['cloze_arith_en'],
}
mathbench_datasets = []
for _split in list(mathbench_sets.keys()):
for _name in mathbench_sets[_split]:
mathbench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=cloze_prompts[_name],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)
mathbench_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=mathbench_postprocess, name=_name))
mathbench_datasets.append(
dict(
type=MathBenchDataset,
path=f"./data/mathbench/{_split}",
name=_name,
with_circular=False,
abbr="mathbench-arithmetic" + _split + '-' + _name,
reader_cfg=dict(
input_columns=["question"],
output_column="answer"
),
infer_cfg=mathbench_infer_cfg,
eval_cfg=mathbench_eval_cfg,
))
@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .TheoremQA_gen_7009de import TheoremQA_datasets # noqa: F401, F403
@ -1,40 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import TheoremQADataset, TheoremQA_postprocess
TheoremQA_reader_cfg = dict(
input_columns=['Question', 'Answer_type'],
output_column='Answer',
train_split='test')
TheoremQA_prompt1 = "Please read a math problem, and then think step by step to derive the answer. The answer is decided by Answer Type. " \
"If the Answer type in [bool], the answer needs to be True or False. " \
"Else if the Answer type in [integer, float] , The answer needs to be in numerical form. " \
"Else if the Answer type in [list of integer, list of float] , the answer needs to be a list of number like [2, 3, 4]. " \
"Else if the Answer type in [option], the answer needs to be an option like (a), (b), (c), (d)." \
"You need to output the answer in your final sentence like 'Therefore, the answer is ...'."
TheoremQA_prompt2 = f"Below is an instruction that describes a task, paired with an input that provides further context. " \
f"Write a response that appropriately completes the request.\n\n### Instruction:\n{TheoremQA_prompt1}\n\n### Input:\n{{Question}}\nAnswer_type:{{Answer_type}}\n### Response:\n"
TheoremQA_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=TheoremQA_prompt2),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
TheoremQA_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=TheoremQA_postprocess))
TheoremQA_datasets = [
dict(
abbr='TheoremQA',
type=TheoremQADataset,
path="./data/TheoremQA/test.csv",
reader_cfg=TheoremQA_reader_cfg,
infer_cfg=TheoremQA_infer_cfg,
eval_cfg=TheoremQA_eval_cfg)
]
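The TheoremQA prompt above instructs the model to finish with "Therefore, the answer is ...", which is the marker `TheoremQA_postprocess` keys on. A rough, self-contained sketch of that extraction step (the regex and fallback are assumptions, not the exact repository implementation):

```python
import re

def theoremqa_extract_sketch(text: str) -> str:
    # Grab whatever follows the final-answer marker, up to the closing period.
    match = re.search(r'answer is\s*(.+?)(?:\.|$)', text, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else text.strip()

print(theoremqa_extract_sketch("... Therefore, the answer is (b)."))  # -> (b)
```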

View File

@ -1,37 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import TheoremQADataset, TheoremQA_postprocess
TheoremQA_reader_cfg = dict(
input_columns=['Question', 'Answer_type'],
output_column='Answer',
train_split='test')
TheoremQA_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
"""You are a mathematician, you are supposed to answer the given question. You need to output the answer in your final sentence like "Therefore, the answer is ...". The answer can only be one of the following forms:\n1. a numerical value like 0.1, no symbol and no unit at all.\n2. a list of number like [2, 3, 4].\n3. True/False.\n4. an option like (a), (b), (c), (d)\nQuestion: {Question}\nLet\'s think step by step."""
),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
TheoremQA_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=TheoremQA_postprocess))
TheoremQA_datasets = [
dict(
abbr='TheoremQA',
type=TheoremQADataset,
path="./data/TheoremQA/test.csv",
reader_cfg=TheoremQA_reader_cfg,
infer_cfg=TheoremQA_infer_cfg,
eval_cfg=TheoremQA_eval_cfg)
]

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .apps_gen_7fbb95 import apps_datasets # noqa: F401, F403

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .bbh_gen_5b92b0 import bbh_datasets # noqa: F401, F403

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .cmmlu_gen_c13365 import cmmlu_datasets # noqa: F401, F403

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .drop_gen_599f07 import drop_datasets # noqa: F401, F403

View File

@ -1,162 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import TopkRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import FloresFirst100Dataset
_flores_lang_map = [
["eng", "eng_Latn", "English", "Indo-European-Germanic"],
["afr", "afr_Latn", "Afrikaans", "Indo-European-Germanic"],
["dan", "dan_Latn", "Danish", "Indo-European-Germanic"],
["deu", "deu_Latn", "German", "Indo-European-Germanic"],
["isl", "isl_Latn", "Icelandic", "Indo-European-Germanic"],
["ltz", "ltz_Latn", "Luxembourgish", "Indo-European-Germanic"],
["nld", "nld_Latn", "Dutch", "Indo-European-Germanic"],
["nob", "nob_Latn", "Norwegian", "Indo-European-Germanic"],
["swe", "swe_Latn", "Swedish", "Indo-European-Germanic"],
["ast", "ast_Latn", "Asturian", "Indo-European-Romance"],
["cat", "cat_Latn", "Catalan", "Indo-European-Romance"],
["fra", "fra_Latn", "French", "Indo-European-Romance"],
["glg", "glg_Latn", "Galician", "Indo-European-Romance"],
["oci", "oci_Latn", "Occitan", "Indo-European-Romance"],
["por", "por_Latn", "Portuguese", "Indo-European-Romance"],
["ron", "ron_Latn", "Romanian", "Indo-European-Romance"],
["spa", "spa_Latn", "Spanish", "Indo-European-Romance"],
["bel", "bel_Cyrl", "Belarusian", "Indo-European-Slavic"],
["bos", "bos_Latn", "Bosnian", "Indo-European-Slavic"],
["bul", "bul_Cyrl", "Bulgarian", "Indo-European-Slavic"],
["ces", "ces_Latn", "Czech", "Indo-European-Slavic"],
["hrv", "hrv_Latn", "Croatian", "Indo-European-Slavic"],
["mkd", "mkd_Cyrl", "Macedonian", "Indo-European-Slavic"],
["pol", "pol_Latn", "Polish", "Indo-European-Slavic"],
["rus", "rus_Cyrl", "Russian", "Indo-European-Slavic"],
["slk", "slk_Latn", "Slovak", "Indo-European-Slavic"],
["slv", "slv_Latn", "Slovenian", "Indo-European-Slavic"],
["srp", "srp_Cyrl", "Serbian", "Indo-European-Slavic"],
["ukr", "ukr_Cyrl", "Ukrainian", "Indo-European-Slavic"],
["asm", "asm_Beng", "Assamese", "Indo-European-Indo-Aryan"],
["ben", "ben_Beng", "Bengali", "Indo-European-Indo-Aryan"],
["guj", "guj_Gujr", "Gujarati", "Indo-European-Indo-Aryan"],
["hin", "hin_Deva", "Hindi", "Indo-European-Indo-Aryan"],
["mar", "mar_Deva", "Marathi", "Indo-European-Indo-Aryan"],
["npi", "npi_Deva", "Nepali", "Indo-European-Indo-Aryan"],
["ory", "ory_Orya", "Oriya", "Indo-European-Indo-Aryan"],
["pan", "pan_Guru", "Punjabi", "Indo-European-Indo-Aryan"],
["snd", "snd_Arab", "Sindhi", "Indo-European-Indo-Aryan"],
["urd", "urd_Arab", "Urdu", "Indo-European-Indo-Aryan"],
["ckb", "ckb_Arab", "Kurdish", "Indo-European-Other"],
["cym", "cym_Latn", "Welsh", "Indo-European-Other"],
["ell", "ell_Grek", "Greek", "Indo-European-Other"],
["fas", "pes_Arab", "Persian", "Indo-European-Other"],
["gle", "gle_Latn", "Irish", "Indo-European-Other"],
["hye", "hye_Armn", "Armenian", "Indo-European-Other"],
["ita", "ita_Latn", "Italian", "Indo-European-Other"],
["lav", "lvs_Latn", "Latvian", "Indo-European-Other"],
["lit", "lit_Latn", "Lithuanian", "Indo-European-Other"],
["pus", "pbt_Arab", "Pashto", "Indo-European-Other"],
["tgk", "tgk_Cyrl", "Tajik", "Indo-European-Other"],
["ceb", "ceb_Latn", "Cebuano", "Austronesian"],
["ind", "ind_Latn", "Indonesian", "Austronesian"],
["jav", "jav_Latn", "Javanese", "Austronesian"],
["mri", "mri_Latn", "Maori", "Austronesian"],
["msa", "zsm_Latn", "Malay", "Austronesian"],
["tgl", "tgl_Latn", "Tagalog", "Austronesian"],
["ibo", "ibo_Latn", "Igbo", "Atlantic-Congo"],
["kam", "kam_Latn", "Kamba", "Atlantic-Congo"],
["kea", "kea_Latn", "Kabuverdianu", "Atlantic-Congo"],
["lin", "lin_Latn", "Lingala", "Atlantic-Congo"],
["lug", "lug_Latn", "Luganda", "Atlantic-Congo"],
["nso", "nso_Latn", "Northern Sotho", "Atlantic-Congo"],
["nya", "nya_Latn", "Nyanja", "Atlantic-Congo"],
["sna", "sna_Latn", "Shona", "Atlantic-Congo"],
["swh", "swh_Latn", "Swahili", "Atlantic-Congo"],
["umb", "umb_Latn", "Umbundu", "Atlantic-Congo"],
["wol", "wol_Latn", "Wolof", "Atlantic-Congo"],
["xho", "xho_Latn", "Xhosa", "Atlantic-Congo"],
["yor", "yor_Latn", "Yoruba", "Atlantic-Congo"],
["zul", "zul_Latn", "Zulu", "Atlantic-Congo"],
["amh", "amh_Ethi", "Amharic", "Afro-Asiatic"],
["ara", "arb_Arab", "Arabic", "Afro-Asiatic"],
["ful", "fuv_Latn", "Fulah", "Afro-Asiatic"],
["mlt", "mlt_Latn", "Maltese", "Afro-Asiatic"],
["orm", "gaz_Latn", "Oromo", "Afro-Asiatic"],
["som", "som_Latn", "Somali", "Afro-Asiatic"],
["azj", "azj_Latn", "Azerbaijani", "Turkic"],
["kaz", "kaz_Cyrl", "Kazakh", "Turkic"],
["kir", "kir_Cyrl", "Kyrgyz", "Turkic"],
["tur", "tur_Latn", "Turkish", "Turkic"],
["uzb", "uzn_Latn", "Uzbek", "Turkic"],
["kan", "kan_Knda", "Kannada", "Dravidian"],
["mal", "mal_Mlym", "Malayalam", "Dravidian"],
["tam", "tam_Taml", "Tamil", "Dravidian"],
["tel", "tel_Telu", "Telugu", "Dravidian"],
["mya", "mya_Mymr", "Burmese", "Sino-Tibetan"],
["zho_simpl", "zho_Hans", "Chinese (Simpl)", "Sino-Tibetan"],
["zho_trad", "zho_Hant", "Chinese (Trad)", "Sino-Tibetan"],
["est", "est_Latn", "Estonian", "Other"],
["fin", "fin_Latn", "Finnish", "Other"],
["hau", "hau_Latn", "Hausa", "Other"],
["heb", "heb_Hebr", "Hebrew", "Other"],
["hun", "hun_Latn", "Hungarian", "Other"],
["jpn", "jpn_Jpan", "Japanese", "Other"],
["kat", "kat_Geor", "Georgian", "Other"],
["khm", "khm_Khmr", "Khmer", "Other"],
["kor", "kor_Hang", "Korean", "Other"],
["lao", "lao_Laoo", "Lao", "Other"],
["luo", "luo_Latn", "Luo", "Other"],
["mon", "khk_Cyrl", "Mongolian", "Other"],
["tha", "tha_Thai", "Thai", "Other"],
["vie", "vie_Latn", "Vietnamese", "Other"],
]
flores_lang_map = {i[0]: i for i in _flores_lang_map}
_flores_subtasks = [f"eng-{i}" for i in flores_lang_map if i != "eng"
] + [f"{i}-eng" for i in flores_lang_map if i != "eng"]
flores_datasets = []
for _flores_subtask in _flores_subtasks:
_src, _tgt = _flores_subtask.split("-")
_, _flores_source, _src_inst, _ = flores_lang_map[_src]
_, _flores_target, _tgt_inst, _ = flores_lang_map[_tgt]
flores_reader_cfg = dict(
input_columns=f"sentence_{_flores_source}",
output_column=f"sentence_{_flores_target}",
train_split="dev",
test_split="devtest"
)
flores_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=dict(
begin="</E>",
round=[
dict(
role="HUMAN",
prompt=
f"Translate the following {_src_inst} statements to {_tgt_inst}.\n{{sentence_{_flores_source}}}"
),
dict(role="BOT", prompt=f"{{sentence_{_flores_target}}}"),
],
),
ice_token="</E>",
),
retriever=dict(type=TopkRetriever, ice_num=8),
inferencer=dict(type=GenInferencer),
)
flores_eval_cfg = dict(
evaluator=dict(type=BleuEvaluator),
pred_role="BOT",
)
if _tgt == "zho_simpl":
flores_eval_cfg["pred_postprocessor"] = dict(type="flores")
flores_eval_cfg["dataset_postprocessor"] = dict(type="flores")
flores_datasets.append(
dict(
abbr=f"flores_100_{_src}-{_tgt}",
type=FloresFirst100Dataset,
path='./data/flores_first100',
name=f"{_flores_source}-{_flores_target}",
reader_cfg=flores_reader_cfg.copy(),
infer_cfg=flores_infer_cfg.copy(),
eval_cfg=flores_eval_cfg.copy(),
))
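`BleuEvaluator` scores these translation subtasks with BLEU. As a stand-alone sanity check of what that metric looks like (calling `sacrebleu` directly here is an assumption for illustration, not a statement about the evaluator's internals):

```python
import sacrebleu  # pip install sacrebleu

hypotheses = ["The cat sits on the mat."]
references = [["The cat is sitting on the mat."]]  # one reference stream

# corpus_bleu takes hypothesis strings plus a list of reference streams.
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```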

View File

@ -1,155 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import TopkRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import BleuEvaluator
from opencompass.datasets import FloresFirst100Dataset
_flores_lang_map = [
["eng", "eng_Latn", "English", "Indo-European-Germanic"],
["afr", "afr_Latn", "Afrikaans", "Indo-European-Germanic"],
["dan", "dan_Latn", "Danish", "Indo-European-Germanic"],
["deu", "deu_Latn", "German", "Indo-European-Germanic"],
["isl", "isl_Latn", "Icelandic", "Indo-European-Germanic"],
["ltz", "ltz_Latn", "Luxembourgish", "Indo-European-Germanic"],
["nld", "nld_Latn", "Dutch", "Indo-European-Germanic"],
["nob", "nob_Latn", "Norwegian", "Indo-European-Germanic"],
["swe", "swe_Latn", "Swedish", "Indo-European-Germanic"],
["ast", "ast_Latn", "Asturian", "Indo-European-Romance"],
["cat", "cat_Latn", "Catalan", "Indo-European-Romance"],
["fra", "fra_Latn", "French", "Indo-European-Romance"],
["glg", "glg_Latn", "Galician", "Indo-European-Romance"],
["oci", "oci_Latn", "Occitan", "Indo-European-Romance"],
["por", "por_Latn", "Portuguese", "Indo-European-Romance"],
["ron", "ron_Latn", "Romanian", "Indo-European-Romance"],
["spa", "spa_Latn", "Spanish", "Indo-European-Romance"],
["bel", "bel_Cyrl", "Belarusian", "Indo-European-Slavic"],
["bos", "bos_Latn", "Bosnian", "Indo-European-Slavic"],
["bul", "bul_Cyrl", "Bulgarian", "Indo-European-Slavic"],
["ces", "ces_Latn", "Czech", "Indo-European-Slavic"],
["hrv", "hrv_Latn", "Croatian", "Indo-European-Slavic"],
["mkd", "mkd_Cyrl", "Macedonian", "Indo-European-Slavic"],
["pol", "pol_Latn", "Polish", "Indo-European-Slavic"],
["rus", "rus_Cyrl", "Russian", "Indo-European-Slavic"],
["slk", "slk_Latn", "Slovak", "Indo-European-Slavic"],
["slv", "slv_Latn", "Slovenian", "Indo-European-Slavic"],
["srp", "srp_Cyrl", "Serbian", "Indo-European-Slavic"],
["ukr", "ukr_Cyrl", "Ukrainian", "Indo-European-Slavic"],
["asm", "asm_Beng", "Assamese", "Indo-European-Indo-Aryan"],
["ben", "ben_Beng", "Bengali", "Indo-European-Indo-Aryan"],
["guj", "guj_Gujr", "Gujarati", "Indo-European-Indo-Aryan"],
["hin", "hin_Deva", "Hindi", "Indo-European-Indo-Aryan"],
["mar", "mar_Deva", "Marathi", "Indo-European-Indo-Aryan"],
["npi", "npi_Deva", "Nepali", "Indo-European-Indo-Aryan"],
["ory", "ory_Orya", "Oriya", "Indo-European-Indo-Aryan"],
["pan", "pan_Guru", "Punjabi", "Indo-European-Indo-Aryan"],
["snd", "snd_Arab", "Sindhi", "Indo-European-Indo-Aryan"],
["urd", "urd_Arab", "Urdu", "Indo-European-Indo-Aryan"],
["ckb", "ckb_Arab", "Kurdish", "Indo-European-Other"],
["cym", "cym_Latn", "Welsh", "Indo-European-Other"],
["ell", "ell_Grek", "Greek", "Indo-European-Other"],
["fas", "pes_Arab", "Persian", "Indo-European-Other"],
["gle", "gle_Latn", "Irish", "Indo-European-Other"],
["hye", "hye_Armn", "Armenian", "Indo-European-Other"],
["ita", "ita_Latn", "Italian", "Indo-European-Other"],
["lav", "lvs_Latn", "Latvian", "Indo-European-Other"],
["lit", "lit_Latn", "Lithuanian", "Indo-European-Other"],
["pus", "pbt_Arab", "Pashto", "Indo-European-Other"],
["tgk", "tgk_Cyrl", "Tajik", "Indo-European-Other"],
["ceb", "ceb_Latn", "Cebuano", "Austronesian"],
["ind", "ind_Latn", "Indonesian", "Austronesian"],
["jav", "jav_Latn", "Javanese", "Austronesian"],
["mri", "mri_Latn", "Maori", "Austronesian"],
["msa", "zsm_Latn", "Malay", "Austronesian"],
["tgl", "tgl_Latn", "Tagalog", "Austronesian"],
["ibo", "ibo_Latn", "Igbo", "Atlantic-Congo"],
["kam", "kam_Latn", "Kamba", "Atlantic-Congo"],
["kea", "kea_Latn", "Kabuverdianu", "Atlantic-Congo"],
["lin", "lin_Latn", "Lingala", "Atlantic-Congo"],
["lug", "lug_Latn", "Luganda", "Atlantic-Congo"],
["nso", "nso_Latn", "Northern Sotho", "Atlantic-Congo"],
["nya", "nya_Latn", "Nyanja", "Atlantic-Congo"],
["sna", "sna_Latn", "Shona", "Atlantic-Congo"],
["swh", "swh_Latn", "Swahili", "Atlantic-Congo"],
["umb", "umb_Latn", "Umbundu", "Atlantic-Congo"],
["wol", "wol_Latn", "Wolof", "Atlantic-Congo"],
["xho", "xho_Latn", "Xhosa", "Atlantic-Congo"],
["yor", "yor_Latn", "Yoruba", "Atlantic-Congo"],
["zul", "zul_Latn", "Zulu", "Atlantic-Congo"],
["amh", "amh_Ethi", "Amharic", "Afro-Asiatic"],
["ara", "arb_Arab", "Arabic", "Afro-Asiatic"],
["ful", "fuv_Latn", "Fulah", "Afro-Asiatic"],
["mlt", "mlt_Latn", "Maltese", "Afro-Asiatic"],
["orm", "gaz_Latn", "Oromo", "Afro-Asiatic"],
["som", "som_Latn", "Somali", "Afro-Asiatic"],
["azj", "azj_Latn", "Azerbaijani", "Turkic"],
["kaz", "kaz_Cyrl", "Kazakh", "Turkic"],
["kir", "kir_Cyrl", "Kyrgyz", "Turkic"],
["tur", "tur_Latn", "Turkish", "Turkic"],
["uzb", "uzn_Latn", "Uzbek", "Turkic"],
["kan", "kan_Knda", "Kannada", "Dravidian"],
["mal", "mal_Mlym", "Malayalam", "Dravidian"],
["tam", "tam_Taml", "Tamil", "Dravidian"],
["tel", "tel_Telu", "Telugu", "Dravidian"],
["mya", "mya_Mymr", "Burmese", "Sino-Tibetan"],
["zho_simpl", "zho_Hans", "Chinese (Simpl)", "Sino-Tibetan"],
["zho_trad", "zho_Hant", "Chinese (Trad)", "Sino-Tibetan"],
["est", "est_Latn", "Estonian", "Other"],
["fin", "fin_Latn", "Finnish", "Other"],
["hau", "hau_Latn", "Hausa", "Other"],
["heb", "heb_Hebr", "Hebrew", "Other"],
["hun", "hun_Latn", "Hungarian", "Other"],
["jpn", "jpn_Jpan", "Japanese", "Other"],
["kat", "kat_Geor", "Georgian", "Other"],
["khm", "khm_Khmr", "Khmer", "Other"],
["kor", "kor_Hang", "Korean", "Other"],
["lao", "lao_Laoo", "Lao", "Other"],
["luo", "luo_Latn", "Luo", "Other"],
["mon", "khk_Cyrl", "Mongolian", "Other"],
["tha", "tha_Thai", "Thai", "Other"],
["vie", "vie_Latn", "Vietnamese", "Other"],
]
flores_lang_map = {i[0]: i for i in _flores_lang_map}
_flores_subtasks = [f"eng-{i}" for i in flores_lang_map if i != "eng"
] + [f"{i}-eng" for i in flores_lang_map if i != "eng"]
flores_datasets = []
for _flores_subtask in _flores_subtasks:
_src, _tgt = _flores_subtask.split("-")
_, _flores_source, _src_inst, _ = flores_lang_map[_src]
_, _flores_target, _tgt_inst, _ = flores_lang_map[_tgt]
flores_reader_cfg = dict(
input_columns=f"sentence_{_flores_source}",
output_column=f"sentence_{_flores_target}",
train_split="dev",
test_split="devtest"
)
flores_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=f"</E>{{sentence_{_flores_source}}} = {{sentence_{_flores_target}}}" if _flores_subtask != "zho_simpl-eng"
else f"</E>Chinese: {{sentence_{_flores_source}}}\nEnglish: {{sentence_{_flores_target}}}",
ice_token="</E>",
),
retriever=dict(type=TopkRetriever, ice_num=8),
inferencer=dict(type=GenInferencer),
)
flores_eval_cfg = dict(
evaluator=dict(type=BleuEvaluator),
pred_role="BOT",
pred_postprocessor=dict(type="flores"),
dataset_postprocessor=dict(type="flores"),
)
if _tgt == "zho_simpl":
flores_eval_cfg["pred_postprocessor"] = dict(type="flores-chinese")
flores_eval_cfg["dataset_postprocessor"] = dict(type="flores-chinese")
flores_datasets.append(
dict(
abbr=f"flores_100_{_src}-{_tgt}",
type=FloresFirst100Dataset,
path='./data/flores_first100',
name=f"{_flores_source}-{_flores_target}",
reader_cfg=flores_reader_cfg.copy(),
infer_cfg=flores_infer_cfg.copy(),
eval_cfg=flores_eval_cfg.copy(),
))
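Both flores configs fetch their 8 in-context examples with `TopkRetriever`, i.e. the training sentences most similar to the current test sentence. A minimal numpy sketch of that idea (random vectors stand in for real sentence embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(100, 16))  # embeddings of candidate in-context examples
test_emb = rng.normal(size=(16,))       # embedding of the current test input

# Cosine similarity against every candidate, then keep the 8 closest.
sims = train_emb @ test_emb / (np.linalg.norm(train_emb, axis=1) * np.linalg.norm(test_emb))
ice_ids = np.argsort(-sims)[:8]
print(ice_ids)
```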

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .hellaswag_gen_6faab5 import hellaswag_datasets # noqa: F401, F403

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .humaneval_gen_8e312c import humaneval_datasets # noqa: F401, F403

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .math_gen_265cce import math_datasets # noqa: F401, F403

View File

@ -1,68 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MATHDataset, MATHEvaluator, math_postprocess
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
math_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role="HUMAN",
prompt=
"Problem:\nFind the domain of the expression $\\frac{{\sqrt{{x-2}}}}{{\sqrt{{5-x}}}}$.}}\nSolution:"
),
dict(
role="BOT",
prompt=
"The expressions inside each square root must be non-negative. Therefore, $x-2 \ge 0$, so $x\ge2$, and $5 - x \ge 0$, so $x \le 5$. Also, the denominator cannot be equal to zero, so $5-x>0$, which gives $x<5$. Therefore, the domain of the expression is $\\boxed{{[2,5)}}$.\nFinal Answer: The final answer is $[2,5)$. I hope it is correct.\n"
),
dict(
role="HUMAN",
prompt=
"Problem:\nIf $\det \mathbf{{A}} = 2$ and $\det \mathbf{{B}} = 12,$ then find $\det (\mathbf{{A}} \mathbf{{B}}).$\nSolution:"
),
dict(
role="BOT",
prompt=
"We have that $\det (\mathbf{{A}} \mathbf{{B}}) = (\det \mathbf{{A}})(\det \mathbf{{B}}) = (2)(12) = \\boxed{{24}}.$\nFinal Answer: The final answer is $24$. I hope it is correct.\n"
),
dict(
role="HUMAN",
prompt=
"Problem:\nTerrell usually lifts two 20-pound weights 12 times. If he uses two 15-pound weights instead, how many times must Terrell lift them in order to lift the same total weight?\nSolution:"
),
dict(
role="BOT",
prompt=
"If Terrell lifts two 20-pound weights 12 times, he lifts a total of $2\cdot 12\cdot20=480$ pounds of weight. If he lifts two 15-pound weights instead for $n$ times, he will lift a total of $2\cdot15\cdot n=30n$ pounds of weight. Equating this to 480 pounds, we can solve for $n$: \\begin{{align*}} 30n&=480\\\\ \Rightarrow\qquad n&=480/30=\\boxed{{16}} \end{{align*}}\nFinal Answer: The final answer is $16$. I hope it is correct.\n"
),
dict(
role="HUMAN",
prompt=
"Problem:\nIf the system of equations: \\begin{{align*}} 6x-4y&=a,\\\\ 6y-9x &=b. \end{{align*}}has a solution $(x, y)$ where $x$ and $y$ are both nonzero, find $\\frac{{a}}{{b}},$ assuming $b$ is nonzero.\nSolution:"
),
dict(
role="BOT",
prompt=
"If we multiply the first equation by $-\\frac{{3}}{{2}}$, we obtain $$6y-9x=-\\frac{{3}}{{2}}a.$$Since we also know that $6y-9x=b$, we have $$-\\frac{{3}}{{2}}a=b\Rightarrow\\frac{{a}}{{b}}=\\boxed{{-\\frac{{2}}{{3}}}}.$$\nFinal Answer: The final answer is $-\\frac{{2}}{{3}}$. I hope it is correct.\n"
),
dict(role="HUMAN", prompt="Problem:\n{problem}\nSolution:\n"),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
math_eval_cfg = dict(
evaluator=dict(type=MATHEvaluator), pred_postprocessor=dict(type=math_postprocess))
math_datasets = [
dict(
type=MATHDataset,
abbr='math',
path='./data/math/math.json',
reader_cfg=math_reader_cfg,
infer_cfg=math_infer_cfg,
eval_cfg=math_eval_cfg)
]
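Every few-shot solution above ends with "Final Answer: The final answer is $...$", which is the pattern `math_postprocess` relies on. A rough sketch of that extraction (the regex is an assumption for illustration, not the exact repository code):

```python
import re

def math_final_answer_sketch(text: str) -> str:
    match = re.search(r'final answer is \$?(.+?)\$?\s*\.?\s*I hope', text)
    return match.group(1).strip() if match else text.strip()

print(math_final_answer_sketch(
    "Final Answer: The final answer is $[2,5)$. I hope it is correct."))  # -> [2,5)
```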

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .mbpp_gen_1e1056 import mbpp_datasets # noqa: F401, F403

View File

@ -1,42 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MBPPDataset, MBPPEvaluator2
mbpp_reader_cfg = dict(
input_columns=['text', 'test_list'], output_column='test_list_2')
# This prompt is used for the WizardLMCode series.
# You can use other config files for basic 3-shot generation.
mbpp_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=
"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Create a Python script for this problem:
{text}
Test examples:
{test_list}
### Response:"""),
])),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
mbpp_eval_cfg = dict(evaluator=dict(type=MBPPEvaluator2), pred_role="BOT")
mbpp_datasets = [
dict(
type=MBPPDataset,
abbr='mbpp',
path='./data/mbpp/mbpp.jsonl',
reader_cfg=mbpp_reader_cfg,
infer_cfg=mbpp_infer_cfg,
eval_cfg=mbpp_eval_cfg)
]

View File

@ -1,27 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MBPPDataset, MBPPEvaluator
mbpp_reader_cfg = dict(
input_columns=['text', 'test_list'], output_column='test_list_2')
mbpp_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=
"You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list} \n[BEGIN]\n"),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512))
mbpp_eval_cfg = dict(evaluator=dict(type=MBPPEvaluator))
mbpp_datasets = [
dict(
type=MBPPDataset,
abbr='mbpp',
path='./data/mbpp/mbpp.jsonl',
reader_cfg=mbpp_reader_cfg,
infer_cfg=mbpp_infer_cfg,
eval_cfg=mbpp_eval_cfg)
]

View File

@ -1,64 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import MBPPDataset, MBPPEvaluator
mbpp_reader_cfg = dict(
input_columns=['text', 'test_list'], output_column='test_list_2')
mbpp_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role="HUMAN",
prompt=
"You are an expert Python programmer, and here is your task: Write a function to find the similar elements from the given two tuple lists. Your code should pass these tests:\n\n assert similar_elements((3, 4, 5, 6),(5, 7, 4, 10)) == (4, 5)\n assert similar_elements((1, 2, 3, 4),(5, 4, 3, 7)) == (3, 4) \n assert similar_elements((11, 12, 14, 13),(17, 15, 14, 13)) == (13, 14) \n"
),
dict(
role="BOT",
prompt=
"[BEGIN]\n 'def similar_elements(test_tup1, test_tup2):\r\n res = tuple(set(test_tup1) & set(test_tup2))\r\n return (res)' \n[DONE] \n\n "
),
dict(
role="HUMAN",
prompt=
"You are an expert Python programmer, and here is your task: Write a python function to identify non-prime numbers. Your code should pass these tests:\n\n assert is_not_prime(2) == False \n assert is_not_prime(10) == True \n assert is_not_prime(35) == True \n"
),
dict(
role="BOT",
prompt=
"[BEGIN]\n 'import math\r\ndef is_not_prime(n):\r\n result = False\r\n for i in range(2,int(math.sqrt(n)) + 1):\r\n if n % i == 0:\r\n result = True\r\n return result' \n[DONE] \n\n "
),
dict(
role="HUMAN",
prompt=
"You are an expert Python programmer, and here is your task: Write a function to find the largest integers from a given list of numbers using heap queue algorithm. Your code should pass these tests:\n\n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],3)==[85, 75, 65] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],2)==[85, 75] \n assert heap_queue_largest( [25, 35, 22, 85, 14, 65, 75, 22, 58],5)==[85, 75, 65, 58, 35] \n"
),
dict(
role="BOT",
prompt=
"[BEGIN]\n 'import heapq as hq\r\ndef heap_queue_largest(nums,n):\r\n largest_nums = hq.nlargest(n, nums)\r\n return largest_nums' \n[DONE] \n\n "
),
dict(
role="HUMAN",
prompt=
"You are an expert Python programmer, and here is your task: {text} Your code should pass these tests:\n\n {test_list} \n"
),
dict(role="BOT", prompt="[BEGIN]\n"),
], )),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
mbpp_eval_cfg = dict(evaluator=dict(type=MBPPEvaluator), pred_role="BOT")
mbpp_datasets = [
dict(
type=MBPPDataset,
abbr='mbpp',
path='./data/mbpp/mbpp.jsonl',
reader_cfg=mbpp_reader_cfg,
infer_cfg=mbpp_infer_cfg,
eval_cfg=mbpp_eval_cfg)
]
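All three MBPP variants above ultimately judge a completion by executing it against the `test_list` asserts. A deliberately simplified sketch of that check, without the sandboxing and timeouts the real evaluator uses (the completion string is made up):

```python
completion = (
    "def similar_elements(test_tup1, test_tup2):\n"
    "    return tuple(sorted(set(test_tup1) & set(test_tup2)))"
)
tests = [
    "assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)",
    "assert similar_elements((1, 2, 3, 4), (5, 4, 3, 7)) == (3, 4)",
]

namespace = {}
try:
    exec(completion, namespace)   # define the candidate function
    for t in tests:
        exec(t, namespace)        # run each reference assert against it
    result = "pass"
except Exception:
    result = "fail"
print(result)  # -> pass
```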

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .mmlu_gen_a484b3 import mmlu_datasets # noqa: F401, F403

View File

@ -1,124 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
# None of the MMLU datasets on HuggingFace are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
input_columns=["input", "A", "B", "C", "D"],
output_column="target",
train_split='dev')
mmlu_all_sets = [
"college_biology",
"college_chemistry",
"college_computer_science",
"college_mathematics",
"college_physics",
"electrical_engineering",
"astronomy",
"anatomy",
"abstract_algebra",
"machine_learning",
"clinical_knowledge",
"global_facts",
"management",
"nutrition",
"marketing",
"professional_accounting",
"high_school_geography",
"international_law",
"moral_scenarios",
"computer_security",
"high_school_microeconomics",
"professional_law",
"medical_genetics",
"professional_psychology",
"jurisprudence",
"world_religions",
"philosophy",
"virology",
"high_school_chemistry",
"public_relations",
"high_school_macroeconomics",
"human_sexuality",
"elementary_mathematics",
"high_school_physics",
"high_school_computer_science",
"high_school_european_history",
"business_ethics",
"moral_disputes",
"high_school_statistics",
"miscellaneous",
"formal_logic",
"high_school_government_and_politics",
"prehistory",
"security_studies",
"high_school_biology",
"logical_fallacies",
"high_school_world_history",
"professional_medicine",
"high_school_mathematics",
"college_medicine",
"high_school_us_history",
"sociology",
"econometrics",
"high_school_psychology",
"human_aging",
"us_foreign_policy",
"conceptual_physics",
]
mmlu_datasets = []
for _name in mmlu_all_sets:
_hint = f'There is a single choice question about {_name.replace("_", " ")}. Answer the question by replying A, B, C or D.'
mmlu_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role="HUMAN",
prompt=
f"{_hint}\nQ: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nA: "
),
dict(role="BOT", prompt="{target}\n")
]),
),
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin="</E>",
round=[
dict(
role="HUMAN",
prompt=
f"{_hint}\nQ: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nA: "
),
],
),
ice_token="</E>",
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=first_capital_postprocess))
mmlu_datasets.append(
dict(
abbr=f"lukaemon_mmlu_{_name}",
type=MMLUDataset,
path="./data/mmlu/",
name=_name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
))
del _name, _hint
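This 5-shot config renders the `ice_template` for the examples picked by `FixKRetriever` and splices the result in at the `</E>` ice token of `prompt_template`. A string-level sketch of that assembly (templates simplified and the questions made up):

```python
ice_template = "{hint}\nQ: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nA: {target}\n"
prompt_template = "</E>{hint}\nQ: {input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nA: "

hint = ("There is a single choice question about anatomy. "
        "Answer the question by replying A, B, C or D.")
dev_examples = [
    dict(input="Which bone is found in the upper arm?",
         A="Femur", B="Humerus", C="Tibia", D="Fibula", target="B"),
]
test_item = dict(input="Which organ pumps blood?",
                 A="Liver", B="Lung", C="Heart", D="Kidney")

# Render the in-context examples, then splice them in at the ice token.
ice = "".join(ice_template.format(hint=hint, **ex) for ex in dev_examples)
prompt = prompt_template.replace("</E>", ice).format(hint=hint, **test_item)
print(prompt)
```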

View File

@ -1,110 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
# None of the MMLU datasets on HuggingFace are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
input_columns=["input", "A", "B", "C", "D"],
output_column="target",
train_split='dev')
mmlu_all_sets = [
"college_biology",
"college_chemistry",
"college_computer_science",
"college_mathematics",
"college_physics",
"electrical_engineering",
"astronomy",
"anatomy",
"abstract_algebra",
"machine_learning",
"clinical_knowledge",
"global_facts",
"management",
"nutrition",
"marketing",
"professional_accounting",
"high_school_geography",
"international_law",
"moral_scenarios",
"computer_security",
"high_school_microeconomics",
"professional_law",
"medical_genetics",
"professional_psychology",
"jurisprudence",
"world_religions",
"philosophy",
"virology",
"high_school_chemistry",
"public_relations",
"high_school_macroeconomics",
"human_sexuality",
"elementary_mathematics",
"high_school_physics",
"high_school_computer_science",
"high_school_european_history",
"business_ethics",
"moral_disputes",
"high_school_statistics",
"miscellaneous",
"formal_logic",
"high_school_government_and_politics",
"prehistory",
"security_studies",
"high_school_biology",
"logical_fallacies",
"high_school_world_history",
"professional_medicine",
"high_school_mathematics",
"college_medicine",
"high_school_us_history",
"sociology",
"econometrics",
"high_school_psychology",
"human_aging",
"us_foreign_policy",
"conceptual_physics",
]
mmlu_datasets = []
for _name in mmlu_all_sets:
_hint = f'The following are multiple choice questions (with answers) about {_name.replace("_", " ")}.\n\n'
mmlu_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=
"{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\nAnswer: {target}\n",
),
prompt_template=dict(
type=PromptTemplate,
template=
f"{_hint}</E>{{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer:",
ice_token="</E>",
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=first_capital_postprocess),
)
mmlu_datasets.append(
dict(
abbr=f"lukaemon_mmlu_{_name}",
type=MMLUDataset,
path="./data/mmlu/",
name=_name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
))
del _name, _hint

View File

@ -1,124 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
# None of the MMLU datasets on HuggingFace are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
input_columns=["input", "A", "B", "C", "D"],
output_column="target",
train_split='dev')
mmlu_all_sets = [
"college_biology",
"college_chemistry",
"college_computer_science",
"college_mathematics",
"college_physics",
"electrical_engineering",
"astronomy",
"anatomy",
"abstract_algebra",
"machine_learning",
"clinical_knowledge",
"global_facts",
"management",
"nutrition",
"marketing",
"professional_accounting",
"high_school_geography",
"international_law",
"moral_scenarios",
"computer_security",
"high_school_microeconomics",
"professional_law",
"medical_genetics",
"professional_psychology",
"jurisprudence",
"world_religions",
"philosophy",
"virology",
"high_school_chemistry",
"public_relations",
"high_school_macroeconomics",
"human_sexuality",
"elementary_mathematics",
"high_school_physics",
"high_school_computer_science",
"high_school_european_history",
"business_ethics",
"moral_disputes",
"high_school_statistics",
"miscellaneous",
"formal_logic",
"high_school_government_and_politics",
"prehistory",
"security_studies",
"high_school_biology",
"logical_fallacies",
"high_school_world_history",
"professional_medicine",
"high_school_mathematics",
"college_medicine",
"high_school_us_history",
"sociology",
"econometrics",
"high_school_psychology",
"human_aging",
"us_foreign_policy",
"conceptual_physics",
]
mmlu_datasets = []
for _name in mmlu_all_sets:
_hint = f'There is a single choice question about {_name.replace("_", " ")}. Answer the question by replying A, B, C or D.'
mmlu_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role="HUMAN",
prompt=
f"{_hint}\nQuestion: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: "
),
dict(role="BOT", prompt="{target}\n")
]),
),
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin="</E>",
round=[
dict(
role="HUMAN",
prompt=
f"{_hint}\nQ: {{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nA: "
),
],
),
ice_token="</E>",
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
inferencer=dict(type=GenInferencer),
)
mmlu_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_postprocessor=dict(type=first_capital_postprocess))
mmlu_datasets.append(
dict(
abbr=f"lukaemon_mmlu_{_name}",
type=MMLUDataset,
path="./data/mmlu/",
name=_name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
))
del _name, _hint

View File

@ -1,113 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import MMLUDataset
# None of the MMLU datasets on HuggingFace are parsed correctly, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
input_columns=["input", "A", "B", "C", "D"],
output_column="target",
train_split='dev')
mmlu_all_sets = [
"college_biology",
"college_chemistry",
"college_computer_science",
"college_mathematics",
"college_physics",
"electrical_engineering",
"astronomy",
"anatomy",
"abstract_algebra",
"machine_learning",
"clinical_knowledge",
"global_facts",
"management",
"nutrition",
"marketing",
"professional_accounting",
"high_school_geography",
"international_law",
"moral_scenarios",
"computer_security",
"high_school_microeconomics",
"professional_law",
"medical_genetics",
"professional_psychology",
"jurisprudence",
"world_religions",
"philosophy",
"virology",
"high_school_chemistry",
"public_relations",
"high_school_macroeconomics",
"human_sexuality",
"elementary_mathematics",
"high_school_physics",
"high_school_computer_science",
"high_school_european_history",
"business_ethics",
"moral_disputes",
"high_school_statistics",
"miscellaneous",
"formal_logic",
"high_school_government_and_politics",
"prehistory",
"security_studies",
"high_school_biology",
"logical_fallacies",
"high_school_world_history",
"professional_medicine",
"high_school_mathematics",
"college_medicine",
"high_school_us_history",
"sociology",
"econometrics",
"high_school_psychology",
"human_aging",
"us_foreign_policy",
"conceptual_physics",
]
mmlu_datasets = []
for _name in mmlu_all_sets:
_hint = f'The following are multiple choice questions (with answers) about {_name.replace("_", " ")}.\n\n'
mmlu_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template={
opt:
f"{{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: {opt}\n"
for opt in ["A", "B", "C", "D"]
},
),
prompt_template=dict(
type=PromptTemplate,
template={
opt:
f"{_hint}</E>{{input}}\nA. {{A}}\nB. {{B}}\nC. {{C}}\nD. {{D}}\nAnswer: {opt}"
for opt in ["A", "B", "C", "D"]
},
ice_token="</E>",
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 1, 2, 3, 4]),
inferencer=dict(type=PPLInferencer),
)
mmlu_eval_cfg = dict(evaluator=dict(type=AccEvaluator), )
mmlu_datasets.append(
dict(
abbr=f"lukaemon_mmlu_{_name}",
type=MMLUDataset,
path="./data/mmlu/",
name=_name,
reader_cfg=mmlu_reader_cfg,
infer_cfg=mmlu_infer_cfg,
eval_cfg=mmlu_eval_cfg,
))
del _name, _hint
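Unlike the generation-based MMLU configs, this one uses `PPLInferencer`: each option A-D gets its own fully rendered prompt, and the prediction is the option the model scores as most likely (lowest loss/perplexity). A toy sketch of that selection rule with made-up scores:

```python
# Hypothetical per-option losses for one question (lower = more likely).
option_losses = {"A": 2.31, "B": 1.87, "C": 2.90, "D": 2.45}

prediction = min(option_losses, key=option_losses.get)
print(prediction)  # -> B
```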

View File

@ -1,49 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import RaceDataset
race_reader_cfg = dict(
input_columns=['article', 'question', 'A', 'B', 'C', 'D'],
output_column='answer',
train_split="validation",
test_split="test"
)
race_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
'Read the article, and answer the question by replying A, B, C or D.\n\n{article}\n\nQ: {question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer: A',
'B':
'Read the article, and answer the question by replying A, B, C or D.\n\n{article}\n\nQ: {question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer: B',
'C':
'Read the article, and answer the question by replying A, B, C or D.\n\n{article}\n\nQ: {question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer: C',
'D':
'Read the article, and answer the question by replying A, B, C or D.\n\n{article}\n\nQ: {question}\n\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n\nAnswer: D',
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
race_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
race_datasets = [
dict(
abbr='race-middle',
type=RaceDataset,
path='./data/race',
name='middle',
reader_cfg=race_reader_cfg,
infer_cfg=race_infer_cfg,
eval_cfg=race_eval_cfg),
dict(
abbr='race-high',
type=RaceDataset,
path='./data/race',
name='high',
reader_cfg=race_reader_cfg,
infer_cfg=race_infer_cfg,
eval_cfg=race_eval_cfg)
]

View File

@ -1,61 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import LMEvaluator
from opencompass.datasets.subjective_cmp import SubjectiveCmpDataset
subjective_reader_cfg = dict(
input_columns=['question', 'index', 'reference_answer', 'evaluating_guidance', 'capability', 'prompt'],
output_column=None,
train_split='test')
subjective_all_sets = [
"subjective_demo",
]
subjective_datasets = []
for _name in subjective_all_sets:
subjective_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt="{question}"
),
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
subjective_eval_cfg = dict(
evaluator=dict(
type=LMEvaluator,
cmp_order='both',
prompt_template=dict(
type=PromptTemplate,
template=dict(
begin=[
dict(
role="SYSTEM",
fallback_role="HUMAN",
prompt="{prompt}"
),
],
round=[dict(role="HUMAN",
prompt="回答 1: <回答 1 开始> {prediction} <回答 1 结束>\n回答 2: <回答 2 开始> {prediction2} <回答 2 结束>\n")]))),
pred_role="BOT",
)
subjective_datasets.append(
dict(
abbr=f"{_name}",
type=SubjectiveCmpDataset,
path="./data/subjective/",
name=_name,
reader_cfg=subjective_reader_cfg,
infer_cfg=subjective_infer_cfg,
eval_cfg=subjective_eval_cfg
))
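`cmp_order='both'` makes the judge compare each pair of answers twice, once in each presentation order, a common way to reduce position bias. A tiny sketch of one way the two verdicts could be combined (the aggregation rule here is an assumption for illustration, not the LMEvaluator's exact logic):

```python
def combine_verdicts(verdict_ab: str, verdict_ba: str) -> str:
    # Both verdicts are already mapped back to the *original* answer ids,
    # i.e. the swap in presentation order has been undone.
    if verdict_ab == verdict_ba:
        return verdict_ab   # consistent winner across both orders
    return 'tie'            # disagreement between orders -> treat as a tie

print(combine_verdicts('answer_1', 'answer_1'))  # -> answer_1
print(combine_verdicts('answer_1', 'answer_2'))  # -> tie
```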

View File

@ -1,56 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import TydiQADataset, TydiQAEvaluator
# All configs are for TydiQA Goldp task
tydiqa_reader_cfg = dict(
input_columns=["passage_text", "question_text"],
output_column="answer"
)
langs = ['arabic', 'bengali', 'english', 'finnish', 'indonesian', 'japanese', 'korean', 'russian', 'swahili', 'telugu', 'thai']
prefixs_prompt = {
"english": ("Answer the following question based on the information in the given passage.", "Passage:", "Question:", "Answer:"),
"arabic": ("أجب على السؤال التالي بناءً على المعلومات في المقطع المعطى.", "المقطع:", "السؤال:", "الإجابة:"),
"bengali": ("প্রদত্ত অধ্যায়ের তথ্যের উপর ভিত্তি করে নিম্নলিখিত প্রশ্নের উত্তর দিন।", "অধ্যায়:", "প্রশ্ন:", "উত্তর:"),
"finnish": ("Vastaa seuraavaan kysymykseen annetun kappaleen tiedon perusteella.", "Kappale:", "Kysymys:", "Vastaus:"),
"indonesian": ("Jawab pertanyaan berikut berdasarkan informasi di bagian yang diberikan.", "Bagian:", "Pertanyaan:", "Jawaban:"),
"korean": ("주어진 문단의 정보에 기반하여 다음 질문에 답하십시오.", "문단:", "질문:", "답변:"),
"japanese":("文脈に基づいて質問に答えてください。","ぶんしょう:","しつもん:", "かいとう:"),
"russian": ("Ответьте на следующий вопрос на основе информации в данном отрывке.", "Отрывок:", "Вопрос:", "Ответ:"),
"swahili": ("Jibu swali lifuatalo kulingana na habari kwenye kifungu kilichotolewa.", "Kifungu:", "Swali:", "Jibu:"),
"telugu": ("ఇచ్చిన పేరాలోని సమాచారం ఆధారంగా కింది ప్రశ్నకు సమాధానం ఇవ్వండి.", "పేరా:", "ప్రశ్న:", "సమాధానం:"),
"thai":("ตอบคำถามต่อไปนี้โดยอิงตามข้อมูลในตอนข้อความที่กำหนด:", "ตอนข้อความ:", "คำถาม:", "คำตอบ:")
}
tydiqa_datasets = []
for _lang in langs:
_hint = prefixs_prompt[_lang]
tydiqa_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=f"{_hint[0]}\n\n</E>{_hint[1]}{{passage_text}}\n{_hint[2]} {{question_text}}\n{_hint[3]} {{answer}}" ,
ice_token='</E>'
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=50)
)
tydiqa_eval_cfg = dict(
evaluator=dict(type=TydiQAEvaluator),
ds_split='validation',
ds_column='answer',
)
tydiqa_datasets.append(
dict(abbr=f'tydiqa-goldp_{_lang}',
type=TydiQADataset,
path='./data/tydiqa',
lang=_lang,
reader_cfg=tydiqa_reader_cfg,
infer_cfg=tydiqa_infer_cfg,
eval_cfg=tydiqa_eval_cfg
)
)

View File

@ -1,25 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HFDataset
z_bench_reader_cfg = dict(
input_columns=['text'], output_column='category', train_split='test')
z_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template='{text}',
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
z_bench_datasets = dict(
type=HFDataset,
path=
'/mnt/petrelfs/gaotong/llm_eval/openagieval_dataset/eval_datasets/z_bench',
data_dir=
'/mnt/petrelfs/gaotong/llm_eval/openagieval_dataset/eval_datasets/z_bench',
name='question',
reader_cfg=z_bench_reader_cfg,
infer_cfg=z_bench_infer_cfg)

View File

@ -1,28 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import HFDataset
z_bench_reader_cfg = dict(
ds_size=4,
input_columns=['text'],
output_column='category',
train_split='test')
z_bench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role="HUMAN", prompt="{text}")]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer))
z_bench_datasets = dict(
type=HFDataset,
path=
'/mnt/petrelfs/gaotong/llm_eval/openagieval_dataset/eval_datasets/z_bench',
data_dir=
'/mnt/petrelfs/gaotong/llm_eval/openagieval_dataset/eval_datasets/z_bench',
name='question',
reader_cfg=z_bench_reader_cfg,
infer_cfg=z_bench_infer_cfg)

View File

@ -1,11 +0,0 @@
from mmengine.config import read_base
with read_base():
from .datasets.ceval.ceval_gen import ceval_datasets
from .datasets.cmmlu.cmmlu_gen import cmmlu_datasets
from .datasets.agieval.agieval_gen import agieval_datasets
from .datasets.bbh.bbh_gen import bbh_datasets
from .datasets.mmlu.mmlu_gen import mmlu_datasets
from .models.alaya.alaya import models
datasets = [*bbh_datasets, *ceval_datasets, *cmmlu_datasets, *agieval_datasets, *mmlu_datasets]

View File

@ -1,80 +0,0 @@
from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner, SlurmRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.models import OpenAI
from opencompass.lagent.actions.ipython_interpreter import IPythonInterpreter
from opencompass.lagent.agents.react import CIReAct
from opencompass.models.lagent import CodeAgent
from lagent.agents.react import ReActProtocol
with read_base():
from .datasets.CIBench.CIBench_gen_eb42f9 import cibench_datasets as datasets
FORCE_STOP_PROMPT_EN = """You should directly give results based on history information."""
FEWSHOT_INSTRUCTION = """\
You are an assistant who can utilize external tools.
{tool_description}
To use a tool, please response with the following format:
```
{thought} Think what you need to solve, do you need to use tools?
{action} The tool name, should be one of [{action_names}].
{action_input} The input to the tool that you want to use.
```
The tool will give you response after your response using the following format:
```
{response} the results after call the tool.
```
Therefore DO NOT generate tool response by yourself.
Also please follow the guidelines:
1. Always use code interpreter to solve the problem.
2. The generated codes should always in a markdown code block format.
3. The generated codes will be executed in an ipython manner and the results will be cached.
4. Your responded code should always be simple and only solves the problem in current step.
Begin!
"""
models = [
dict(
abbr='gpt-3.5-turbo',
type=CodeAgent,
agent_type=CIReAct,
mutli_rounds=True,
max_turn=3,
llm=dict(
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
),
actions=[
dict(
type=IPythonInterpreter,
description=
'''It can run Python code in a manner as jupyter notebook. The code must be a valid code that contains only python method.
'''),
],
protocol=dict(
type=ReActProtocol,
call_protocol=FEWSHOT_INSTRUCTION,
force_stop=FORCE_STOP_PROMPT_EN,
action=dict(role='ACTION', begin='Tool:', end='\n'),
action_input=dict(role='ARGS', begin='Tool Input:', end='\n'),
response=dict(role='RESPONSE', begin='Tool Response:', end='\n'),
finish=dict(role='FINISH', begin='Final Answer:', end='\n'),
),
batch_size=8,
),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=50, gen_task_coef=1),
runner=dict(
type=SlurmRunner, max_num_workers=8, retry=2,
task=dict(type=OpenICLInferTask)),
)
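The `FEWSHOT_INSTRUCTION` above describes a ReAct-style protocol: the model alternates thought / action / action-input turns, the IPython tool replies, and the loop ends at a final answer or after `max_turn` rounds (then the force-stop prompt applies). A stripped-down sketch of that control flow (the `llm` and `run_tool` stubs are placeholders):

```python
def llm(history: str) -> dict:
    # Placeholder model: immediately emits a final answer.
    return {'finish': '42'}

def run_tool(action: str, action_input: str) -> str:
    return 'tool output'   # placeholder for the IPython interpreter

def react_loop(question: str, max_turn: int = 3) -> str:
    history = question
    for _ in range(max_turn):
        step = llm(history)
        if 'finish' in step:
            return step['finish']            # "Final Answer: ..."
        observation = run_tool(step['action'], step['action_input'])
        history += f"\nTool Response: {observation}"
    return 'answer directly from history'    # FORCE_STOP_PROMPT_EN path

print(react_loop('What is 6 * 7?'))
```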

View File

@ -1,91 +0,0 @@
from mmengine.config import read_base
from opencompass.datasets.circular import (CircularCEvalDataset, CircularMMLUDataset, CircularCMMLUDataset, CircularCSQADataset,
CircularARCDataset, CircularHSWAGDataset, CircularOBQADataset, CircularRaceDataset, CircularEvaluator)
from opencompass.summarizers import CircularSummarizer
with read_base():
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets
from .datasets.hellaswag.hellaswag_gen_6faab5 import hellaswag_datasets
from .datasets.ARC_e.ARC_e_gen_1e0de5 import ARC_e_datasets
from .datasets.ARC_c.ARC_c_gen_1e0de5 import ARC_c_datasets
from .datasets.commonsenseqa.commonsenseqa_gen_1da2d0 import commonsenseqa_datasets
from .datasets.obqa.obqa_gen_9069e4 import obqa_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .models.hf_internlm.hf_internlm_chat_7b import models as hf_internlm_chat_7b_model
from .models.hf_internlm.hf_internlm_chat_20b import models as hf_internlm_chat_20b_model
from .models.qwen.hf_qwen_7b_chat import models as hf_qwen_7b_chat_model
from .models.qwen.hf_qwen_14b_chat import models as hf_qwen_14b_chat_model
from .summarizers.groups.mmlu import mmlu_summary_groups
from .summarizers.groups.cmmlu import cmmlu_summary_groups
from .summarizers.groups.ceval import ceval_summary_groups
for ds, t in [
(ceval_datasets, CircularCEvalDataset),
(mmlu_datasets, CircularMMLUDataset),
(cmmlu_datasets, CircularCMMLUDataset),
(hellaswag_datasets, CircularHSWAGDataset),
(ARC_e_datasets, CircularARCDataset),
(ARC_c_datasets, CircularARCDataset),
(commonsenseqa_datasets, CircularCSQADataset),
(obqa_datasets, CircularOBQADataset),
(race_datasets, CircularRaceDataset),
]:
for d in ds:
d['type'] = t
d['abbr'] = d['abbr'] + '-circular-4'
d['eval_cfg']['evaluator'] = {'type': CircularEvaluator, 'circular_pattern': 'circular'}
d['circular_patterns'] = 'circular'
datasets = sum([v for k, v in locals().items() if k.endswith("_datasets") or k == 'datasets'], [])
models = sum([v for k, v in locals().items() if k.endswith("_model")], [])
# config summarizer
other_summary_groups = [
{'name': 'average',
'subsets': ['ceval', 'mmlu', 'cmmlu', 'hellaswag', 'ARC-e', 'ARC-c', 'commonsense_qa', 'openbookqa_fact', 'race-middle', 'race-high']},
]
origin_summary_groups = sum([v for k, v in locals().items() if k.endswith("_summary_groups")], [])
new_summary_groups = []
for item in origin_summary_groups:
new_summary_groups.append(
{
'name': item['name'] + '-circular-4',
'subsets': [i + '-circular-4' for i in item['subsets']],
}
)
summarizer = dict(
type=CircularSummarizer,
metric_types=['acc_origin', 'perf_circular'],
dataset_abbrs = [
'average-circular-4',
'ceval-circular-4',
'mmlu-circular-4',
'cmmlu-circular-4',
'hellaswag-circular-4',
'ARC-e-circular-4',
'ARC-c-circular-4',
'commonsense_qa-circular-4',
'openbookqa_fact-circular-4',
'race-middle-circular-4',
'race-high-circular-4',
'ceval-humanities-circular-4',
'ceval-stem-circular-4',
'ceval-social-science-circular-4',
'ceval-other-circular-4',
'mmlu-humanities-circular-4',
'mmlu-stem-circular-4',
'mmlu-social-science-circular-4',
'mmlu-other-circular-4',
'cmmlu-humanities-circular-4',
'cmmlu-stem-circular-4',
'cmmlu-social-science-circular-4',
'cmmlu-other-circular-4',
'cmmlu-china-specific-circular-4',
],
summary_groups=new_summary_groups,
)
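"Circular" evaluation asks each multiple-choice question several times with the options rotated; `perf_circular` only credits a question when every rotation is answered correctly, while `acc_origin` is the usual single-pass accuracy. A small sketch of generating the four rotations (the bookkeeping is illustrative, not the CircularDataset code):

```python
from collections import deque

options = ['Paris', 'London', 'Rome', 'Berlin']   # gold answer: 'Paris'
labels = ['A', 'B', 'C', 'D']

rotations = []
d = deque(options)
for _ in range(4):
    rotated = list(d)
    gold_label = labels[rotated.index('Paris')]   # track where the gold answer moved
    rotations.append((dict(zip(labels, rotated)), gold_label))
    d.rotate(-1)

for opts, gold in rotations:
    print(opts, '->', gold)
```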

View File

@ -1,65 +0,0 @@
# This config is used for pass@k evaluation with dataset repetition,
# for models that cannot generate multiple responses for a single input.
from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner
from opencompass.models import HuggingFaceCausalLM
from opencompass.runners import LocalRunner
from opencompass.partitioners import SizePartitioner
from opencompass.tasks import OpenICLInferTask
from opencompass.datasets import MBPPDataset_V2, MBPPPassKEvaluator
with read_base():
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from .datasets.mbpp.mbpp_gen_1e1056 import mbpp_datasets
humaneval_datasets[0]['abbr'] = 'openai_humaneval_pass10'
humaneval_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['abbr'] = 'mbpp_pass10'
mbpp_datasets[0]['num_repeats'] = 10
mbpp_datasets[0]['type'] = MBPPDataset_V2
mbpp_datasets[0]['eval_cfg']['evaluator']['type'] = MBPPPassKEvaluator
mbpp_datasets[0]['reader_cfg']['output_column'] = 'test_column'
datasets = []
datasets += humaneval_datasets
datasets += mbpp_datasets
_meta_template = dict(
round=[
dict(role="HUMAN", begin="<|User|>:", end="\n"),
dict(role="BOT", begin="<|Bot|>:", end="<eoa>\n", generate=True),
],
)
models = [
dict(
abbr="internlm-chat-7b-hf-v11",
type=HuggingFaceCausalLM,
path="internlm/internlm-chat-7b-v1_1",
tokenizer_path="internlm/internlm-chat-7b-v1_1",
tokenizer_kwargs=dict(
padding_side="left",
truncation_side="left",
use_fast=False,
trust_remote_code=True,
),
max_seq_len=2048,
meta_template=_meta_template,
model_kwargs=dict(trust_remote_code=True, device_map="auto"),
generation_kwargs=dict(
do_sample=True,
top_p=0.95,
temperature=0.8,
),
run_cfg=dict(num_gpus=1, num_procs=1),
batch_size=8,
)
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=600),
runner=dict(
type=LocalRunner, max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)
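`num_repeats = 10` re-asks every problem ten times so that pass@k can be estimated even though the backend returns a single completion per request. The standard unbiased estimator from the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k) for n samples with c correct, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for n samples with c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=10))  # k == n: any correct sample gives 1.0
print(pass_at_k(n=10, c=3, k=1))   # ~0.3
```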

View File

@ -1,52 +0,0 @@
from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.models import OpenAI, HuggingFaceCausalLM
from opencompass.models.lagent import CodeAgent
with read_base():
from .datasets.math.math_gen_943d32 import math_datasets
from .datasets.gsm8k.gsm8k_gen_57b0b1 import gsm8k_datasets
datasets = []
datasets += gsm8k_datasets
datasets += math_datasets
models = [
dict(
abbr='gpt-3.5-react',
type=CodeAgent,
llm=dict(
type=OpenAI,
path='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
),
batch_size=8),
dict(
abbr='WizardCoder-Python-13B-V1.0-react',
type=CodeAgent,
llm=dict(
type=HuggingFaceCausalLM,
path="WizardLM/WizardCoder-Python-13B-V1.0",
tokenizer_path='WizardLM/WizardCoder-Python-13B-V1.0',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_seq_len=2048,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
),
batch_size=8,
run_cfg=dict(num_gpus=2, num_procs=1)),
]
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=40000),
runner=dict(
type=LocalRunner, max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)

View File

@ -1,7 +0,0 @@
from mmengine.config import read_base
with read_base():
from .datasets.humanevalx.humanevalx_gen import humanevalx_datasets
from .models.codegeex2.hf_codegeex2_6b import models
datasets = humanevalx_datasets

View File

@ -1,10 +0,0 @@
from mmengine.config import read_base
with read_base():
from .datasets.siqa.siqa_gen import siqa_datasets
from .datasets.winograd.winograd_ppl import winograd_datasets
from .models.opt.hf_opt_125m import opt125m
from .models.opt.hf_opt_350m import opt350m
datasets = [*siqa_datasets, *winograd_datasets]
models = [opt125m, opt350m]

View File

@ -1,36 +0,0 @@
from mmengine.config import read_base
from opencompass.models import OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# choose a list of datasets
from .datasets.collections.chat_medium import datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
models = [
dict(abbr='GPT-3.5-turbo-0613',
type=OpenAI, path='gpt-3.5-turbo-0613',
key='ENV', # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
meta_template=api_meta_template,
query_per_second=1,
max_out_len=2048, max_seq_len=4096, batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask)),
)
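As the inline comment on the model entry notes, `key='ENV'` defers the OpenAI token lookup to the environment ($OPENAI_API_KEY). A minimal sketch of that convention, assuming the lookup happens inside `opencompass.models.OpenAI` (illustrative only):

```python
import os

def resolve_openai_key(key: str) -> str:
    # 'ENV' means: read the token from $OPENAI_API_KEY instead of hard-coding it in the config.
    if key == 'ENV':
        token = os.environ.get('OPENAI_API_KEY', '')
        if not token:
            raise RuntimeError("key='ENV' requires OPENAI_API_KEY to be set before launching the run.")
        return token
    return key

# Typical usage: export OPENAI_API_KEY=... in the shell, then keep key='ENV' in the config.
```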

@@ -1,40 +0,0 @@
from mmengine.config import read_base
from opencompass.models import OpenAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from .datasets.collections.chat_medium import datasets
from .summarizers.medium import summarizer
# GPT4 needs a special humaneval postprocessor
from opencompass.datasets.humaneval import humaneval_gpt_postprocess
for _dataset in datasets:
if _dataset['path'] == 'openai_humaneval':
_dataset['eval_cfg']['pred_postprocessor']['type'] = humaneval_gpt_postprocess
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
models = [
dict(abbr='GPT4',
type=OpenAI, path='gpt-4-0613',
key='ENV', # The key will be obtained from $OPENAI_API_KEY, but you can write down your key here as well
meta_template=api_meta_template,
query_per_second=1,
max_out_len=2048, max_seq_len=2048, batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=4,
task=dict(type=OpenICLInferTask)),
)

@@ -1,8 +0,0 @@
from mmengine.config import read_base
with read_base():
from .datasets.collections.base_medium_llama import piqa_datasets, siqa_datasets
from .models.hf_llama.hf_llama_7b import models
datasets = [*piqa_datasets, *siqa_datasets]

@@ -1,9 +0,0 @@
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from .datasets.collections.base_medium import datasets
# choose a model of interest
from .models.internlm.internlm_7b import models
# and output the results in a chosen format
from .summarizers.medium import summarizer

@@ -1,9 +0,0 @@
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from .datasets.collections.base_medium import datasets
# choose a model of interest
from .models.hf_internlm.hf_internlm_7b import models
# and output the results in a chosen format
from .summarizers.medium import summarizer

@@ -1,116 +0,0 @@
from mmengine.config import read_base
from opencompass.models.turbomind import TurboMindModel
with read_base():
# choose a list of datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
meta_template = dict(
round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
],
eos_token_id=103028)
# config for internlm-chat-7b
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-chat-7b-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# meta_template=meta_template,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-chat-7b-w4 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-chat-7b-w4-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# meta_template=meta_template,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-chat-7b-w4kv8 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-chat-7b-w4kv8-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# meta_template=meta_template,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-chat-20b
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-chat-20b-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=8,
# concurrency=8,
# meta_template=meta_template,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-chat-20b-w4 model
models = [
dict(
type=TurboMindModel,
abbr='internlm-chat-20b-w4-turbomind',
path="./turbomind",
max_out_len=100,
max_seq_len=2048,
batch_size=16,
concurrency=16,
meta_template=meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
# config for internlm-chat-20b-w4kv8 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-chat-20b-w4kv8-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=16,
# concurrency=16,
# meta_template=meta_template,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]

@@ -1,40 +0,0 @@
from mmengine.config import read_base
from opencompass.models.turbomind_tis import TurboMindTisModel
with read_base():
# choose a list of datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.SuperGLUE_WSC.SuperGLUE_WSC_gen_6dc406 import WSC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from .datasets.race.race_gen_69ee4f import race_datasets
from .datasets.crowspairs.crowspairs_gen_381af0 import crowspairs_datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
meta_template = dict(
round=[
dict(role='HUMAN', begin='<|User|>:', end='\n'),
dict(role='BOT', begin='<|Bot|>:', end='<eoa>\n', generate=True),
],
eos_token_id=103028)
models = [
dict(
type=TurboMindTisModel,
abbr='internlm-chat-20b-turbomind',
path="internlm",
tis_addr='0.0.0.0:33337',
max_out_len=100,
max_seq_len=2048,
batch_size=8,
meta_template=meta_template,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,101 +0,0 @@
from mmengine.config import read_base
from opencompass.models.turbomind import TurboMindModel
with read_base():
# choose a list of datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
# # config for internlm-7b model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-7b-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# # config for internlm-7b-w4 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-7b-w4-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# # config for internlm-7b-w4kv8 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-7b-w4kv8-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=32,
# concurrency=32,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-20b model
models = [
dict(
type=TurboMindModel,
abbr='internlm-20b-turbomind',
path="./turbomind",
max_out_len=100,
max_seq_len=2048,
batch_size=8,
concurrency=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]
# config for internlm-20b-w4 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-20b-w4-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=16,
# concurrency=16,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]
# config for internlm-20b-w4kv8 model
# models = [
# dict(
# type=TurboMindModel,
# abbr='internlm-20b-w4kv8-turbomind',
# path="./turbomind",
# max_out_len=100,
# max_seq_len=2048,
# batch_size=16,
# concurrency=16,
# run_cfg=dict(num_gpus=1, num_procs=1),
# )
# ]

@@ -1,28 +0,0 @@
from mmengine.config import read_base
from opencompass.models.turbomind_tis import TurboMindTisModel
with read_base():
# choose a list of datasets
from .datasets.mmlu.mmlu_gen_a484b3 import mmlu_datasets
from .datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from .datasets.SuperGLUE_WiC.SuperGLUE_WiC_gen_d06864 import WiC_datasets
from .datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
from .datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from .datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
# and output the results in a chosen format
from .summarizers.medium import summarizer
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
models = [
dict(
type=TurboMindTisModel,
abbr='internlm-chat-20b-turbomind',
path="internlm",
tis_addr='0.0.0.0:33337',
max_out_len=100,
max_seq_len=2048,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,33 +0,0 @@
from mmengine.config import read_base
from opencompass.models import LightllmAPI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from .datasets.humaneval.humaneval_gen import humaneval_datasets
datasets = [*humaneval_datasets]
models = [
dict(
abbr='LightllmAPI',
type=LightllmAPI,
url='http://localhost:8080/generate',
max_out_len=1024,
batch_size=8,
generation_kwargs=dict(
do_sample=False,
ignore_eos=False,
),
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalRunner,
max_num_workers=8,
task=dict(type=OpenICLInferTask),
),
)

@@ -1,8 +0,0 @@
from mmengine.config import read_base
with read_base():
from .datasets.collections.base_medium_llama import piqa_datasets, siqa_datasets
from .models.llama.llama2_7b import models
datasets = [*piqa_datasets, *siqa_datasets]

@@ -1,148 +0,0 @@
from mmengine.config import read_base
from opencompass.partitioners import SizePartitioner
from opencompass.runners import LocalRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.openicl import AgentInferencer
with read_base():
from .summarizers.medium import summarizer
from .datasets.gsm8k.gsm8k_gen import gsm8k_datasets as datasets
from opencompass.models.lagent import LagentAgent
from lagent.llms import GPTAPI
from lagent.agents.react import ReAct, ReActProtocol
from lagent.actions import PythonInterpreter
FORCE_STOP_PROMPT_EN = """You should directly give results based on history information."""
FEWSHOT_INSTRUCTION = """\
You are an assistant who can utilize external tools.
{tool_description}
To use a tool, please use the following format:
```
{thought} Think about what you need to solve; do you need to use tools?
{action} the tool name, should be one of [{action_names}]
{action_input} the input to the action
```
I will give you the response after the tool call, using the following format:
```
{response} the results after calling the tool.
```
If you already know the answer, or you do not need to use tools,
please use the following format to reply:
```
{thought} the thought process to get the final answer
{finish} final answer
```
Examples:
<HUMAN>A group of 4 fruit baskets contains 9 apples, 15 oranges, and 14 bananas in the first three baskets and 2 less of each fruit in the fourth basket. How many fruits are there?
<ASSISTANT>{thought} We need to calculate the total number of fruits. The total number of fruits in the first three baskets is given, while for the fourth basket, we need to subtract 2 from each fruit category. We can solve this problem using simple arithmetic.
{action} PythonInterpreter
{action_input}
```python
def solution():
# Fruits in the first three baskets
apples_first_three = 9
oranges_first_three = 15
bananas_first_three = 14
# Fruits in the fourth basket
apples_fourth = apples_first_three - 2
oranges_fourth = oranges_first_three - 2
bananas_fourth = bananas_first_three - 2
# Total fruits
total_fruits = ((apples_first_three + oranges_first_three + bananas_first_three) * 3 +
apples_fourth + oranges_fourth + bananas_fourth)
return {{"total_fruits": total_fruits}}
```
<SYSTEM>{response}{{'total_fruits': 146}}
<ASSISTANT> {thought} By adding the given numbers of apples, oranges, and bananas in the first three baskets, then subtracting 2 from each category for the fourth basket, we have found the total number of fruits.
{finish} 146
<HUMAN>Bella has two times as many marbles as frisbees. She also has 20 more frisbees than deck cards. If she buys 2/5 times more of each item, what would be the total number of the items she will have if she currently has 60 marbles?
<ASSISTANT>{thought} This is a problem that requires solving equations. We know the relationship between the number of marbles, frisbees, and deck cards. Bella has twice as many marbles as frisbees, and 20 more frisbees than deck cards. Finally, we are told Bella buys 2/5 times more of each item. This purchasing will increase the number of each type of item.
{action} PythonInterpreter
{action_input}
```python
def solution():
# Given number of marbles
marbles_now = 60
# Calculate number of frisbees and deck cards now
frisbees_now = marbles_now / 2
cards_now = frisbees_now - 20
# Calculate number of each item after buying more
marbles_then = marbles_now + (2/5) * marbles_now
frisbees_then = frisbees_now + (2/5) * frisbees_now
cards_then = cards_now + (2/5)*cards_now
# Total number of items then
total_items = marbles_then + frisbees_then + cards_then
return {{"total_items": total_items}}
```
<SYSTEM>{response}{{'total_items': 140.0}}
<ASSISTANT>{thought} By establishing the relationships between the numbers of marbles, frisbees, and deck cards that Bella currently has, we can calculate how many of each item she will have after buying 2/5 more of each. Adding these quantities together gives us the total number of items.
{finish} 140
Begin!
"""
PYTHON_INTERPRETER_DESCRIPTION = '''\
It can run Python code. The code must be valid Python that contains only a method named 'solution', which returns a dict whose keys are variable names. The libraries I recommend are sympy and scipy. The format is:
```python
# import packages
import xxx
def solution():
# initialize some variables
variable_names_with_real_meaning = xxx
# middle steps
mid_variable = func(mid_variable)
# final answer
final_answer = func(mid_variable)
return final_answer
```'''
models = [
dict(abbr='gpt-3.5-react',
type=LagentAgent,
agent_type=ReAct,
max_turn=3,
llm=dict(
type=GPTAPI,
model_type='gpt-3.5-turbo',
key='ENV',
query_per_second=1,
max_seq_len=4096,
),
actions=[
dict(type=PythonInterpreter,
description=PYTHON_INTERPRETER_DESCRIPTION),
],
protocol=dict(
type=ReActProtocol,
call_protocol=FEWSHOT_INSTRUCTION,
force_stop=FORCE_STOP_PROMPT_EN,
finish=dict(role='FINISH', begin='Final Answer:', end='\n'),
),
batch_size=8),
]
for dataset in datasets:
# Use AgentInferencer instead of GenInferencer
dataset['infer_cfg']['inferencer'] = dict(type=AgentInferencer)
# Use the question as agent input directly.
dataset['infer_cfg']['prompt_template']['template'] = "{question}"
infer = dict(
partitioner=dict(type=SizePartitioner, max_task_size=1000),
runner=dict(
type=LocalRunner,
max_num_workers=16,
task=dict(type=OpenICLInferTask)),
)

@@ -1,11 +0,0 @@
from mmengine.config import read_base
with read_base():
from .models.qwen.hf_qwen_7b_chat import models
from .datasets.lawbench.lawbench_zero_shot_gen_002588 import lawbench_datasets as lawbench_zero_shot_datasets
from .datasets.lawbench.lawbench_one_shot_gen_002588 import lawbench_datasets as lawbench_one_shot_datasets
from .summarizers.lawbench import summarizer
datasets = lawbench_zero_shot_datasets + lawbench_one_shot_datasets
for d in datasets:
d["infer_cfg"]["inferencer"]["save_every"] = 1

@@ -1,24 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='aquila2-7b-hf',
path="BAAI/Aquila2-7B",
tokenizer_path='BAAI/Aquila2-7B',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
use_fast=False,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='baichuan2-13b-base-hf',
path="baichuan-inc/Baichuan2-13B-Base",
tokenizer_path='baichuan-inc/Baichuan2-13B-Base',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
use_fast=False,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(device_map='auto', trust_remote_code=True),
run_cfg=dict(num_gpus=2, num_procs=1),
)
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='baichuan2-7b-base-hf',
path="baichuan-inc/Baichuan2-7B-Base",
tokenizer_path='baichuan-inc/Baichuan2-7B-Base',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
use_fast=False,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(device_map='auto', trust_remote_code=True),
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,24 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='bluelm-7b-base-hf',
path="vivo-ai/BlueLM-7B-Base",
tokenizer_path='vivo-ai/BlueLM-7B-Base',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
use_fast=False,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,24 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='bluelm-7b-base-32k-hf',
path="vivo-ai/BlueLM-7B-Base-32K",
tokenizer_path='vivo-ai/BlueLM-7B-Base-32K',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
use_fast=False,
),
max_out_len=100,
max_seq_len=4096,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,31 +0,0 @@
from opencompass.models import HuggingFaceChatGLM3
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
models = [
dict(
type=HuggingFaceChatGLM3,
abbr='chatglm3-6b-hf',
path='THUDM/chatglm3-6b',
tokenizer_path='THUDM/chatglm3-6b',
model_kwargs=dict(
device_map='auto',
trust_remote_code=True,
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
meta_template=api_meta_template,
max_out_len=100,
max_seq_len=4096,
batch_size=1,
run_cfg=dict(num_gpus=1, num_procs=1)
)
]

@@ -1,24 +0,0 @@
from opencompass.models import HuggingFace
models = [
dict(
type=HuggingFace,
abbr='chatglm3-6b-base-hf',
path='THUDM/chatglm3-6b-base',
tokenizer_path='THUDM/chatglm3-6b-base',
model_kwargs=dict(
trust_remote_code=True,
device_map='auto',
),
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=100,
max_seq_len=4096,
batch_size=8,
run_cfg=dict(num_gpus=1, num_procs=1),
)
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 13B
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-13b',
path="codellama/CodeLlama-13b-hf",
tokenizer_path='codellama/CodeLlama-13b-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=2, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 13B Instruct
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-13b-Instruct',
path="codellama/CodeLlama-13b-Instruct-hf",
tokenizer_path='codellama/CodeLlama-13b-Instruct-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=2, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 13B Python
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-13b-Python',
path="codellama/CodeLlama-13b-Python-hf",
tokenizer_path='codellama/CodeLlama-13b-Python-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=2, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 34B
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-34b',
path="codellama/CodeLlama-34b-hf",
tokenizer_path='codellama/CodeLlama-34b-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=4, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 34B Instruct
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-34b-Instruct',
path="codellama/CodeLlama-34b-Instruct-hf",
tokenizer_path='codellama/CodeLlama-34b-Instruct-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=4, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 34B Python
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-34b-Python',
path="codellama/CodeLlama-34b-Python-hf",
tokenizer_path='codellama/CodeLlama-34b-Python-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=4, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 7B
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-7b',
path="codellama/CodeLlama-7b-hf",
tokenizer_path='codellama/CodeLlama-7b-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=1, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 7B Instruct
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-7b-Instruct',
path="codellama/CodeLlama-7b-Instruct-hf",
tokenizer_path='codellama/CodeLlama-7b-Instruct-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=1, num_procs=1),
),
]

@@ -1,21 +0,0 @@
from opencompass.models import HuggingFaceCausalLM
models = [
# CodeLlama 7B Python
dict(
type=HuggingFaceCausalLM,
abbr='CodeLlama-7b-Python',
path="codellama/CodeLlama-7b-Python-hf",
tokenizer_path='codellama/CodeLlama-7b-Python-hf',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=1024,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto'),
run_cfg=dict(num_gpus=1, num_procs=1),
),
]

@@ -1,21 +0,0 @@
# Only torch >=2.0 is supported for falcon-40b
from opencompass.models import HuggingFaceCausalLM
models = [
dict(
type=HuggingFaceCausalLM,
abbr='falcon-40b-hf',
path='tiiuae/falcon-40b',
tokenizer_path='tiiuae/falcon-40b',
tokenizer_kwargs=dict(
padding_side='left',
truncation_side='left',
trust_remote_code=True,
),
max_out_len=100,
max_seq_len=2048,
batch_size=8,
model_kwargs=dict(trust_remote_code=True, device_map='auto', revision='561820f7eef0cc56a31ea38af15ca1acb07fab5d'),
run_cfg=dict(num_gpus=4, num_procs=1),
)
]

Some files were not shown because too many files have changed in this diff.