Compare commits

...

203 Commits
0.3.5 ... main

Author SHA1 Message Date
Yu Sun
d572761cef
[Dataset] Add Smolinstruct configs (#2127)
Some checks failed
lint / lint (push) Has been cancelled
* 0-shot Smolinstruct

Add 0-shot evaluation and postprocess functions for Smolinstruct

* fix acc postprocessor

* update 0-shot acc postprocessor

* rename 0-shot
2025-05-29 14:09:08 +08:00
Linchen Xiao
408f5caff4
[Dataset] Add SuperGPQA subfield configs (#2124)
* update

* fix lint

* fix lint

* update precommit

* update precommit

* fix lint
2025-05-28 14:12:58 +08:00
Myhs_phz
6f3c670b99
add qwen3 lmdeply (#2126) 2025-05-27 19:41:13 +08:00
zhulinJulia24
c3779ebfc1
[ci] update dlc setting (#2112) 2025-05-22 16:47:57 +08:00
Songyang Zhang
aa2b89b6f8
[Update] Add CascadeEvaluator with Data Replica (#2022)
* Update CascadeEvaluator

* Update CascadeEvaluator

* Update CascadeEvaluator

* Update Config

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update
2025-05-20 16:46:55 +08:00
Dongsheng Zhu
7a7a4517ab
[Update] History code bench pass@k update (#2102)
* bigcodebench

* humaneval

* humanevalx

* humanevalx

* livecodebench

* mbpp

* humaneval_plus

* fix bug

* template

* max_out fix

* template update
2025-05-19 17:03:33 +08:00
kkscilife
8c0ccf9a6b
[CI] Fix Lint error (#2103) 2025-05-16 15:36:45 +08:00
kkscilife
6f3b6a5d12
[CI] Add gitleaks check (#2101) 2025-05-16 14:34:57 +08:00
tcheng
3d1760aba2
[Dataset] Add Scieval (#2089)
* style: pass all formatting hooks (yapf & quote fixer)

* revise name:Add Lifescience Sub-set Support for MMLU & SciEval (datasets + configs + loader)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* all categories of SciEval (datasets + configs + loader+dataset-index.yml)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* revise :SciEval 5shot

---------

Co-authored-by: root <tangcheng231@mails.ucas.edu.cn>
2025-05-14 10:25:03 +08:00
Wei Li
b84518c656
[Dataset] Support MedMCQA and MedBullets benchmark (#2054)
* support medmcqa and medbullets benchmark

* Add Medbullets data folder for benchmark support

* revise gen name

* revise config file & remove csv file & add dataset info to dataset-index.yml

* remove csv file

* remove print in medbullets.py

* revise class name

* update_oss_info

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-05-13 17:10:50 +08:00
zhulinJulia24
d60f59dcab
[CI] update baseline and fix lmdeploy version (#2098)
* update

* update

* update

* update

* update

* update
2025-05-13 14:01:47 +08:00
bittersweet1999
9eaa1f6fec
Update icl_judge_evaluator.py (#2095) 2025-05-13 10:44:24 +08:00
Linchen Xiao
d590f557bb
[Update] OpenaiSDK handle empty content (#2096) 2025-05-12 19:38:30 +08:00
yuehua-s
c492e49e79
[Update] Add o4 in OpenaiSDK (#2083)
* feature:1.add o4-mini;2.o3 or o4-mini only support temperature==1

* feature:change 4o-mini to 4o

---------

Co-authored-by: yuehuazhang <yuehuazhang@tencent.com>
2025-05-12 18:39:44 +08:00
Dongsheng Zhu
2c79dc5227
[Dataset] Add human_eval/mbpp pro (#2092)
* add bench

* update

* bug fix

* time update

* add index

* fix repeat bug
2025-05-12 18:38:13 +08:00
huihui1999
345674f700
[Dataset] Add SciknowEval Dataset (#2070)
* first

* first

* first

* first

* SciKnowEval

* fix hash

* fix dataset-index & use official llm_judge_postprocess

* fix dataset-index.yml

* use official llmjudge_postprocess

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

* merge with main

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-05-12 17:23:44 +08:00
Kun Yuan
8aa18df368
[Dataset] HLE Biomedical version support (#2080)
* HLE Biomedical version support

* set up default category value for hle
2025-05-12 10:14:11 +08:00
huihui1999
44a7024ed5
[Dataset] MedCalc_Bench (#2072)
* MedCalc_Bench

* MedCal_Bench

* add hash

* fix hash

* fix comments &dataset-index yml

* fix lint

* fix lint

* fix lint

* fix lint

* fix lint

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-05-09 16:58:55 +08:00
Linchen Xiao
508e2b0cb2
[Update] Set load_from_cache_file to False (#2085) 2025-05-09 15:21:47 +08:00
Kun Yuan
7bdd3c1904
[Dataset] MMLU_Pro Biomedical Version Support (#2081) 2025-05-09 15:07:26 +08:00
Jin Ye
6097186a95
[Datasets] MedQA, ProteinLMBench; Add Models: huatuogpt, baichuanM1 (#2064)
* Add Datasets: MedQA, ProteinLMBench; Add Models: huatuogpt, baichuanM1

* Fix bugs for MedQA. Add info in dataset-index

* Add version code for MedQA and ProteinLMBench

* Add version code for MedQA and ProteinLMBench
2025-05-09 14:47:44 +08:00
Linchen Xiao
d72df59363
[Revert] Add Lifescience Sub-set Support for SciEval (#2059) (#2087)
This reverts commit c5048bfec7.
2025-05-09 14:46:27 +08:00
tcheng
c5048bfec7
[Dataset] Add Lifescience Sub-set Support for SciEval (#2059)
* style: pass all formatting hooks (yapf & quote fixer)

* revise name:Add Lifescience Sub-set Support for MMLU & SciEval (datasets + configs + loader)

* revise name:Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

* Add Lifescience SciEval (datasets + configs + loader+dataset-index.yml)

---------

Co-authored-by: root <tangcheng231@mails.ucas.edu.cn>
2025-05-09 14:31:12 +08:00
huihui1999
a7f3ac20b2
[Dataset] Add CARDBiomedBench (#2071)
* CARDBiomedBench

* fix hash

* fix dataset-index

* use official llmjudge postprocess

* use official llmjudge_postprocess

* fix lint

* fix init
2025-05-08 19:44:46 +08:00
Mo Li
ff3275edf0
[Update] Add Long-Context configs for Gemma, OREAL, and Qwen2.5 models (#2048)
* [Update] Update Gemma, Oreal, Qwen Config

* fix lint
2025-05-08 19:06:56 +08:00
Wei Li
a685ed7daf
[Dataset] Add nejm ai benchmark (#2063)
* support nejm ai benchmark

* add dataset files

* revise gen name

* revise gen name

* revise class name & remove csv file & add dataset-index.yml info

* update

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-05-08 16:44:05 +08:00
Jiahao Xu
9ec23c145b
[Datasets] Add ClinicBench, PubMedQA and ScienceQA (#2061)
* Add ClinicBench

* Add PubMedQA & ScienceQA & ClinicBench

* Add PubMedQA & ScienceQA & ClinicBench

* Update datasets_info & hf_path

* Update hf_path
2025-05-08 16:25:43 +08:00
Dongsheng Zhu
ba0e32292c
[Feature] Support InternSandbox (#2049)
* internsandbox init

* internsandbox

* dataset_index

* dataset_index_add
2025-05-07 16:42:09 +08:00
谢昕辰
43b2c4ed76
[Fix] Update lawbench data path (#2037) 2025-05-07 16:18:43 +08:00
Dongsheng Zhu
d62b69aaef
[Fix] Fix InternVL model config (#2068)
* intervl-8b&38b

* intervl adjustment

* internvl fix
2025-05-07 15:51:18 +08:00
Linchen Xiao
af8432e1d6
[Update] OpenAI SDK model reasoning content (#2078)
* update

* update

* update
2025-05-07 14:06:40 +08:00
bittersweet1999
ddc9cc0afb
[Add] add a config to Judge dataset all (#2077)
* fix pip version

* fix pip version

* add judgedatasetall

* add judgedatasetall

* add judgedatasetall
2025-05-07 10:57:23 +08:00
bittersweet1999
37cbaf8d92
[Add] Add Judgerbenchv2 (#2067)
* fix pip version

* fix pip version

* add judgerbenchv2

* Update __init__.py
2025-04-30 17:12:34 +08:00
Taolin Zhang
b6148aa198
add Judgebench (#2066)
* add rewardbench

* add rewardbench

* add rmb datasets

* add rmb datasets

* add judgebench

* add judgebench
2025-04-30 15:01:10 +08:00
bittersweet1999
527a80947b
[Add] Add writingbench (#2028)
* fix pip version

* fix pip version

* add writingbench

* add writingbench

* add writingbench

* add writingbench
2025-04-29 16:29:32 +08:00
Taolin Zhang
8c74e6a39e
add RMB Bench (#2056)
* add rewardbench

* add rewardbench

* add rmb datasets

* add rmb datasets
2025-04-27 16:26:01 +08:00
Linchen Xiao
e8bc8c1e8c
[Bug] Concat OpenaiSDK reasoning content (#2041)
* [Bug] Concat OpenaiSDK reasoning content

* [Bug] Concat OpenaiSDK reasoning content

* update

* update
2025-04-25 14:10:33 +08:00
Junnan Liu
97010dc4ce
[Update] Update dataset repeat concatenation (#2039) 2025-04-23 16:16:28 +08:00
Linchen Xiao
dcbf899369
[Bug] Fix SmolInsturct logger import (#2036) 2025-04-23 11:10:30 +08:00
Linchen Xiao
bf74f26603
[Update] Safe SmolInstruct meteor calculation (#2033) 2025-04-22 18:27:48 +08:00
Linchen Xiao
455bb05d1b
[Update] Update dataset configs (#2030)
* [Update] Update dataset configs

* Fix lint
2025-04-21 18:55:06 +08:00
Taolin Zhang
c69110361b
[Add] add rewardbench (#2029)
* add rewardbench

* add rewardbench
2025-04-21 17:18:51 +08:00
JuchengHu
a2093a81ef
[Dataset] Matbench (#2021)
* add support for matbench

* fix dataset path

* fix data load

* fix

* fix lint

---------

Co-authored-by: Jucheng Hu <jucheng.hu.20@ucl.ac.uk>
Co-authored-by: Myhs-phz <demarcia2014@126.com>
2025-04-21 15:50:47 +08:00
Linchen Xiao
b2da1c08a8
[Dataset] Add SmolInstruct, Update Chembench (#2025)
* [Dataset] Add SmolInstruct, Update Chembench

* Add dataset metadata

* update

* update

* update
2025-04-18 17:21:29 +08:00
Linchen Xiao
65ff602cf5
[Update] Fix LLM Judge metrics cacluation & Add reasoning content concat to OpenAI SDK 2025-04-15 11:33:16 +08:00
Myhs_phz
75e7834b59
[Feature] Add Datasets: ClimateQA,Physics (#2017)
* feat ClimateQA

* feat PHYSICS

* fix

* fix

* fix

* fix
2025-04-14 20:18:47 +08:00
Linchen Xiao
6a6a1a5c0b
[Feature] LLM Judge sanity check (#2012)
* update

* update
2025-04-11 19:01:39 +08:00
bittersweet1999
3f50b1dc49
[Fix] fix order bug Update arena_hard.py (#2015) 2025-04-11 16:59:40 +08:00
Junnan Liu
20660ab507
[Fix] Fix compare error when k is list in base_evaluator (#2010)
* fix gpass compare error of list k

* fix compare error in 177
2025-04-10 19:47:21 +08:00
Linchen Xiao
12213207b6
[Refactor] Refactorize openicl eval task (#1990)
* [Refactor] Refactorize openicl eval task

* update
2025-04-09 15:52:23 +08:00
zhulinJulia24
6ac9b06bc2
[ci] update baseline for kernal change of vllm and lmdeploy (#2011)
* update

* update

* update

* update

* update

* update

* update
2025-04-09 14:09:35 +08:00
Linchen Xiao
a05f9da134
[Feature] Make dump-eval-details default behavior (#1999)
* Update

* update

* update
2025-04-08 14:42:26 +08:00
Myhs_phz
fd82bea747
[Fix] OpenICL Math Evaluator Config (#2007)
* fix

* fix recommended

* fix

* fix

* fix

* fix
2025-04-08 14:38:35 +08:00
Linchen Xiao
bb58cfc85d
[Feature] Add CascadeEvaluator (#1992)
* [Feature] Add CascadeEvaluator

* update

* updat
2025-04-08 11:58:14 +08:00
Jin Ye
b564e608b1
[Dataset] Add MedXpertQA (#2002)
* Add MedXpertQA

* Add MedXpertQA

* Add MedXpertQA

* Fix lint

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-04-08 10:44:48 +08:00
shijinpjlab
828fb745c9
[Dataset] Update dingo 1.5.0 (#2008)
Co-authored-by: shiin <shijin@pjlab.org.cn>
2025-04-07 17:21:15 +08:00
zhulinJulia24
f982d6278e
[CI] fix baseline score (#2000)
* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* updaste

* updaste

* update

* update

* update

* update

* update

* update

* update

* update
2025-04-03 19:32:36 +08:00
Myhs_phz
3a9a384173
[Doc] Fix links between zh & en (#2001)
* test

* test

* test
2025-04-03 17:37:53 +08:00
Myhs_phz
9b489e9ea0
[Update] Revert math500 dataset configs (#1998) 2025-04-03 15:11:02 +08:00
Linchen Xiao
dc8deb6af0
[BUMP] Bump version to 0.4.2 (#1997) 2025-04-02 17:47:15 +08:00
liushz
32d6859679
[Feature] Add olymmath dataset (#1982)
* Add olymmath dataset

* Add olymmath dataset

* Add olymmath dataset

* Update olymmath dataset
2025-04-02 17:34:07 +08:00
zhulinJulia24
97236c8e97
[CI] Fix baseline score (#1996)
* update

* update

* update

* update
2025-04-02 14:25:16 +08:00
Linchen Xiao
f66b0b347a
[Update] Requirements update (#1993) 2025-04-02 12:03:45 +08:00
Dongsheng Zhu
330a6e5ca7
[Update] Add Intervl-8b&38b model configs (#1978) 2025-04-01 11:51:37 +08:00
Myhs_phz
f71eb78c72
[Doc] Add TBD Token in Datasets Statistics (#1986)
* feat

* doc

* doc

* doc

* doc
2025-03-31 19:08:55 +08:00
Linchen Xiao
0f46c35211
[Bug] Aime2024 config fix (#1974)
Some checks failed
lint / lint (push) Has been cancelled
* [Bug] Aime2024 config fix

* fix
2025-03-25 17:57:11 +08:00
Myhs_phz
6118596362
[Feature] Add recommendation configs for datasets (#1937)
* feat datasetrefine drop

* fix datasets in fullbench_int3

* fix

* fix

* back

* fix

* fix and doc

* feat

* fix hook

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* doc

* fix

* fix

* Update dataset-index.yml
2025-03-25 14:54:13 +08:00
Linchen Xiao
07930b854a
[Update] Add Korbench config with no max_out_len (#1968)
Some checks are pending
lint / lint (push) Waiting to run
* Add Korbench no max_out_len

* Add Korbench no max_out_len
2025-03-24 18:38:06 +08:00
Myhs_phz
37307fa996
[Update] Add QWQ32b model config (#1959)
Some checks are pending
lint / lint (push) Waiting to run
* feat qwq-32b

* fix

* feat phi_4

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2025-03-24 14:51:39 +08:00
Linchen Xiao
db96161a4e
[Update] Add SuperGPQA subset metrics (#1966) 2025-03-24 14:25:12 +08:00
Linchen Xiao
aa05993922
[Update] Add dataset configurations of no max_out_len (#1967)
* [Update] Add dataset configurations of no max_out_len

* update test torch version

* update test torch version

* update test torch version

* update test torch version
2025-03-24 14:24:12 +08:00
Linchen Xiao
64128916d0
[Update] Increase memory size for CPU job of VOLC Runner (#1962)
* [Update] Increase memory size for CPU job of VOLC Runner

* [Update] Increase memory size for CPU job of VOLC Runner
2025-03-24 11:21:14 +08:00
Dongsheng Zhu
8a5029b121
[Feature] Add MultiPL-E & Code Evaluator (#1963)
* multiple_code develop

* multiple_code update

* comments upadate

* index upadate
2025-03-21 20:09:25 +08:00
Linchen Xiao
b9de8b0e2b
[Update] Unset disallowed_special token for Openai model (#1960) 2025-03-18 20:24:07 +08:00
Songyang Zhang
c98599271b
[Update] Update OlympiadBench and Update LLM Judge (#1954) 2025-03-18 20:15:20 +08:00
Jason Cheung
5d2d253d83
[BUG] Fix model_kwargs pass logic for vllm (#1958) 2025-03-18 20:08:15 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic (#1953) 2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] Bbeh harmony summarizer added (#1951)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* harmonic

* update_summerizer

* harmonic-tested

* harmonic-tested

* clean

* clean

* cleaned_rebased

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
854c6bf025
[Update] Update requirement and base evaluator 2025-03-13 20:52:50 +08:00
Linchen Xiao
1c60e3a0f6
[Update] Add configurations for llmjudge dataset (#1940)
* Add configurations for llmjudge dataset

* update
2025-03-13 17:30:04 +08:00
liushz
709bc4af0e
[Update] Add AIME2025 oss info (#1936)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

* Add AIME2025 oss info

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-12 18:41:16 +08:00
Yufeng Zhao
bc2969dba8
[Feature] Add support for BBEH dataset (#1925)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* results

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-12 10:53:31 +08:00
Kangreen
59e49aedf1
[Feature] Support SuperGPQA (#1924)
* support supergpqa

* remove unnecessary code

* remove unnecessary code

* Add Readme

* Add Readme

* fix lint

* fix lint

* update

* update

---------

Co-authored-by: mkj3085003 <mkj3085003@gmail.com>
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-03-11 19:32:08 +08:00
Linchen Xiao
e403fd21be
[Fix] Fix math-verify evaluator (#1917)
* update

* update

* update
2025-03-11 17:35:04 +08:00
Linchen Xiao
cbf84fb33c
[Feature] Update LLM Evaluation for MMLU-Pro (#1923) 2025-03-07 21:01:20 +08:00
Myhs_phz
570c30cf1b
[Fix] Fix CLI option for results persistence (#1920)
* fix

* fix

* fix
2025-03-07 18:24:30 +08:00
Shudong Liu
277d7946f5
[Fix] Fix typo in deepseed_r1.md (#1916) 2025-03-05 19:37:22 +08:00
Myhs_phz
1585c0adbe
[Feature] Evaluation Results Persistence (#1894)
* feat results_station.py

* lint

* feat save_to_station

* feat result_station.py and lint

* feat

* fix

* fix and lint

* fix

* fix subjective processing

* fix

* fix

* style function name

* lint
2025-03-05 18:33:34 +08:00
Myhs_phz
54324657f0
[Docs] Results persistance (#1908)
* feat persistance.md

* doc

* doc

* lint

* doc

* fix

* doc
2025-03-05 18:23:54 +08:00
Dongsheng Zhu
fff2d51440
[Update] Code evaluation alignment (#1909)
* code alignment

* update oss md5

* bigcodebench update

* lint

* lint_

* lint yapf
2025-03-04 18:49:38 +08:00
Linchen Xiao
5547fd1592
[Bump] Bump version to 0.4.1 2025-03-04 18:26:14 +08:00
liushz
198c08632e
[Feature] Add HLE (Humanity's Last Exam) dataset (#1902)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

* Add HLE dataset

* Add HLE dataset

* Add HLE dataset

---------

Co-authored-by: sudanl <sudanl@foxmail.com>
2025-03-04 16:42:37 +08:00
Songyang Zhang
c84bc18ac1
[Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard (#1899)
* [Update] Support OlympiadBench-Math/OmniMath/LiveMathBench-Hard with LLM Verify

* Update

* Update

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example

* Update DeepSeek-R1 example
2025-03-03 18:56:11 +08:00
Junnan Liu
f0809fe6f6
[Update] Fix Hard Configs With General GPassK (#1906)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* mmodify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config

* fix livemathbench hard configs
2025-03-03 18:17:15 +08:00
Linchen Xiao
6a573f671b
[Fix] Fix compatible issue 2025-03-03 15:35:57 +08:00
Junnan Liu
73c80953c6
[Feature] Support Dataset Repeat and G-Pass Compute for Each Evaluator (#1886)
* support dataset repeat and g-pass compute for each evaluator

* fix pre-commit errors

* delete print

* delete gpassk_evaluator and fix potential errors

* change `repeat` to `n`

* fix `repeat` to `n` in openicl_eval

* update doc for multi-run and g-pass

* update latex equation in doc

* update eng doc for multi-run and g-pass

* update datasets.md

* update datasets.md

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation

* fix multi-line equation in zh_cn user_guides

* mmodify pre-commit-zh-cn

* recover pre-commit and edit math expr in doc

* del [TIP]

* del cite tag in doc

* del extract_model param in livemathbench config
2025-02-26 19:43:12 +08:00
zhulinJulia24
6042b88e58
[CI] update dailytest sceduler and baseline's score(#1898) 2025-02-26 19:04:01 +08:00
Linchen Xiao
bdb2d46f59
[Feature] Add general math, llm judge evaluator (#1892)
* update_doc

* update llm_judge

* update README

* update md file name
2025-02-26 15:08:50 +08:00
Songyang Zhang
fd6fbf01a2
[Update] Support AIME-24 Evaluation for DeepSeek-R1 series (#1888)
* Update

* Update

* Update

* Update
2025-02-25 20:34:41 +08:00
Junnan Liu
22a33d8759
[Update] Update LiveMathBench Hard Configs (#1826)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror

* update livemathbench-hard configs

* remove max_out_len from livemathbench_hard_greedy_gen_9befbf.py

* remove max_out_len from livemathbench_hard_gen_9befbf.py

* rename livemathbench_hard_gen_9befbf.py to livemathbench_hard_gen_353ae7.py

* rename livemathbench_hard_greedy_gen_9befbf.py to livemathbench_hard_greedy_gen_353ae7.py

* update livemathbench_gen_9befbf.py

* remove whitespace

* upload livemathbench hard configs
2025-02-25 17:24:36 +08:00
Dongsheng Zhu
465e93e10e
[Update] Academic bench llm judge update (#1876)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code

* academicBench update

* academic bench ifeval&math update

* generic_llmjudge_aime_academic_postprocess delete

* aime delete

* postprocessors update

* ifeval delete

* update work_dir

* linting

* linting double-quote-string-fixer

* r1-distill out_len update

* fix lint

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-02-24 15:45:24 +08:00
Junnan Liu
046b6f75c6
[Update] Update Greedy Config & README of LiveMathBench (#1862)
* support omni-math

* update config

* upload README

* Delete opencompass/configs/datasets/omni_math/__init__.py

* update greedy config & README of LiveMathBench

* update intro for  max_out_len

* rename livemathbench greedy confi

* delete greedy config

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-02-20 19:47:04 +08:00
Linchen Xiao
d7daee6e25
[Update] OpenAI model update, bigcodebench update (#1879)
* [Update] Openai model update, bigcodebench update

* update
2025-02-20 19:33:25 +08:00
Linchen Xiao
27c916661d
[Feature] Math Verify with model post_processor (#1881)
* update

* [Feature] Update model post_processor

* update

* update

* update
2025-02-20 19:32:12 +08:00
zhulinJulia24
bc22749fd8
[CI] update daily test scores (#1870)
* update

* Update daily-run-test.yml

* Update dlc.py
2025-02-20 14:08:18 +08:00
bittersweet1999
f407930475
[Feature] Support subjective evaluation for reasoning model (#1868)
* fix pip version

* fix pip version

* add subeval for reasoning model

* add subeval for reasoning model

* update configs

* update config

* update config

* update config

* update files
2025-02-20 12:19:46 +08:00
Myhs_phz
68a9838907
[Feature] Add list of supported datasets at html page (#1850)
* feat dataset-index.yml and stat.py

* fix

* fix

* fix

* feat url of paper and config file

* doc all supported dataset list

* docs zh and en

* docs README zh and en

* docs new_dataset

* docs new_dataset
2025-02-14 16:17:30 +08:00
Dongsheng Zhu
3fd8b4e0cd
[Update] Update BigCodeBench & LCBench load path (#1857)
* BigCodeBench update

* update LCBench

* update LCBench 2

* update code
2025-02-08 15:15:47 +08:00
Pablo Hinojosa
9c2e6a192c
[Fix] Update broken links in README.md (#1852) 2025-02-07 15:41:08 +08:00
zhulinJulia24
ffc04cf650
[CI] Update daily-run-test.yml (#1854) 2025-02-07 14:40:16 +08:00
Linchen Xiao
862bf78464
[Demo] Internlm3 math500 thinking demo (#1846)
* [Demo] Add demo for Internlm3 math500 thinking

* [Demo] Add demo for Internlm3 math500 thinking

* update max_out_len

* update start instruction
2025-01-24 14:56:41 +08:00
Shudong Liu
412199f802
[Feature] Support OlympiadBench Benchmark (#1841)
* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* Support OlympiadBench Benchmark

* update dataset path

* Update olmpiadBench

* Update olmpiadBench

* Update olmpiadBench

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-24 10:00:01 +08:00
Junnan Liu
70f2c963d3
[Feature] Support Omni-Math (#1837)
* support omni-math

* update config

* upload README

* Delete opencompass/configs/datasets/omni_math/__init__.py

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-23 18:36:54 +08:00
Linchen Xiao
35ec307c6b
[Bump] Bump version to 0.4.0 (#1838) 2025-01-22 11:41:46 +08:00
Linchen Xiao
03415b2a66
[Fix] Update max_out_len logic for OpenAI model (#1839) 2025-01-21 15:46:14 +08:00
Linchen Xiao
a6193b4c02
[Refactor] Code refactoarization (#1831)
* Update

* fix lint

* update

* fix lint
2025-01-20 19:17:38 +08:00
Jishnu Nair
ffdc917523
[Doc] Installation.md update (#1830) 2025-01-17 11:08:09 +08:00
Myhs_phz
70da9b7776
[Update] Update method to add dataset in docs (#1827)
* create new branch

* docs new_dataset.md zh

* docs new_dataset.md zh and en
2025-01-17 11:07:19 +08:00
Linchen Xiao
531643e771
[Feature] Add support for InternLM3 (#1829)
* update

* update

* update

* update
2025-01-16 14:28:27 +08:00
Alexander Lam
7f2aeeff26
added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings (#1815) 2025-01-10 18:20:25 +08:00
zhulinJulia24
121d482378
[CI] Fix path conflict (#1814)
* update

* Update pr-run-test.yml

* update
2025-01-09 20:16:08 +08:00
zhulinJulia24
abdcee68f6
[CI] Update daily test metrics threshold (#1812)
* Update daily-run-test.yml

* Update pr-run-test.yml

* update

* update

* update

* updaet

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-01-09 18:16:24 +08:00
Zhao Qihao
e039f3efa0
[Feature] Support MMLU-CF Benchmark (#1775)
* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* Update mmlu-cf

* Update mmlu-cf

* Update mmlu-cf

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* [Feature] Support MMLU-CF Benchmark

* Remove outside configs

---------

Co-authored-by: liushz <qq1791167085@163.com>
2025-01-09 14:11:20 +08:00
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench (#1809)
* Update LiveMathBench

* Update New O1 Evaluation

* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
8fdb72f567
[Update] Update o1 eval prompt (#1806)
* Update XML prediction post-process

* Update LiveMathBench

* Update LiveMathBench

* Update New O1 Evaluation
2025-01-07 00:14:32 +08:00
Alexander Lam
f871e80887
[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset (#1802)
* added base_models_abbrs to references (passed from LMEvaluator); added bradleyterry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer;

* added bradleyterry subjective evaluation method to arena_hard dataset
2025-01-03 16:33:43 +08:00
Linchen Xiao
117dc500ad
[Feature] Add Longbenchv2 support (#1801)
* Create eval_longbenchv2.py

* Create longbenchv2_gen.py

* Update __init__.py

* Create longbenchv2.py

* Update datasets_info.py

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
2025-01-03 12:04:29 +08:00
Linchen Xiao
f3220438bc
[BUMP] Bump version to 0.3.9 (#1790) 2024-12-31 16:52:47 +08:00
liushz
9c980cbc62
[Feature] Add LiveStemBench Dataset (#1794)
* [Fix] Fix vllm max_seq_len parameter transfer

* [Fix] Fix vllm max_seq_len parameter transfer

* Add livestembench dataset

* Add livestembench dataset

* Add livestembench dataset

* Update livestembench_gen_3e3c50.py

* Update eval_livestembench.py

* Update eval_livestembench.py
2024-12-31 15:17:39 +08:00
Songyang Zhang
fc0556ec8e
[Fix] Fix generic_llm_evaluator output_path (#1798)
* Fix output_path

* Add Logger
2024-12-31 13:05:05 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation 2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM (#1795)
* Update Generic LLM Evaluator

* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
Junnan Liu
8e8d4f1c64
[Feature] Support G-Pass@k and LiveMathBench (#1772)
* support G-Pass@k and livemathbench

* fix bugs

* fix comments of GPassKEvaluator

* update saved details of GPassKEvaluator

* update saved details of GPassKEvaluator

* fix eval api configs & update openai_api for ease of debugging

* update huggingface path

* fix method name of G-Pass@k

* fix default value of eval_model_name

* refactor G-Pass@k evaluator

* log generation params for each backend

* fix evaluation resume

* add notimplementerror
2024-12-30 16:59:39 +08:00
Linchen Xiao
42b54d6bb8
[Update] Add 0shot CoT config for TheoremQA (#1783) 2024-12-27 16:17:27 +08:00
bittersweet1999
357ce8c7a4
[Fix] Fix model summarizer abbr (#1789)
* fix pip version

* fix pip version

* fix model summarizer abbr

---------

Co-authored-by: root <bittersweet1999>
2024-12-27 14:45:08 +08:00
Linchen Xiao
ae9efb73ad
[CI] Pypi deploy workflow update (#1786) 2024-12-27 14:08:37 +08:00
Linchen Xiao
f103e90764
[CI] Update deploy python version (#1784) 2024-12-27 13:35:36 +08:00
zhulinJulia24
ebeb578fbf
[ci] remove daily step retry and update pr score (#1782)
[ci] remove daily step retry
2024-12-26 16:51:26 +08:00
Linchen Xiao
56eaac6d8f
[Update] Volc status exception handle (#1780)
* update

* update
2024-12-26 15:43:24 +08:00
zhulinJulia24
c48bbde26f
[ci] remove testcase into volc engine (#1777)
* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update
2024-12-25 17:26:50 +08:00
Linchen Xiao
ebefffed61
[Update] Update OC academic 202412 (#1771)
* [Update] Update academic settings

* Update

* update
2024-12-19 18:07:34 +08:00
Chang Lan
d70100cdf2
[Update] Customizable tokenizer for RULER (#1731)
* Customizable tokenizer for RULER

* Relax requirements
2024-12-19 18:02:11 +08:00
Junnan Liu
499302857f
[Fix] Fix Local Runner Params Save Path (#1768)
* update local runner params save dir

* fix remove

* fix directory remove

* Fix *_params.py by uuid4
2024-12-19 16:07:34 +08:00
Mashiro
9a5adbde6a
[Fix] Fix lark reporter issue (#1769) 2024-12-18 19:33:06 +08:00
zhulinJulia24
111f817e04
[ci] add fullbench testcase (#1766)
add volc testcase
2024-12-18 13:24:28 +08:00
bittersweet1999
38dba9919b
[Fix] Fix Subjective summarizer order error (#1767)
* fix pip version

* fix pip version

* fix order error
2024-12-18 13:21:31 +08:00
Linchen Xiao
d593bfeac8
[Bump] Bump version to 0.3.8 (#1765)
* [Bump] Bump version to 0.3.8

* Update README.md
2024-12-17 19:17:18 +08:00
Linchen Xiao
eadbdcb4cb
[Update] Update requirement and deepseek configurations (#1764) 2024-12-17 10:16:47 +08:00
liushz
5c8e91f329
[Fix] Fix vllm max_seq_len parameter transfer (#1745)
* [Fix] Fix vllm max_seq_len parameter transfer

* [Fix] Fix vllm max_seq_len parameter transfer

* Update pr-run-test.yml

* Update pr-run-test.yml

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-16 21:44:36 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751)
* fix lint issues

* updated gitignore

* changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py

* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summerizer (#1758)
add new dataset summerizer
2024-12-13 09:50:43 +08:00
zhulinJulia24
a1c00cc8b7
[ci] add common_summarizer return (#1724)
* Update common_summarizer.py

* Update common_summarizer.py
2024-12-11 20:38:32 +08:00
liushz
c4ce0174fe
[Fix] Fix ChineseSimpleQA max_out_len (#1757)
* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* pdate Csimpleqa

* pdate Csimpleqa

* Update Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
2024-12-11 19:51:27 +08:00
Linchen Xiao
bd7b705be4
[Update] Update dataset configuration with no max_out_len (#1754) 2024-12-11 18:20:29 +08:00
OpenStellarTeam
1a5b3fc11e
Add Chinese SimpleQA config (#1697)
* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* add chinese simpleqa config

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* Update CsimpleQA

* pdate Csimpleqa

---------

Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-12-11 18:03:39 +08:00
Linchen Xiao
0d26b348e4
[Feature] Add OC academic 2412 (#1750) 2024-12-10 21:53:06 +08:00
bittersweet1999
54c0fb7a93
[Change] Change Compassarena metric (#1749)
* fix pip version

* fix pip version

* fix summarizer bug

* fix compassarena

* fix compassarena

* fix compassarena
2024-12-10 14:45:32 +08:00
Songyang Zhang
0d8df541bc
[Update] Update O1-style Benchmark and Prompts (#1742)
* Update JuderBench

* Support O1-style Prompts

* Update Code

* Update OpenAI

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update

* Update

* Update

* Update
2024-12-09 13:48:56 +08:00
Junnan Liu
f333be177c
[Update] Add MATH500 & AIME2024 to LiveMathBench (#1741)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500

* update std computation & k-pass computation

* add AIME224

* update README
2024-12-06 14:36:49 +08:00
bittersweet1999
08d63b5bf3
[Fix] Fix error in subjective default summarizer (#1740)
* fix pip version

* fix pip version

* fix summarizer bug
2024-12-06 11:03:53 +08:00
Songyang Zhang
fb43dd1906
[Update] Update Skywork/Qwen-QwQ (#1728)
* Update JuderBench

* Support O1-style Prompts

* Update Code

* Update OpenAI

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update BigCodeBench

* Update
2024-12-05 19:30:43 +08:00
Junnan Liu
6181ac1122
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation (#1730)
* upload dataset definitions & configs

* add single dataset split specific metrics

* add k-pass@threshold & MATH500
2024-12-05 16:54:16 +08:00
Linchen Xiao
4f317d1bd5
[Update] Update Manifest (#1738) 2024-12-05 13:59:56 +08:00
Linchen Xiao
ac23f0ce1f
[Update] Update init file for Korbench (#1737) 2024-12-05 11:26:00 +08:00
Yufeng Zhao
4d773904d4
[Update] Korbench readme supplementation (#1734)
* renewed

* readme

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-05 11:24:35 +08:00
Linchen Xiao
a011be6798
[Feature] DLC runner Lark report (#1735)
* [Bump] Bump version to 0.3.7

* DLC lark report update
2024-12-04 18:03:12 +08:00
Linchen Xiao
e2a290fd46
[Bump] Bump version to 0.3.7 (#1733) 2024-12-03 19:34:57 +08:00
Yufeng Zhao
98c4666d65
[Update] Update Korbench dataset abbr (#1729)
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-02 16:20:58 +08:00
Linchen Xiao
9de27b4d85
[Update] Update max_out_len for datasets (#1726)
* [Update] Update max_out_len for datasets

* Update eval_regression_chat_objective_fullbench.py

* Update eval_regression_chat.py

* Update eval_regression_chat.py

* Update oc_score_baseline_fullbench.yaml

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-02 11:42:07 +08:00
Junnan Liu
fe6d76fb13
[Feature] Support LiveMathBench (#1727) 2024-11-30 00:07:19 +08:00
liushz
b063779034
[Fix] Update P-MMEVAL OSS data (#1722)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

* Fix pmmeval_gen config

* Update P-MMEVAL data

---------

Co-authored-by: wanyu <wanyu2018umac@gmail.com>
Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>
2024-11-28 20:55:46 +08:00
liushz
c437135fad
[Feature] Add Openai Simpleqa dataset (#1720)
* Add Openai SimpleQA dataset

* Add Openai SimpleQA dataset

* Add Openai SimpleQA dataset

* Update eval_simpleqa.py

---------

Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-11-28 19:16:07 +08:00
liushz
06ab27861e
[Fix] Fix pmmeval_gen config (#1719)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

* Fix pmmeval_gen config

---------

Co-authored-by: wanyu <wanyu2018umac@gmail.com>
Co-authored-by: wanyu2018umac <42405907+wanyu2018umac@users.noreply.github.com>
2024-11-28 11:53:36 +08:00
wanyu2018umac
90efcf2216
[Feature] Add P-MMEval (#1714)
* Update with PMMEval

* Update

* Update __init__.py

* Fix Bugs

* Delete .pre-commit-config.yaml

* Pull merge

---------

Co-authored-by: liushz <qq1791167085@163.com>
2024-11-27 21:26:18 +08:00
Junnan Liu
f7dbe6bb7d
[Feature] Add Arc Prize Public Evaluation (#1690)
* support arc prize

* update arc-prize dataset info & update arc-prize evaluation performance
2024-11-27 15:44:41 +08:00
Yi Ding
bcb707dbfc
[Fix] Fix BailingAPI model (#1707)
* [fix] sequence under the multiple samples

* resolve the lint problems

* change the parameter name

* add another error code for retry

* output the log for invalid response

* format correction

* update

* update

* update

* update

* add two model python files

* update the default parameter

* use random for delay

* update the api example of bailing

* remove the unnecessary parameter
2024-11-26 19:24:47 +08:00
Linchen Xiao
ef695e28e5
[Bug] Fix Korbench dataset module (#1717) 2024-11-26 17:13:28 +08:00
Songyang Zhang
f97c4eae42
[Update] Update Fullbench (#1712)
* Update JuderBench

* Support O1-style Prompts

* Update Code
2024-11-26 14:26:55 +08:00
Yufeng Zhao
300adc31e8
[Feature] Add Korbench dataset (#1713)
* first version for korbench

* first stage for korbench

* korbench_1

* korbench_1

* korbench_1

* korbench_1

* korbench_1_revised

* korbench_combined_1

* korbench_combined_1

* kor_combined

* kor_combined

* update

---------

Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2024-11-25 20:11:27 +08:00
Chang Lan
5c1916ea4c
[Update] Add RULER 64k config (#1709) 2024-11-25 19:35:27 +08:00
liushz
e49fcfd3a3
[Update] Update MATH dataset with model judge (#1711)
* Update math with llm judge

* Update math with llm judge

* Update math with llm judge

* Update math with llm judge

* Update math with llm judge
2024-11-25 15:14:55 +08:00
Linchen Xiao
80e3b9ef37
[Update] Add math prm 800k (#1708) 2024-11-21 21:29:43 +08:00
Linchen Xiao
500fb1032a
[Update] Update configurations (#1704) 2024-11-21 16:51:18 +08:00
zhulinJulia24
ed81f9df30
[CI] update torch version and add more datasets into daily testcase (#1701)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-21 10:37:33 +08:00
Yi Ding
05044dfaf2
[Update] Support new error code for Bailing model (#1702)
* support new error code

* fix the lint problems
2024-11-20 16:40:22 +08:00
Linchen Xiao
ff831b153e
[BUMP] Bump version to 0.3.6 (#1694) 2024-11-18 20:24:50 +08:00
Linchen Xiao
ab8fdbbaab
[Update] Update Math auto-download data (#1700) 2024-11-18 20:24:35 +08:00
Linchen Xiao
98242ff1d1
[Update] first_option_postprocess (#1699)
* update first_option_postprocess

* update
2024-11-18 20:14:29 +08:00
Linchen Xiao
4653f6976e
[Update] update volc CPU flavor (#1698) 2024-11-18 12:33:51 +08:00
zhulinJulia24
4a20e1176d
[CI] Update baselines (#1693)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-15 14:46:29 +08:00
Linchen Xiao
40a9f0be0d
[Update] MUSR dataset config prefix update (#1692) 2024-11-15 11:06:30 +08:00
abrohamLee
e9e4b69ddb
[Feature] MuSR Datset Evaluation (#1689)
* MuSR Datset Evaluation

* MuSR Datset Evaluation

Add an assertion and a Readme.md
2024-11-14 20:42:12 +08:00
Linchen Xiao
d415439f9b
[Fix] Fix bug for first_option_postprocess (#1688) 2024-11-14 16:45:59 +08:00
Linchen Xiao
e92a5d4230
[Feature] BABILong Dataset added (#1684)
* update

* update

* update

* update
2024-11-14 15:32:43 +08:00
Linchen Xiao
2fee63f537
[Update] Auto-download for followbench (#1685) 2024-11-13 15:47:29 +08:00
zhulinJulia24
f8a1c1f487
[CI] update (#1682)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-13 10:48:05 +08:00
bittersweet1999
aca8ec3c6a
[Hotfix] Hotfix (#1683)
* fix pip version

* fix pip version

* fix lint

* hotfix
2024-11-13 10:14:27 +08:00
zhulinJulia24
a9d6b6461f
[ci] react daily test (#1668)
* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* updaste

* update

* update

* refactor summarize

* update

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* updaste

* update

* update

* update

* update

* update

* updaste

* updaste

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Update daily-run-test.yml

* Update daily-run-test.yml

* update

* update

* Update daily-run-test.yml

* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-11-12 18:40:27 +08:00
sobeit
3ec178f4a9
add single lora adapter support for vLLM inference. (#1679) 2024-11-12 17:31:36 +08:00
bittersweet1999
17b5e52f6c
[Hotfix] lmdeploy temp (#1674)
* fix pip version

* fix pip version

* hotfix
2024-11-12 16:10:16 +08:00
Linchen Xiao
a0ef2fd3b4
[Update] Dingo Dataset update (#1670)
* [Update] Dingo Dataset update

* update
2024-11-08 14:38:43 +08:00
Linchen Xiao
835bf75a36
[Feature] Add long context evaluation for base models (#1666)
* [Update] Add base long context evaluation

* update
2024-11-08 10:53:29 +08:00
Chang Cheng
fd7aa83c01
[Update] Update DLC Runner(#1662)
* push interntrain hard code

* push interntrain hard code

* remove redundant post process

---------

Co-authored-by: changcheng <changcheng@pjlab.org.cb>
Co-authored-by: changcheng <changcheng@pjlab.org.cn>
2024-11-07 15:45:35 +08:00
2339 changed files with 48023 additions and 65926 deletions

42
.github/scripts/eval_regression_api.py vendored Normal file
View File

@ -0,0 +1,42 @@
from mmengine.config import read_base
from opencompass.models.openai_api import OpenAISDK
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_gen import \
race_datasets # noqa: F401, E501
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = [
dict(
abbr='lmdeploy-api-test',
type=OpenAISDK,
key='EMPTY',
openai_api_base='http://localhost:23333/v1',
path='internlm3',
tokenizer_path='internlm/internlm3-8b-instruct',
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=128,
max_out_len=1024,
max_seq_len=4096,
temperature=0.01,
batch_size=128,
retry=20,
)
]
for d in datasets:
d['reader_cfg']['test_range'] = '[0:16]'

View File

@ -0,0 +1,210 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.ARC_c.ARC_c_few_shot_ppl import \
ARC_c_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_gen_98fba6 import \
bbh_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_ppl_041cbf import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.dingo.dingo_gen import \
datasets as dingo_datasets # noqa: F401, E501
from opencompass.configs.datasets.drop.drop_gen_a2697c import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_d21e37 import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_few_shot_ppl_4b5a83 import \
gpqa_datasets # noqa: F401, E501
# Corebench v1.7
from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_ppl_59c85e import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.internal_humaneval_gen_ce6b06 import \
humaneval_datasets as humaneval_v2_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.internal_humaneval_gen_d2537e import \
humaneval_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_4shot_base_gen_43d5b6 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_few_shot_mixed_4a3fd4 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_742f0c import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d import \
mmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_few_shot_gen_bfaf90 import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.nq.nq_open_1shot_gen_20a989 import \
nq_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_few_shot_ppl import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_few_shot_ppl import \
BoolQ_datasets # noqa: F401, E501
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
TheoremQA_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_20a989 import \
triviaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.wikibench.wikibench_few_shot_ppl_c23d79 import \
wikibench_datasets # noqa: F401, E501
from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
winogrande_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b import \
models as hf_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b import \
models as lmdeploy_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
# Summary Groups
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
race_datasets = [race_datasets[1]] # Only take RACE-High
humaneval_v2_datasets[0]['abbr'] = 'openai_humaneval_v2'
bbh_datasets = [
x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
'ancient_chinese', 'chinese_civil_service_exam',
'chinese_driving_rule', 'chinese_food_culture',
'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
'chinese_teacher_qualification', 'construction_project_management',
'elementary_chinese', 'elementary_commonsense', 'ethnology',
'high_school_politics', 'modern_chinese',
'traditional_chinese_medicine'
]
]
mmlu_datasets = [
x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
'business_ethics', 'clinical_knowledge', 'college_medicine',
'global_facts', 'human_aging', 'management', 'marketing',
'medical_genetics', 'miscellaneous', 'nutrition',
'professional_accounting', 'professional_medicine', 'virology'
]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
summarizer = dict(
dataset_abbrs=[
'Language',
['race-high', 'accuracy'],
['ARC-c', 'accuracy'],
['BoolQ', 'accuracy'],
['triviaqa_wiki_1shot', 'score'],
['nq_open_1shot', 'score'],
'',
'General Reasoning',
['drop', 'accuracy'],
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['hellaswag', 'accuracy'],
['TheoremQA', 'score'],
['winogrande', 'accuracy'],
'',
'Math Calculation',
['gsm8k', 'accuracy'],
['GaokaoBench', 'weighted_average'],
'GaokaoBench_2010-2022_Math_II_MCQs',
'GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank',
['math', 'accuracy'],
['Mathbench', 'naive_average'],
'',
'Knowledge',
['wikibench-wiki-single_choice_cncircular', 'perf_4'],
['cmmlu', 'naive_average'],
['mmlu', 'naive_average'],
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['openai_humaneval_v2', 'humaneval_pass@1'],
['sanitized_mbpp', 'score'],
'',
['dingo_en_192', 'score'],
['dingo_zh_170', 'score'],
'',
'mmlu',
'mmlu-stem',
'mmlu-social-science',
'mmlu-humanities',
['mmlu-other', 'accuracy'],
'',
'cmmlu',
'cmmlu-stem',
'cmmlu-social-science',
'cmmlu-humanities',
'cmmlu-other',
['cmmlu-china-specific', 'accuracy'],
'',
'mmlu_pro',
'mmlu_pro_biology',
'mmlu_pro_business',
'mmlu_pro_chemistry',
'mmlu_pro_computer_science',
'mmlu_pro_economics',
'mmlu_pro_engineering',
'mmlu_pro_health',
'mmlu_pro_history',
'mmlu_pro_law',
'mmlu_pro_math',
'mmlu_pro_philosophy',
'mmlu_pro_physics',
'mmlu_pro_psychology',
'mmlu_pro_other',
'',
'bbh-logical_deduction_seven_objects',
'bbh-multistep_arithmetic_two',
'###### MathBench-A: Application Part ######',
'college',
'high',
'middle',
'primary',
'arithmetic',
'mathbench-a (average)',
'###### MathBench-T: Theory Part ######',
'college_knowledge',
'high_knowledge',
'middle_knowledge',
'primary_knowledge',
'mathbench-t (average)',
],
summary_groups=summary_groups,
)
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
for d in datasets:
d['reader_cfg']['test_range'] = '[0:16]'
for m in models:
m['abbr'] = m['abbr'] + '_fullbench'
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])

View File

@ -2,51 +2,85 @@ from mmengine.config import read_base
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.gsm8k.gsm8k_gen_17d0dc import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_ppl import \
race_datasets # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_moe_16b_base import \
models as hf_deepseek_moe_16b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_v2_lite import \
models as hf_deepseek_v2_lite_model # noqa: F401, E501
from opencompass.configs.datasets.winogrande.winogrande_5shot_ll_252f01 import \
winogrande_datasets # noqa: F401, E501
# read hf models - chat models
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b import \
models as lmdeploy_glm4_9b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_7b_base import \
models as hf_deepseek_7b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_7b_base import \
models as lmdeploy_deepseek_7b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_67b_base import \
models as lmdeploy_deepseek_67b_base_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2 import \
lmdeploy_deepseek_v2_model # noqa: F401, E501
from opencompass.configs.models.deepseek.vllm_deepseek_moe_16b_base import \
models as vllm_deepseek_moe_16b_base_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_2b import \
models as hf_gemma2_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_9b import \
models as hf_gemma2_9b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_2b import \
models as hf_gemma_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_7b import \
models as hf_gemma_7b_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_9b import \
models as lmdeploy_gemma_9b_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_2b import \
models as vllm_gemma_2b_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_7b import \
models as vllm_gemma_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b import \
models as hf_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_7b import \
models as hf_internlm2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_base_7b import \
models as hf_internlm2_base_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_1_8b import \
models as lmdeploy_internlm2_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b import \
models as lmdeploy_internlm2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_7b import \
models as lmdeploy_internlm2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_20b import \
models as lmdeploy_internlm2_20b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_base_7b import \
models as lmdeploy_internlm2_base_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_base_20b import \
models as lmdeploy_internlm2_base_20b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama2_7b import \
models as hf_llama2_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_1_8b import \
models as hf_llama3_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_8b import \
models as hf_llama3_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b import \
models as lmdeploy_llama3_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b import \
models as lmdeploy_llama3_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_70b import \
models as lmdeploy_llama3_70b_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_v0_3 import \
models as hf_mistral_7b_v0_3_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_v0_2 import \
models as vllm_mistral_7b_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mixtral_8x7b_v0_1 import \
models as vllm_mixtral_8x7b_v0_1_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_7b import \
models as hf_qwen_2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_14b import \
models as hf_qwen_2_5_14b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen_2_5_32b import \
models as hf_qwen_2_5_32b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b import \
models as lmdeploy_qwen2_5_1_5b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_7b import \
models as lmdeploy_qwen2_5_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_32b import \
models as lmdeploy_qwen2_5_32b_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b import \
models as lmdeploy_qwen2_5_72b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen1_5_moe_a2_7b import \
models as hf_qwen1_5_moe_a2_7b_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_0_5b import \
@ -65,11 +99,31 @@ with read_base():
models as hf_yi_1_5_6b_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_9b import \
models as hf_yi_1_5_9b_model # noqa: F401, E501
from opencompass.configs.summarizers.medium import \
summarizer # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b import \
models as lmdeploy_yi_1_5_9b_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
race_datasets = [race_datasets[1]]
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
for d in datasets:
d['reader_cfg']['test_range'] = '[0:100]'
d['reader_cfg']['test_range'] = '[0:32]'
for m in models:
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
summarizer = dict(
dataset_abbrs=[
['gsm8k', 'accuracy'],
['GPQA_diamond', 'accuracy'],
['race-high', 'accuracy'],
['winogrande', 'accuracy'],
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)

View File

@ -1,127 +0,0 @@
from mmengine.config import read_base
from opencompass.models import OpenAISDK
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_gen import \
race_datasets # noqa: F401, E501
# read hf models - chat models
from opencompass.configs.models.baichuan.hf_baichuan2_7b_chat import \
models as hf_baichuan2_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.hf_glm4_9b_chat import \
models as hf_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
models as lmdeploy_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.vllm_glm4_9b_chat import \
models as vllm_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_7b_chat import \
models as hf_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_moe_16b_chat import \
models as hf_deepseek_moe_16b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_v2_lite_chat import \
models as hf_deepseek_v2_lite_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.vllm_deepseek_7b_chat import \
models as vllm_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_2b_it import \
models as hf_gemma2_2b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_9b_it import \
models as hf_gemma2_9b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_7b_it import \
models as vllm_gemma_7b_it_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_20b_chat import \
models as hf_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
models as lmdeploy_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import \
models as lmdeploy_internlm2_chat_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b_sft import \
models as lmdeploy_internlm2_chat_1_8b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
models as lmdeploy_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b_sft import \
models as lmdeploy_internlm2_chat_7b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.vllm_internlm2_chat_7b import \
models as vllm_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_1_8b_instruct import \
models as hf_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
models as hf_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
models as lmdeploy_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
models as lmdeploy_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_3 import \
models as hf_mistral_7b_instruct_v0_3_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_instruct_v0_2 import \
models as vllm_mistral_7b_instruct_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mixtral_8x7b_instruct_v0_1 import \
models as vllm_mixtral_8x7b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.openbmb.hf_minicpm_2b_dpo_fp32 import \
models as hf_minicpm_2b_dpo_fp32_model # noqa: F401, E501
from opencompass.configs.models.openbmb.hf_minicpm_2b_sft_bf16 import \
models as hf_minicpm_2b_sft_bf16_model # noqa: F401, E501
from opencompass.configs.models.openbmb.hf_minicpm_2b_sft_fp32 import \
models as hf_minicpm_2b_sft_fp32_model # noqa: F401, E501
from opencompass.configs.models.phi.hf_phi_3_mini_4k_instruct import \
models as hf_phi_3_mini_4k_instruct_model # noqa: F401, E501
from opencompass.configs.models.phi.hf_phi_3_small_8k_instruct import \
models as hf_phi_3_mini_8k_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen1_5_0_5b_chat import \
models as hf_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_1_5b_instruct import \
models as hf_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_7b_instruct import \
models as hf_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import \
models as lmdeploy_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
models as lmdeploy_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.vllm_qwen1_5_0_5b_chat import \
models as vllm_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import \
models as hf_yi_1_5_6b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_9b_chat import \
models as hf_yi_1_5_9b_chat_model # noqa: F401, E501
from opencompass.configs.summarizers.medium import \
summarizer # noqa: F401, E501
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
model_name = ''
models.append(
dict(
abbr='lmdeploy-api-test',
type=OpenAISDK,
key='EMPTY',
openai_api_base='http://judgemodel:10001/v1',
path='compass_judger_internlm2_102b_0508',
tokenizer_path='internlm/internlm2_5-20b-chat',
rpm_verbose=True,
meta_template=api_meta_template,
query_per_second=50,
max_out_len=1024,
max_seq_len=4096,
temperature=0.01,
batch_size=128,
retry=3,
))
for d in datasets:
d['reader_cfg']['test_range'] = '[0:100]'

View File

@ -0,0 +1,193 @@
from mmengine.config import read_base
with read_base():
# choose a list of datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_gen import \
race_datasets # noqa: F401, E501
# read hf models - chat models
from opencompass.configs.models.chatglm.hf_glm4_9b_chat import \
models as hf_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.lmdeploy_glm4_9b_chat import \
models as lmdeploy_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.chatglm.vllm_glm4_9b_chat import \
models as vllm_glm4_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.hf_deepseek_7b_chat import \
models as hf_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_67b_chat import \
models as lmdeploy_deepseek_67b_chat_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_llama_8b import \
models as \
lmdeploy_deepseek_r1_distill_llama_8b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_llama_70b import \
models as \
lmdeploy_deepseek_r1_distill_llama_70b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_qwen_1_5b import \
models as \
lmdeploy_deepseek_r1_distill_qwen_1_5b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_r1_distill_qwen_32b import \
models as \
lmdeploy_deepseek_r1_distill_qwen_32b_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_5_1210 import \
models as lmdeploy_deepseek_v2_5_1210_model # noqa: F401, E501
from opencompass.configs.models.deepseek.lmdeploy_deepseek_v2_lite import \
models as lmdeploy_deepseek_v2_lite_model # noqa: F401, E501
from opencompass.configs.models.deepseek.vllm_deepseek_7b_chat import \
models as vllm_deepseek_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_2b_it import \
models as hf_gemma2_2b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_9b_it import \
models as hf_gemma2_9b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma2_27b_it import \
models as hf_gemma2_27b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_2b_it import \
models as hf_gemma_2b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.hf_gemma_7b_it import \
models as hf_gemma_7b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_9b_it import \
models as lmdeploy_gemma_9b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.lmdeploy_gemma_27b_it import \
models as lmdeploy_gemma_27b_it_model # noqa: F401, E501
from opencompass.configs.models.gemma.vllm_gemma_7b_it import \
models as vllm_gemma_7b_it_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_20b_chat import \
models as hf_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm3_8b_instruct import \
models as hf_internlm3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_20b_chat import \
models as lmdeploy_internlm2_5_20b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b import \
models as lmdeploy_internlm2_chat_1_8b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_1_8b_sft import \
models as lmdeploy_internlm2_chat_1_8b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
models as lmdeploy_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b_sft import \
models as lmdeploy_internlm2_chat_7b_sft_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm3_8b_instruct import \
models as lmdeploy_internlm3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.vllm_internlm2_chat_7b import \
models as vllm_internlm2_chat_7b_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_1_8b_instruct import \
models as hf_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_2_3b_instruct import \
models as hf_llama3_2_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.hf_llama3_8b_instruct import \
models as hf_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama2_7b_chat import \
models as lmdeploy_llama2_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_1_8b_instruct import \
models as lmdeploy_llama3_1_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_2_3b_instruct import \
models as lmdeploy_llama3_2_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_3_70b_instruct import \
models as lmdeploy_llama3_3_70b_instruct_model # noqa: F401, E501
from opencompass.configs.models.hf_llama.lmdeploy_llama3_8b_instruct import \
models as lmdeploy_llama3_8b_instruct_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_2 import \
models as hf_mistral_7b_instruct_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_7b_instruct_v0_3 import \
models as hf_mistral_7b_instruct_v0_3_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_nemo_instruct_2407 import \
models as hf_mistral_nemo_instruct_2407_model # noqa: F401, E501
from opencompass.configs.models.mistral.hf_mistral_small_instruct_2409 import \
models as hf_mistral_small_instruct_2409_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_large_instruct_2411 import \
models as \
lmdeploy_mistral_large_instruct_2411_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_nemo_instruct_2407 import \
models as lmdeploy_mistral_nemo_instruct_2407_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mistral_small_instruct_2409 import \
models as \
lmdeploy_mistral_small_instruct_2409_model # noqa: F401, E501
from opencompass.configs.models.mistral.lmdeploy_mixtral_8x22b_instruct_v0_1 import \
models as \
lmdeploy_mixtral_8x22b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_instruct_v0_1 import \
models as vllm_mistral_7b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mistral_7b_instruct_v0_2 import \
models as vllm_mistral_7b_instruct_v0_2_model # noqa: F401, E501
from opencompass.configs.models.mistral.vllm_mixtral_8x22b_instruct_v0_1 import \
models as vllm_mixtral_8x22b_instruct_v0_1_model # noqa: F401, E501
from opencompass.configs.models.nvidia.lmdeploy_nemotron_70b_instruct_hf import \
models as lmdeploy_nemotron_70b_instruct_hf_model # noqa: F401, E501
from opencompass.configs.models.phi.hf_phi_4 import \
models as hf_phi_4_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_0_5b_instruct import \
models as hf_qwen2_5_0_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_3b_instruct import \
models as hf_qwen2_5_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.hf_qwen2_5_14b_instruct import \
models as hf_qwen2_5_14b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_0_5b_instruct import \
models as lmdeploy_qwen2_5_0_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_3b_instruct import \
models as lmdeploy_qwen2_5_3b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_14b_instruct import \
models as lmdeploy_qwen2_5_14b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_72b_instruct import \
models as lmdeploy_qwen2_5_72b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen1_5_0_5b_chat import \
models as hf_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_1_5b_instruct import \
models as hf_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.hf_qwen2_7b_instruct import \
models as hf_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_1_5b_instruct import \
models as lmdeploy_qwen2_1_5b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.lmdeploy_qwen2_7b_instruct import \
models as lmdeploy_qwen2_7b_instruct_model # noqa: F401, E501
from opencompass.configs.models.qwen.vllm_qwen1_5_0_5b_chat import \
models as vllm_qwen1_5_0_5b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_6b_chat import \
models as hf_yi_1_5_6b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.hf_yi_1_5_9b_chat import \
models as hf_yi_1_5_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_6b_chat import \
models as lmdeploy_yi_1_5_6b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_9b_chat import \
models as lmdeploy_yi_1_5_9b_chat_model # noqa: F401, E501
from opencompass.configs.models.yi.lmdeploy_yi_1_5_34b_chat import \
models as lmdeploy_yi_1_5_34b_chat_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
hf_glm4_9b_chat_model[0]['path'] = 'THUDM/glm-4-9b-chat-hf'
race_datasets = [race_datasets[1]]
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
for d in datasets:
d['reader_cfg']['test_range'] = '[0:32]'
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
summarizer = dict(
dataset_abbrs=[
'gsm8k',
'race-middle',
'race-high',
],
summary_groups=sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], []),
)

View File

@ -0,0 +1,317 @@
from mmengine.config import read_base
with read_base():
# read hf models - chat models
# Dataset
from opencompass.configs.datasets.aime2024.aime2024_gen_6e39a4 import \
aime2024_datasets # noqa: F401, E501
from opencompass.configs.datasets.ARC_c.ARC_c_cot_gen_926652 import \
ARC_c_datasets # noqa: F401, E501
# remove because of oom
# from opencompass.configs.datasets.ARC_Prize_Public_Evaluation.arc_prize_public_evaluation_gen_872059 import arc_prize_public_evaluation_datasets # noqa: F401, E501
from opencompass.configs.datasets.bbh.bbh_gen_5b92b0 import \
bbh_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_complete_gen_faf748 import \
bigcodebench_hard_complete_datasets # noqa: F401, E501
from opencompass.configs.datasets.bigcodebench.bigcodebench_hard_instruct_gen_8815eb import \
bigcodebench_hard_instruct_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmmlu.cmmlu_0shot_cot_gen_305931 import \
cmmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.cmo_fib.cmo_fib_gen_ace24b import \
cmo_fib_datasets # noqa: F401, E501
from opencompass.configs.datasets.drop.drop_openai_simple_evals_gen_3857b0 import \
drop_datasets # noqa: F401, E501
from opencompass.configs.datasets.ds1000.ds1000_service_eval_gen_cbc84f import \
ds1000_datasets # noqa: F401, E501
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import \
GaokaoBench_datasets # noqa: F401, E501
from opencompass.configs.datasets.gpqa.gpqa_openai_simple_evals_gen_5aeece import \
gpqa_datasets # noqa: F401, E501
# new datasets in Fullbench v1.1
from opencompass.configs.datasets.gsm8k.gsm8k_0shot_v2_gen_6e39a4 import \
gsm8k_datasets # noqa: F401, E501
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import \
hellaswag_datasets # noqa: F401, E501
from opencompass.configs.datasets.humaneval.humaneval_openai_sample_evals_gen_dcae0e import \
humaneval_datasets # noqa: F401, E501
from opencompass.configs.datasets.humanevalx.humanevalx_gen_3d84a3 import \
humanevalx_datasets # noqa: F401, E501
from opencompass.configs.datasets.IFEval.IFEval_gen_353ae7 import \
ifeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.korbench.korbench_single_0_shot_gen import \
korbench_0shot_single_datasets # noqa: F401, E501
from opencompass.configs.datasets.livecodebench.livecodebench_gen_b2b0fd import \
LCB_datasets # noqa: F401, E501
from opencompass.configs.datasets.math.math_0shot_gen_11c4b5 import \
math_datasets # noqa: F401, E501
from opencompass.configs.datasets.MathBench.mathbench_2024_gen_50a320 import \
mathbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.mbpp.sanitized_mbpp_mdblock_gen_a447ff import \
sanitized_mbpp_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu.mmlu_openai_simple_evals_gen_b618ea import \
mmlu_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmlu_pro.mmlu_pro_0shot_cot_gen_08c1de import \
mmlu_pro_datasets # noqa: F401, E501
from opencompass.configs.datasets.mmmlu_lite.mmmlu_lite_gen_c51a84 import \
mmmlu_lite_datasets # noqa: F401, E501
from opencompass.configs.datasets.musr.musr_gen_3622bb import \
musr_datasets # noqa: F401, E501
from opencompass.configs.datasets.nq.nq_open_1shot_gen_2e45e5 import \
nq_datasets # noqa: F401, E501
from opencompass.configs.datasets.race.race_cot_gen_d95929 import \
race_datasets # noqa: F401, E501
from opencompass.configs.datasets.scicode.scicode_gen_085b98 import \
SciCode_datasets # noqa: F401, E501
from opencompass.configs.datasets.SuperGLUE_BoolQ.SuperGLUE_BoolQ_cot_gen_1d56df import \
BoolQ_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_en_gen_1ac254 import \
teval_datasets as teval_en_datasets # noqa: F401, E501
from opencompass.configs.datasets.teval.teval_zh_gen_1ac254 import \
teval_datasets as teval_zh_datasets # noqa: F401, E501
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import \
TheoremQA_datasets # noqa: F401, E501
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_bc5f21 import \
triviaqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.wikibench.wikibench_gen_0978ad import \
wikibench_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
# Summary Groups
# Summary Groups
from opencompass.configs.summarizers.groups.bbh import \
bbh_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.cmmlu import \
cmmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.ds1000 import \
ds1000_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.GaokaoBench import \
GaokaoBench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.humanevalx import \
humanevalx_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.korbench import \
korbench_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mathbench_v1_2024 import \
mathbench_2024_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu import \
mmlu_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.mmlu_pro import \
mmlu_pro_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.musr_average import \
summarizer as musr_summarizer # noqa: F401, E501
from opencompass.configs.summarizers.groups.scicode import \
scicode_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.groups.teval import \
teval_summary_groups # noqa: F401, E501
from opencompass.configs.summarizers.mmmlu_lite import \
mmmlu_summary_groups # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
# For HumanEval-X Evaluation
# Apply the evaluator ip_address and port
race_datasets = [race_datasets[1]]
for item in humanevalx_datasets:
item['eval_cfg']['evaluator'][
'ip_address'] = 'codeeval.opencompass.org.cn/humanevalx'
item['eval_cfg']['evaluator']['port'] = ''
# For DS-1000 Evaluation
# Apply the evaluator ip_address and port
for item in ds1000_datasets:
item['eval_cfg']['evaluator'][
'ip_address'] = 'codeeval.opencompass.org.cn/ds1000'
item['eval_cfg']['evaluator']['port'] = ''
bbh_datasets = [
x for x in bbh_datasets if 'logical_deduction_seven_objects' in x['abbr']
or 'multistep_arithmetic_two' in x['abbr']
]
cmmlu_datasets = [
x for x in cmmlu_datasets if x['abbr'].replace('cmmlu-', '') in [
'ancient_chinese', 'chinese_civil_service_exam',
'chinese_driving_rule', 'chinese_food_culture',
'chinese_foreign_policy', 'chinese_history', 'chinese_literature',
'chinese_teacher_qualification', 'construction_project_management',
'elementary_chinese', 'elementary_commonsense', 'ethnology',
'high_school_politics', 'modern_chinese',
'traditional_chinese_medicine'
]
]
mmlu_datasets = [
x for x in mmlu_datasets if x['abbr'].replace('lukaemon_mmlu_', '') in [
'business_ethics', 'clinical_knowledge', 'college_medicine',
'global_facts', 'human_aging', 'management', 'marketing',
'medical_genetics', 'miscellaneous', 'nutrition',
'professional_accounting', 'professional_medicine', 'virology'
]
]
mmlu_pro_datasets = [mmlu_pro_datasets[0]]
mmmlu_lite_datasets = [
x for x in mmmlu_lite_datasets if 'mmlu_lite_AR-XY' in x['abbr']
]
mathbench_datasets = [x for x in mathbench_datasets if 'college' in x['abbr']]
GaokaoBench_datasets = [
x for x in GaokaoBench_datasets if '2010-2022_Math_II_MCQs' in x['abbr']
or '2010-2022_Math_II_Fill-in-the-Blank' in x['abbr']
]
datasets = sum(
(v for k, v in locals().items() if k.endswith('_datasets')
and 'scicode' not in k.lower() and 'teval' not in k),
[],
)
datasets += teval_en_datasets
datasets += teval_zh_datasets
# datasets += SciCode_datasets
musr_summary_groups = musr_summarizer['summary_groups']
summary_groups = sum(
[v for k, v in locals().items() if k.endswith('_summary_groups')], [])
summary_groups.append(
{
'name': 'Mathbench',
'subsets': ['mathbench-a (average)', 'mathbench-t (average)'],
}, )
# Summarizer
summarizer = dict(
dataset_abbrs=[
'Language',
['race-high', 'accuracy'],
['ARC-c', 'accuracy'],
['BoolQ', 'accuracy'],
['triviaqa_wiki_1shot', 'score'],
['nq_open_1shot', 'score'],
['mmmlu_lite', 'naive_average'],
'',
'Instruction Following',
['IFEval', 'Prompt-level-strict-accuracy'],
'',
'General Reasoning',
['drop', 'accuracy'],
['bbh', 'naive_average'],
['GPQA_diamond', 'accuracy'],
['hellaswag', 'accuracy'],
['TheoremQA', 'score'],
['musr_average', 'naive_average'],
['korbench_single', 'naive_average'],
['ARC_Prize_Public_Evaluation', 'accuracy'],
'',
'Math Calculation',
['gsm8k', 'accuracy'],
['GaokaoBench', 'weighted_average'],
['math', 'accuracy'],
['cmo_fib', 'accuracy'],
['aime2024', 'accuracy'],
['Mathbench', 'naive_average'],
'',
'Knowledge',
['wikibench-wiki-single_choice_cncircular', 'perf_4'],
['cmmlu', 'naive_average'],
['mmlu', 'naive_average'],
['mmlu_pro', 'naive_average'],
'',
'Code',
['openai_humaneval', 'humaneval_pass@1'],
['sanitized_mbpp', 'score'],
['humanevalx', 'naive_average'],
['ds1000', 'naive_average'],
['lcb_code_generation', 'pass@1'],
['lcb_code_execution', 'pass@1'],
['lcb_test_output', 'pass@1'],
['bigcodebench_hard_instruct', 'pass@1'],
['bigcodebench_hard_complete', 'pass@1'],
'',
'Agent',
['teval', 'naive_average'],
['SciCode', 'accuracy'],
['SciCode', 'sub_accuracy'],
'',
'bbh-logical_deduction_seven_objects',
'bbh-multistep_arithmetic_two',
'',
'mmlu',
'mmlu-stem',
'mmlu-social-science',
'mmlu-humanities',
'mmlu-other',
'',
'cmmlu',
'cmmlu-stem',
'cmmlu-social-science',
'cmmlu-humanities',
'cmmlu-other',
'cmmlu-china-specific',
'',
'mmlu_pro',
'mmlu_pro_biology',
'mmlu_pro_business',
'mmlu_pro_chemistry',
'mmlu_pro_computer_science',
'mmlu_pro_economics',
'mmlu_pro_engineering',
'mmlu_pro_health',
'mmlu_pro_history',
'mmlu_pro_law',
'mmlu_pro_math',
'mmlu_pro_philosophy',
'mmlu_pro_physics',
'mmlu_pro_psychology',
'mmlu_pro_other',
'',
'ds1000_Pandas',
'ds1000_Numpy',
'ds1000_Tensorflow',
'ds1000_Scipy',
'ds1000_Sklearn',
'ds1000_Pytorch',
'ds1000_Matplotlib',
'',
'mmmlu_lite',
'openai_mmmlu_lite_AR-XY',
'openai_mmmlu_lite_BN-BD',
'openai_mmmlu_lite_DE-DE',
'openai_mmmlu_lite_ES-LA',
'openai_mmmlu_lite_FR-FR',
'openai_mmmlu_lite_HI-IN',
'openai_mmmlu_lite_ID-ID',
'openai_mmmlu_lite_IT-IT',
'openai_mmmlu_lite_JA-JP',
'openai_mmmlu_lite_KO-KR',
'openai_mmmlu_lite_PT-BR',
'openai_mmmlu_lite_SW-KE',
'openai_mmmlu_lite_YO-NG',
'openai_mmmlu_lite_ZH-CN',
'',
'###### MathBench-A: Application Part ######',
'college',
'high',
'middle',
'primary',
'arithmetic',
'mathbench-a (average)',
'###### MathBench-T: Theory Part ######',
'college_knowledge',
'high_knowledge',
'middle_knowledge',
'primary_knowledge',
'mathbench-t (average)',
],
summary_groups=summary_groups,
)
for d in datasets:
d['reader_cfg']['test_range'] = '[0:16]'
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
m['abbr'] = m['abbr'] + '_fullbench'
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])

View File

@ -0,0 +1,182 @@
from copy import deepcopy
from mmengine.config import read_base
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.runners import LocalRunner
from opencompass.summarizers import DefaultSubjectiveSummarizer
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
with read_base():
# read hf models - chat models
# Dataset
from opencompass.configs.datasets.chinese_simpleqa.chinese_simpleqa_gen import \
csimpleqa_datasets # noqa: F401, E501
from opencompass.configs.datasets.SimpleQA.simpleqa_gen_0283c3 import \
simpleqa_datasets # noqa: F401, E501; noqa: F401, E501
from opencompass.configs.datasets.subjective.alignbench.alignbench_v1_1_judgeby_critiquellm_new import \
alignbench_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.alpaca_eval.alpacav2_judgeby_gpt4_new import \
alpacav2_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.arena_hard.arena_hard_compare_new import \
arenahard_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.compassarena.compassarena_compare_new import \
compassarena_datasets # noqa: F401, E501
# from opencompass.configs.datasets.subjective.fofo.fofo_bilingual_judge_new import fofo_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.followbench.followbench_llmeval_new import \
followbench_llmeval_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.multiround.mtbench101_judge_new import \
mtbench101_datasets # noqa: F401, E501
from opencompass.configs.datasets.subjective.wildbench.wildbench_pair_judge_new import \
wildbench_datasets # noqa: F401, E501
from opencompass.configs.models.hf_internlm.hf_internlm2_5_7b_chat import \
models as hf_internlm2_5_7b_chat_model # noqa: F401, E501
from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_5_7b_chat import \
models as lmdeploy_internlm2_5_7b_chat_model # noqa: F401, E501
from ...volc import infer as volc_infer # noqa: F401, E501
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')
and 'mtbench101' not in k and 'wildbench' not in k), [])
datasets += mtbench101_datasets # noqa: F401, E501
datasets += wildbench_datasets # noqa: F401, E501
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
reserved_roles=[dict(role='SYSTEM', api_role='SYSTEM')],
)
models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
for m in models:
m['abbr'] = m['abbr'] + '_fullbench'
if 'turbomind' in m['abbr'] or 'lmdeploy' in m['abbr']:
m['engine_config']['max_batch_size'] = 1
m['batch_size'] = 1
models = sorted(models, key=lambda x: x['run_cfg']['num_gpus'])
judge_models = deepcopy([models[1]])
judge_models[0]['abbr'] = judge_models[0]['abbr'] + '-judge'
eval = dict(
partitioner=dict(
type=SubjectiveNaivePartitioner,
models=models,
judge_models=judge_models,
),
runner=dict(type=LocalRunner,
max_num_workers=16,
task=dict(type=SubjectiveEvalTask)),
)
summary_groups = []
summary_groups.append({
'name': 'compassarena_language',
'subsets': [
['compassarena_language', '内容总结'],
],
})
summary_groups.append({
'name': 'compassarena_knowledge',
'subsets': [
['compassarena_knowledge', '生活常识_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_reason_v2',
'subsets': [
['compassarena_reason_v2', 'reasoning'],
],
})
summary_groups.append({
'name': 'compassarena_math_v2',
'subsets': [
['compassarena_math_v2', '高等数学_ZH'],
],
})
summary_groups.append({
'name': 'compassarena_creationv2_zh',
'subsets': [
['compassarena_creationv2_zh', '内容扩写_ZH'],
],
})
summary_groups.append({
'name':
'CompassArena',
'subsets': [
'compassarena_language',
'compassarena_knowledge',
'compassarena_reason_v2',
'compassarena_math_v2',
'compassarena_creationv2_zh',
],
})
summary_groups.append({
'name':
'FoFo',
'subsets': [['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall']],
})
summary_groups.append({
'name':
'Followbench',
'subsets': [
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
],
})
# Summarizer
summarizer = dict(
dataset_abbrs=[
['alignment_bench_v1_1', '总分'],
['alpaca_eval', 'total'],
['arenahard', 'score'],
['Followbench', 'naive_average'],
['CompassArena', 'naive_average'],
['FoFo', 'naive_average'],
['mtbench101', 'avg'],
['wildbench', 'average'],
['simpleqa', 'accuracy_given_attempted'],
['chinese_simpleqa', 'given_attempted_accuracy'],
'',
['alignment_bench_v1_1', '专业能力'],
['alignment_bench_v1_1', '数学计算'],
['alignment_bench_v1_1', '基本任务'],
['alignment_bench_v1_1', '逻辑推理'],
['alignment_bench_v1_1', '中文理解'],
['alignment_bench_v1_1', '文本写作'],
['alignment_bench_v1_1', '角色扮演'],
['alignment_bench_v1_1', '综合问答'],
['alpaca_eval', 'helpful_base'],
['alpaca_eval', 'koala'],
['alpaca_eval', 'oasst'],
['alpaca_eval', 'selfinstruct'],
['alpaca_eval', 'vicuna'],
['compassarena_language', 'naive_average'],
['compassarena_knowledge', 'naive_average'],
['compassarena_reason_v2', 'naive_average'],
['compassarena_math_v2', 'naive_average'],
['compassarena_creationv2_zh', 'naive_average'],
['fofo_test_prompts', 'overall'],
['fofo_test_prompts_cn', 'overall'],
['followbench_llmeval_en', 'HSR_AVG'],
['followbench_llmeval_en', 'SSR_AVG'],
['followbench_llmeval_en', 'HSR_L1'],
['followbench_llmeval_en', 'HSR_L2'],
['followbench_llmeval_en', 'HSR_L3'],
['followbench_llmeval_en', 'HSR_L4'],
['followbench_llmeval_en', 'HSR_L5'],
['followbench_llmeval_en', 'SSR_L1'],
['followbench_llmeval_en', 'SSR_L2'],
['followbench_llmeval_en', 'SSR_L3'],
['followbench_llmeval_en', 'SSR_L4'],
['followbench_llmeval_en', 'SSR_L5'],
['simpleqa', 'f1'],
],
type=DefaultSubjectiveSummarizer,
summary_groups=summary_groups,
)

View File

@ -6,37 +6,29 @@ import yaml
output_path = 'regression_result_daily'
chat_model_list = [
'baichuan2-7b-chat-hf', 'deepseek-7b-chat-hf', 'deepseek-moe-16b-chat-hf',
'deepseek-v2-lite-chat-hf', 'deepseek-7b-chat-vllm', 'gemma2-2b-it-hf',
'gemma2-9b-it-hf', 'gemma-7b-it-vllm', 'internlm2_5-7b-chat-hf',
'internlm2_5-20b-chat-hf', 'internlm2_5-7b-chat-turbomind',
'internlm2_5-20b-chat-turbomind', 'internlm2-chat-1.8b-turbomind',
'internlm2-chat-1.8b-sft-turbomind', 'internlm2-chat-7b-lmdeploy',
'internlm2-chat-7b-sft-turbomind', 'internlm2-chat-7b-vllm',
'llama-3_1-8b-instruct-hf', 'llama-3-8b-instruct-hf',
'llama-3_1-8b-instruct-turbomind', 'llama-3-8b-instruct-turbomind',
'mistral-7b-instruct-v0.3-hf', 'mistral-7b-instruct-v0.2-vllm',
'minicpm-2b-dpo-fp32-hf', 'minicpm-2b-sft-bf16-hf',
'minicpm-2b-sft-fp32-hf', 'phi-3-mini-4k-instruct-hf',
'qwen1.5-0.5b-chat-hf', 'qwen2-1.5b-instruct-hf', 'qwen2-7b-instruct-hf',
'qwen2-1.5b-instruct-turbomind', 'qwen2-7b-instruct-turbomind',
'qwen1.5-0.5b-chat-vllm', 'yi-1.5-6b-chat-hf', 'yi-1.5-9b-chat-hf',
'lmdeploy-api-test'
]
base_model_list = [
'deepseek-moe-16b-base-hf', 'deepseek-v2-lite-hf',
'deepseek-7b-base-turbomind', 'deepseek-moe-16b-base-vllm', 'gemma2-2b-hf',
'gemma2-9b-hf', 'internlm2_5-7b-hf', 'internlm2-7b-hf',
'internlm2-base-7b-hf', 'internlm2-1.8b-turbomind',
'internlm2_5-7b-turbomind', 'internlm2-7b-turbomind',
'internlm2-base-7b-turbomind', 'llama-2-7b-hf', 'llama-3-8b-hf',
'llama-3.1-8b-turbomind', 'llama-3-8b-turbomind', 'mistral-7b-v0.3-hf',
'mistral-7b-v0.2-vllm', 'qwen1.5-moe-a2.7b-hf', 'qwen2-0.5b-hf',
'qwen2-1.5b-hf', 'qwen2-7b-hf', 'qwen2-1.5b-turbomind',
'qwen2-7b-turbomind', 'qwen1.5-0.5b-vllm', 'yi-1.5-6b-hf', 'yi-1.5-9b-hf'
]
dataset_list = ['gsm8k', 'race-middle', 'race-high']
def model_list(type):
config_path = '.github/scripts/oc_score_baseline_testrange.yaml'
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config.get(type).keys()
def dataset_list(model, type):
config_path = '.github/scripts/oc_score_baseline_fullbench.yaml'
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config.get(model).get(type).keys()
@pytest.fixture()
def baseline_scores_testrange(request):
config_path = os.path.join(
request.config.rootdir,
'.github/scripts/oc_score_baseline_testrange.yaml')
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config
@pytest.fixture()
@ -48,6 +40,16 @@ def baseline_scores(request):
return config
@pytest.fixture()
def baseline_scores_fullbench(request):
config_path = os.path.join(
request.config.rootdir,
'.github/scripts/oc_score_baseline_fullbench.yaml')
with open(config_path) as f:
config = yaml.load(f.read(), Loader=yaml.SafeLoader)
return config
@pytest.fixture()
def result_scores():
file = find_csv_files(output_path)
@ -57,100 +59,294 @@ def result_scores():
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
@pytest.mark.chat
@pytest.mark.usefixtures('baseline_scores_testrange')
@pytest.mark.chat_models
class TestChat:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2)
for p1 in chat_model_list
for p2 in dataset_list])
def test_model_dataset_score(self, baseline_scores, result_scores, model,
dataset):
base_score = baseline_scores.get(model).get(dataset)
@pytest.mark.parametrize(
'model, dataset', [(p1, p2) for p1 in model_list('chat')
for p2 in ['gsm8k_accuracy', 'race-high_accuracy']])
def test_model_dataset_score(self, baseline_scores_testrange,
result_scores, model, dataset):
base_score = baseline_scores_testrange.get('chat').get(model).get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, base_score)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_testrange')
@pytest.mark.base_models
class TestBase:
"""Test cases for base model."""
@pytest.mark.parametrize('model, dataset',
[(p1, p2) for p1 in model_list('base') for p2 in [
'gsm8k_accuracy', 'GPQA_diamond_accuracy',
'race-high_accuracy', 'winogrande_accuracy'
]])
def test_model_dataset_score(self, baseline_scores_testrange,
result_scores, model, dataset):
if model in ['gemma-2b-vllm', 'gemma-7b-vllm'
] and dataset != 'gsm8k_accuracy':
return
base_score = baseline_scores_testrange.get('base').get(model).get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.chat_obj_fullbench
class TestChatObjFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-hf_fullbench',
'internlm2_5-7b-chat-turbomind_fullbench'
] for p2 in dataset_list('internlm2_5-7b-chat-hf_fullbench', 'objective')])
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.chat_sub_fullbench
class TestChatSubFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-hf_fullbench',
'internlm2_5-7b-chat-turbomind_fullbench'
] for p2 in dataset_list('internlm2_5-7b-chat-hf_fullbench', 'subjective')]
)
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'subjective').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.base_fullbench
class TestBaseFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in
['internlm2_5-7b-hf_fullbench', 'internlm2_5-7b-turbomind_fullbench']
for p2 in dataset_list('internlm2_5-7b-hf_fullbench', 'objective')])
def test_model_dataset_score(self, baseline_scores_fullbench,
result_scores, model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
@pytest.mark.base
class TestBase:
"""Test cases for base model."""
@pytest.mark.api
class TestApibench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2)
for p1 in base_model_list
for p2 in dataset_list])
def test_model_dataset_score(self, baseline_scores, result_scores, model,
dataset):
if model == 'mistral-7b-v0.2-vllm' and dataset == 'race-high':
return
@pytest.mark.parametrize('model, dataset',
[('lmdeploy-api-test', 'race-middle_accuracy'),
('lmdeploy-api-test', 'race-high_accuracy'),
('lmdeploy-api-test', 'gsm8k_accuracy')])
def test_api(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, base_score)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores_fullbench')
@pytest.mark.volc_fullbench
class TestVolcFullbench:
"""Test cases for chat model."""
@pytest.mark.parametrize('model, dataset', [(p1, p2) for p1 in [
'internlm2_5-7b-chat-turbomind', 'qwen2.5-7b-instruct-turbomind',
'internlm2_5-7b-chat-pytorch', 'qwen2.5-7b-instruct-pytorch',
'internlm3-8b-instruct-turbomind', 'internlm3-8b-instruct-pytorch'
] for p2 in dataset_list(p1, 'objective')])
@pytest.mark.chat_objective
def test_chat_objective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize('model, dataset', [
(p1, p2) for p1 in ['internlm2_5-7b-chat-turbomind']
for p2 in dataset_list('internlm2_5-7b-chat-turbomind', 'subjective')
])
@pytest.mark.chat_subjective
def test_chat_subjective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'subjective').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in ['internlm2_5-7b-turbomind']
for p2 in dataset_list('internlm2_5-7b-turbomind', 'objective')])
@pytest.mark.base_objective
def test_base_objective(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get('objective').get(
dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2) for p1 in ['internlm2_5-7b-turbomind']
for p2 in dataset_list('internlm2_5-7b-turbomind', 'long_context')])
@pytest.mark.base_long_context
def test_base_long_context(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'long_context').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.parametrize(
'model, dataset',
[(p1, p2)
for p1 in ['internlm2_5-7b-chat-1m-turbomind'] for p2 in dataset_list(
'internlm2_5-7b-chat-1m-turbomind', 'long_context')])
@pytest.mark.chat_long_context
def test_chat_long_context(self, baseline_scores_fullbench, result_scores,
model, dataset):
base_score = baseline_scores_fullbench.get(model).get(
'long_context').get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.usefixtures('result_scores')
@pytest.mark.usefixtures('baseline_scores')
class TestCmdCase:
@pytest.mark.case1
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b-hf', 'race-middle'),
('internlm2_5-7b-hf', 'race-high')])
def test_cmd_case1(self, result_scores, model, dataset):
if len(result_scores.keys()) != 1:
assert False, 'result is none'
[('internlm2_5-7b-hf', 'race-middle_accuracy'),
('internlm2_5-7b-hf', 'race-high_accuracy'),
('internlm2_5-7b-hf', 'demo_gsm8k_accuracy')])
def test_cmd_case1(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, 91)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.case2
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b-chat-lmdeploy', 'race-middle'),
('internlm2_5-7b-chat-lmdeploy', 'race-high')])
def test_cmd_case2(self, result_scores, model, dataset):
if len(result_scores.keys()) != 1:
assert False, 'result is none'
@pytest.mark.parametrize(
'model, dataset',
[('internlm2_5-7b-chat-lmdeploy', 'race-middle_accuracy'),
('internlm2_5-7b-chat-lmdeploy', 'race-high_accuracy'),
('internlm2_5-7b-chat-lmdeploy', 'demo_gsm8k_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'race-middle_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'race-high_accuracy'),
('internlm3-8b-instruct-lmdeploy', 'demo_gsm8k_accuracy')])
def test_cmd_case2(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, 91)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.case3
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b_hf', 'race-middle'),
('internlm2_5-7b_hf', 'race-high')])
def test_cmd_case3(self, result_scores, model, dataset):
if len(result_scores.keys()) != 1:
assert False, 'result is none'
[('internlm2_5-7b_hf', 'race-middle_accuracy'),
('internlm2_5-7b_hf', 'race-high_accuracy'),
('internlm2_5-7b_hf', 'demo_gsm8k_accuracy')])
def test_cmd_case3(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, 91)
assert_score(model, result_score, base_score, dataset)
@pytest.mark.case4
@pytest.mark.parametrize('model, dataset',
[('internlm2_5-7b-chat_hf', 'race-middle'),
('internlm2_5-7b-chat_hf', 'race-high')])
def test_cmd_case4(self, result_scores, model, dataset):
if len(result_scores.keys()) != 1:
assert False, 'result is none'
@pytest.mark.parametrize(
'model, dataset',
[('internlm3-8b-instruct_hf-lmdeploy', 'race-middle_accuracy'),
('internlm3-8b-instruct_hf-lmdeploy', 'race-high_accuracy'),
('internlm3-8b-instruct_hf-lmdeploy', 'demo_gsm8k_accuracy')])
def test_cmd_case4(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(result_score, 91)
assert_score(model + '_batch', result_score, base_score, dataset)
@pytest.mark.case5
@pytest.mark.parametrize(
'model, dataset',
[('internlm3-8b-instruct_hf-vllm', 'race-middle_accuracy'),
('internlm3-8b-instruct_hf-vllm', 'race-high_accuracy'),
('internlm3-8b-instruct_hf-vllm', 'demo_gsm8k_accuracy')])
def test_cmd_case5(self, baseline_scores, result_scores, model, dataset):
base_score = baseline_scores.get(model).get(dataset)
result_score = result_scores.get(model).get(dataset)
assert_score(model + '_batch', result_score, base_score, dataset)
def assert_score(score, baseline):
def assert_score(model_type, score, baseline, dataset: str = ''):
if score is None or score == '-':
assert False, 'value is none'
if float(score) <= (baseline + 5) and float(score) >= (baseline - 5):
print(score + ' between ' + str(baseline - 5) + ' and ' +
str(baseline + 5))
assert True
if 'batch' not in model_type:
if float(score) <= (baseline + 0.01) and float(score) >= (baseline -
0.01):
print(' '.join([score, 'is equal', str(baseline)]))
assert True
else:
print(' '.join([score, 'is not equal', str(baseline)]))
assert False, ' '.join([score, 'is not equal', str(baseline)])
else:
assert False, score + ' not between ' + str(
baseline - 5) + ' and ' + str(baseline + 5)
if dataset.startswith('dingo') or dataset.startswith(
'GPQA') or dataset.startswith('high') or dataset.startswith(
'mmlu_pro_') or dataset.startswith(
'alpaca_eval') or dataset.startswith('compassarena_'):
threshold = 5
elif dataset.startswith('humanevalx') or dataset == 'large_threshold':
threshold = 10
else:
threshold = 3
if float(score) <= (baseline + threshold) and float(score) >= (
baseline - threshold):
print(' '.join([
score, 'is between',
str(baseline - threshold), 'and',
str(baseline + threshold)
]))
assert True
else:
print(' '.join([
score, 'is not between',
str(baseline - threshold), 'and',
str(baseline + threshold)
]))
assert False, ' '.join([
score, 'is not between',
str(baseline - threshold), 'and',
str(baseline + threshold)
])
def find_csv_files(directory):
csv_files = []
for root, dirs, files in os.walk(directory):
for file in files:
if file.endswith('.csv'):
if file.endswith('.csv') and file.startswith('summary'):
csv_files.append(os.path.join(root, file))
csv_files_with_time = {f: os.path.getctime(f) for f in csv_files}
@ -163,14 +359,15 @@ def read_csv_file(file_path):
with open(file_path, 'r') as csvfile:
reader = csv.DictReader(csvfile)
filtered_data = []
for row in reader:
filtered_row = {
k: v
for k, v in row.items()
if k not in ['version', 'metric', 'mode']
}
filtered_data.append(filtered_row)
if row['metric'] is not None and 'bpb' not in row[
'metric'] and '_' != row['metric']:
filtered_row = row
filtered_row['dataset'] = row['dataset'] + '_' + row['metric']
del filtered_row['version']
del filtered_row['metric']
del filtered_row['mode']
filtered_data.append(filtered_row)
result = {}
for data in filtered_data:

View File

@ -1,369 +1,39 @@
baichuan2-7b-chat-hf:
gsm8k: 30
race-middle: 74
race-high: 79
internlm2_5-7b-hf:
demo_gsm8k_accuracy: 42.19
race-middle_accuracy: 91.78
race-high_accuracy: 90.02
glm-4-9b-chat-hf:
gsm8k: 75
race-middle: 88
race-high: 88
internlm2_5-7b_hf:
demo_gsm8k_accuracy: 42.19
race-middle_accuracy: 91.78
race-high_accuracy: 90.02
glm-4-9b-chat-turbomind:
gsm8k: 69
race-middle: 82
race-high: 77
internlm2_5-7b-chat-lmdeploy:
demo_gsm8k_accuracy: 84.38
race-middle_accuracy: 92.76
race-high_accuracy: 90.54
glm-4-9b-chat-vllm:
gsm8k: 73
race-middle: 87
race-high: 87
internlm3-8b-instruct-lmdeploy:
demo_gsm8k_accuracy: 73.44
race-middle_accuracy: 93.38
race-high_accuracy: 90.34
deepseek-7b-chat-hf:
gsm8k: 60
race-middle: 74
race-high: 80
internlm3-8b-instruct_hf-lmdeploy:
demo_gsm8k_accuracy: 73.44
race-middle_accuracy: 93.38
race-high_accuracy: 90.34
deepseek-moe-16b-chat-hf:
gsm8k: 62
race-middle: 62
race-high: 70
internlm3-8b-instruct_hf-vllm:
demo_gsm8k_accuracy: 78.12
race-middle_accuracy: 92.20
race-high_accuracy: 89.88
deepseek-v2-lite-chat-hf:
gsm8k: 59
race-middle: 82
race-high: 79
deepseek-7b-chat-vllm:
gsm8k: 63
race-middle: 74
race-high: 79
gemma-2b-it-hf:
gsm8k: 14
race-middle: 62
race-high: 52
gemma-7b-it-hf:
gsm8k: 39
race-middle: 74
race-high: 71
gemma-7b-it-vllm:
gsm8k: 38
race-middle: 75
race-high: 70
gemma2-2b-it-hf:
gsm8k: 62
race-middle: 75
race-high: 67
gemma2-9b-it-hf:
gsm8k: 80
race-middle: 89
race-high: 85
internlm2_5-7b-chat-hf:
gsm8k: 86
race-middle: 92
race-high: 93
internlm2_5-20b-chat-hf:
gsm8k: 91
race-middle: 95
race-high: 91
internlm2_5-7b-chat-turbomind:
gsm8k: 87
race-middle: 92
race-high: 93
internlm2_5-20b-chat-turbomind:
gsm8k: 91
race-middle: 95
race-high: 91
internlm2-chat-1.8b-turbomind:
gsm8k: 40
race-middle: 82
race-high: 83
internlm2-chat-1.8b-sft-turbomind:
gsm8k: 34
race-middle: 81
race-high: 83
internlm2-chat-7b-lmdeploy:
gsm8k: 69
race-middle: 90
race-high: 88
internlm2-chat-7b-sft-turbomind:
gsm8k: 71
race-middle: 91
race-high: 92
internlm2-chat-7b-vllm:
gsm8k: 63
race-middle: 90
race-high: 91
llama-3_1-8b-instruct-hf:
gsm8k: 82
race-middle: 82
race-high: 88
llama-3-8b-instruct-hf:
gsm8k: 77
race-middle: 85
race-high: 87
llama-3_1-8b-instruct-turbomind:
gsm8k: 79
race-middle: 82
race-high: 88
llama-3-8b-instruct-turbomind:
gsm8k: 77
race-middle: 85
race-high: 89
mistral-7b-instruct-v0.2-hf:
gsm8k: 48
race-middle: 82
race-high: 78
mistral-7b-instruct-v0.3-hf:
gsm8k: 53
race-middle: 80
race-high: 78
mistral-7b-instruct-v0.2-vllm:
gsm8k: 49
race-middle: 81
race-high: 77
minicpm-2b-dpo-fp32-hf:
gsm8k: 58
race-middle: 66
race-high: 74
minicpm-2b-sft-bf16-hf:
gsm8k: 58
race-middle: 75
race-high: 81
minicpm-2b-sft-fp32-hf:
gsm8k: 58
race-middle: 75
race-high: 81
phi-3-mini-4k-instruct-hf:
gsm8k: 67
race-middle: 81
race-high: 84
phi-3-small-8k-instruct-hf:
gsm8k: 88
race-middle: 89
race-high: 88
qwen1.5-0.5b-chat-hf:
gsm8k: 5
race-middle: 55
race-high: 50
qwen2-1.5b-instruct-hf:
gsm8k: 63
race-middle: 77
race-high: 86
qwen2-1.5b-instruct-turbomind:
gsm8k: 60
race-middle: 77
race-high: 86
qwen2-7b-instruct-turbomind:
gsm8k: 88
race-middle: 87
race-high: 89
qwen2-7b-instruct-hf:
gsm8k: 85
race-middle: 87
race-high: 91
qwen1.5-0.5b-chat-vllm:
gsm8k: 5
race-middle: 57
race-high: 51
yi-1.5-6b-chat-hf:
gsm8k: 72
race-middle: 88
race-high: 86
yi-1.5-9b-chat-hf:
gsm8k: 81
race-middle: 89
race-high: 91
internlm2_5-7b-chat_hf:
demo_gsm8k_accuracy: 87.50
race-middle_accuracy: 92.76
race-high_accuracy: 90.48
lmdeploy-api-test:
gsm8k: 90
race-middle: 95
race-high: 96
deepseek-moe-16b-base-hf:
gsm8k: 25
race-middle: 35
race-high: 23
deepseek-v2-lite-hf:
gsm8k: 37
race-middle: 56
race-high: 62
deepseek-7b-base-turbomind:
gsm8k: 21
race-middle: 42
race-high: 42
deepseek-moe-16b-base-vllm:
gsm8k: 22
race-middle: 35
race-high: 20
gemma-2b-hf:
gsm8k: 19
race-middle: 33
race-high: 26
gemma-7b-hf:
gsm8k: 65
race-middle: 59
race-high: 66
gemma2-2b-hf:
gsm8k: 33
race-middle: 56
race-high: 58
gemma2-9b-hf:
gsm8k: 70
race-middle: 82
race-high: 84
internlm2_5-7b-hf:
gsm8k: 47
race-middle: 92
race-high: 91
internlm2-7b-hf:
gsm8k: 65
race-middle: 77
race-high: 72
internlm2-base-7b-hf:
gsm8k: 5
race-middle: 71
race-high: 74
internlm2_5-7b-turbomind:
gsm8k: 73
race-middle: 90
race-high: 91
internlm2-1.8b-turbomind:
gsm8k: 25
race-middle: 75
race-high: 72
internlm2-7b-turbomind:
gsm8k: 67
race-middle: 78
race-high: 76
internlm2-base-7b-turbomind:
gsm8k: 39
race-middle: 75
race-high: 81
llama-2-7b-hf:
gsm8k: 17
race-middle: 32
race-high: 38
llama-3-8b-hf:
gsm8k: 48
race-middle: 64
race-high: 70
llama-3.1-8b-turbomind:
gsm8k: 57
race-middle: 67
race-high: 75
llama-3-8b-turbomind:
gsm8k: 52
race-middle: 63
race-high: 70
mistral-7b-v0.2-hf:
gsm8k: 43
race-middle: 42
race-high: 60
mistral-7b-v0.3-hf:
gsm8k: 43
race-middle: 42
race-high: 60
mistral-7b-v0.2-vllm:
gsm8k: 45
race-middle: 42
race-high: 58
qwen1.5-moe-a2.7b-hf:
gsm8k: 64
race-middle: 78
race-high: 90
qwen2-1.5b-hf:
gsm8k: 58
race-middle: 65
race-high: 78
qwen2-0.5b-hf:
gsm8k: 35
race-middle: 52
race-high: 48
qwen2-7b-hf:
gsm8k: 82
race-middle: 88
race-high: 89
qwen2-1.5b-turbomind:
gsm8k: 57
race-middle: 64
race-high: 78
qwen2-7b-turbomind:
gsm8k: 83
race-middle: 88
race-high: 88
qwen1.5-0.5b-vllm:
gsm8k: 12
race-middle: 54
race-high: 59
yi-1.5-6b-hf:
gsm8k: 59
race-middle: 81
race-high: 89
yi-1.5-9b-hf:
gsm8k: 77
race-middle: 90
race-high: 90
gsm8k_accuracy: 68.75
race-middle_accuracy: 93.75
race-high_accuracy: 93.75

View File

@ -0,0 +1,983 @@
internlm2_5-7b-chat-hf_fullbench:
objective:
race-high_accuracy: 93.75
ARC-c_accuracy: 93.75
BoolQ_accuracy: 81.25
triviaqa_wiki_1shot_score: 50
nq_open_1shot_score: 25
IFEval_Prompt-level-strict-accuracy: 50
drop_accuracy: 81.25
GPQA_diamond_accuracy: 25
hellaswag_accuracy: 87.5
TheoremQA_score: 12.50
musr_average_naive_average: 39.58
korbench_single_naive_average: 40
gsm8k_accuracy: 62.50
math_accuracy: 75
cmo_fib_accuracy: 6.25
aime2024_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 50
sanitized_mbpp_score: 68.75
ds1000_naive_average: 16.96
lcb_code_generation_pass@1: 12.5
lcb_code_execution_pass@1: 43.75
lcb_test_output_pass@1: 18.75
bbh-logical_deduction_seven_objects_score: 50
bbh-multistep_arithmetic_two_score: 68.75
mmlu-other_accuracy: 72.6
cmmlu-china-specific_accuracy: 76.25
mmlu_pro_math_accuracy: 25
ds1000_Pandas_accuracy: 12.5
ds1000_Numpy_accuracy: 0
ds1000_Tensorflow_accuracy: 12.5
ds1000_Scipy_accuracy: 18.75
ds1000_Sklearn_accuracy: 18.75
ds1000_Pytorch_accuracy: 12.5
ds1000_Matplotlib_accuracy: 43.75
openai_mmmlu_lite_AR-XY_accuracy: 37.5
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
subjective:
alignment_bench_v1_1_总分: 0.66
alpaca_eval_total: 20.00
arenahard_score: 56.82
Followbench_naive_average: 1
CompassArena_naive_average: 43
mtbench101_avg: 7.60
wildbench_average: -14.58
simpleqa_accuracy_given_attempted: 1.00
chinese_simpleqa_given_attempted_accuracy: 0.90
alignment_bench_v1_1_专业能力: 7.90
alignment_bench_v1_1_数学计算: 0
alignment_bench_v1_1_基本任务: 0
alignment_bench_v1_1_逻辑推理: 0
alignment_bench_v1_1_中文理解: 0
alignment_bench_v1_1_文本写作: 0
alignment_bench_v1_1_角色扮演: 0
alignment_bench_v1_1_综合问答: 0
alpaca_eval_helpful_base: 20.00
compassarena_language_naive_average: 35
compassarena_knowledge_naive_average: 60.00
compassarena_reason_v2_naive_average: 40
compassarena_math_v2_naive_average: 50.00
compassarena_creationv2_zh_naive_average: 30
followbench_llmeval_en_HSR_AVG: 1
followbench_llmeval_en_SSR_AVG: 1
followbench_llmeval_en_HSR_L1: 1
followbench_llmeval_en_HSR_L2: 1
followbench_llmeval_en_HSR_L3: 1
followbench_llmeval_en_HSR_L4: 1
followbench_llmeval_en_HSR_L5: 1
followbench_llmeval_en_SSR_L1: 1
followbench_llmeval_en_SSR_L2: 1
followbench_llmeval_en_SSR_L3: 1
followbench_llmeval_en_SSR_L4: 1
followbench_llmeval_en_SSR_L5: 1
simpleqa_f1: 0.12
internlm2_5-7b-chat-turbomind_fullbench:
objective:
race-high_accuracy: 93.75
ARC-c_accuracy: 93.75
BoolQ_accuracy: 75.00
triviaqa_wiki_1shot_score: 50
nq_open_1shot_score: 25
IFEval_Prompt-level-strict-accuracy: 56.25
drop_accuracy: 75
GPQA_diamond_accuracy: 37.50
hellaswag_accuracy: 81.25
TheoremQA_score: 12.5
musr_average_naive_average: 39.58
korbench_single_naive_average: 40
gsm8k_accuracy: 68.75
math_accuracy: 68.75
cmo_fib_accuracy: 6.25
aime2024_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 25
sanitized_mbpp_score: 68.75
ds1000_naive_average: 15.18
lcb_code_generation_pass@1: 12.5
lcb_code_execution_pass@1: 43.75
lcb_test_output_pass@1: 0.00
bbh-logical_deduction_seven_objects_score: 62.50
bbh-multistep_arithmetic_two_score: 62.50
mmlu-other_accuracy: 73.08
cmmlu-china-specific_accuracy: 75.42
mmlu_pro_math_accuracy: 25.00
ds1000_Pandas_accuracy: 0.00
ds1000_Numpy_accuracy: 0
ds1000_Tensorflow_accuracy: 12.5
ds1000_Scipy_accuracy: 18.75
ds1000_Sklearn_accuracy: 18.75
ds1000_Pytorch_accuracy: 12.50
ds1000_Matplotlib_accuracy: 43.75
openai_mmmlu_lite_AR-XY_accuracy: 37.5
college_naive_average: 12.50
college_knowledge_naive_average: 87.5
subjective:
alignment_bench_v1_1_总分: 0.72
alpaca_eval_total: 20.00
arenahard_score: 55.77
Followbench_naive_average: 1
CompassArena_naive_average: 39.00
mtbench101_avg: 7.90
wildbench_average: 0.00
simpleqa_accuracy_given_attempted: 1.00
chinese_simpleqa_given_attempted_accuracy: 1
alignment_bench_v1_1_专业能力: 8.70
alignment_bench_v1_1_数学计算: 0
alignment_bench_v1_1_基本任务: 0
alignment_bench_v1_1_逻辑推理: 0
alignment_bench_v1_1_中文理解: 0
alignment_bench_v1_1_文本写作: 0
alignment_bench_v1_1_角色扮演: 0
alignment_bench_v1_1_综合问答: 0
alpaca_eval_helpful_base: 20.00
compassarena_language_naive_average: 25.00
compassarena_knowledge_naive_average: 55.00
compassarena_reason_v2_naive_average: 35.00
compassarena_math_v2_naive_average: 55.00
compassarena_creationv2_zh_naive_average: 25.00
followbench_llmeval_en_HSR_AVG: 1
followbench_llmeval_en_SSR_AVG: 1
followbench_llmeval_en_HSR_L1: 1
followbench_llmeval_en_HSR_L2: 1
followbench_llmeval_en_HSR_L3: 1
followbench_llmeval_en_HSR_L4: 1
followbench_llmeval_en_HSR_L5: 1
followbench_llmeval_en_SSR_L1: 1
followbench_llmeval_en_SSR_L2: 1
followbench_llmeval_en_SSR_L3: 1
followbench_llmeval_en_SSR_L4: 1
followbench_llmeval_en_SSR_L5: 1
simpleqa_f1: 0.12
internlm2_5-7b-hf_fullbench:
objective:
race-high_accuracy: 100
ARC-c_accuracy: 68.75
BoolQ_accuracy: 87.5
triviaqa_wiki_1shot_score: 43.75
nq_open_1shot_score: 43.75
drop_accuracy: 62.5
GPQA_diamond_accuracy: 62.5
hellaswag_accuracy: 93.75
TheoremQA_score: 18.75
winogrande_accuracy: 75
gsm8k_accuracy: 37.5
GaokaoBench_2010-2022_Math_II_MCQs_score: 62.5
GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank_score: 0
math_accuracy: 12.5
wikibench-wiki-single_choice_cncircular_perf_4: 25
sanitized_mbpp_score: 56.25
dingo_en_192_score: 37.5
dingo_zh_170_score: 100
mmlu-other_accuracy: 76.92
cmmlu-china-specific_accuracy: 84.17
mmlu_pro_math_accuracy: 18.75
bbh-logical_deduction_seven_objects_score: 43.75
bbh-multistep_arithmetic_two_score: 56.25
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
internlm2_5-7b-turbomind_fullbench:
objective:
race-high_accuracy: 100
ARC-c_accuracy: 68.75
BoolQ_accuracy: 87.5
triviaqa_wiki_1shot_score: 43.75
nq_open_1shot_score: 43.75
drop_accuracy: 62.5
GPQA_diamond_accuracy: 68.75
hellaswag_accuracy: 93.75
TheoremQA_score: 18.75
winogrande_accuracy: 87.5
gsm8k_accuracy: 62.50
GaokaoBench_2010-2022_Math_II_MCQs_score: 93.75
GaokaoBench_2010-2022_Math_II_Fill-in-the-Blank_score: 0
math_accuracy: 6.25
wikibench-wiki-single_choice_cncircular_perf_4: 0.00
sanitized_mbpp_score: 62.50
dingo_en_192_score: 37.50
dingo_zh_170_score: 100.00
mmlu-other_accuracy: 78.37
cmmlu-china-specific_accuracy: 83.33
mmlu_pro_math_accuracy: 18.75
bbh-logical_deduction_seven_objects_score: 62.50
bbh-multistep_arithmetic_two_score: 50.00
college_naive_average: 12.5
college_knowledge_naive_average: 87.5
internlm2_5-7b-turbomind:
objective:
race-high_accuracy: 89.28
ARC-c_accuracy: 52.2
BoolQ_accuracy: 89.72
triviaqa_wiki_1shot_score: 65.88
nq_open_1shot_score: 34.82
drop_accuracy: 68.1
bbh_naive_average: 72.15
GPQA_diamond_accuracy: 32.83
hellaswag_accuracy: 88.36
TheoremQA_score: 25
winogrande_accuracy: 81.29
gsm8k_accuracy: 74.68
GaokaoBench_weighted_average: 58.19
math_accuracy: 33.98
Mathbench_naive_average: 48.38
wikibench-wiki-single_choice_cncircular_perf_4: 29.1
cmmlu_naive_average: 78.94
mmlu_naive_average: 71.44
mmlu_pro_naive_average: 38.18
openai_humaneval_humaneval_pass@1: 59.76
openai_humaneval_v2_humaneval_pass@1: 57.93
sanitized_mbpp_score: 55.25
dingo_en_192_score: 60.94
dingo_zh_170_score: 67.65
mmlu-stem_accuracy: 63.72
mmlu-social-science_accuracy: 80.15
mmlu-humanities_accuracy: 74.27
mmlu-other_accuracy: 71.85
cmmlu-stem_accuracy: 67.07
cmmlu-social-science_accuracy: 81.49
cmmlu-humanities_accuracy: 85.84
cmmlu-other_accuracy: 82.69
cmmlu-china-specific_accuracy: 79.88
mmlu_pro_biology_accuracy: 58.58
mmlu_pro_business_accuracy: 28.01
mmlu_pro_chemistry_accuracy: 22.79
mmlu_pro_computer_science_accuracy: 39.02
mmlu_pro_economics_accuracy: 53.08
mmlu_pro_engineering_accuracy: 25.7
mmlu_pro_health_accuracy: 46.94
mmlu_pro_history_accuracy: 43.04
mmlu_pro_law_accuracy: 29.7
mmlu_pro_math_accuracy: 24.2
mmlu_pro_philosophy_accuracy: 42.48
mmlu_pro_physics_accuracy: 26.02
mmlu_pro_psychology_accuracy: 52.76
mmlu_pro_other_accuracy: 42.21
college_naive_average: 7.00
high_naive_average: 6.67
middle_naive_average: 26.67
primary_naive_average: 64.00
arithmetic_naive_average: 55
mathbench-a (average)_naive_average: 31.8
college_knowledge_naive_average: 58.23
high_knowledge_naive_average: 52.51
middle_knowledge_naive_average: 71.15
primary_knowledge_naive_average: 60.48
mathbench-t (average)_naive_average: 60.19
long_context:
Single-Needle-Retrieval(S-RT)-32000_naive_average: 100
Single-Needle-Retrieval-EN-32000_naive_average: 100
Single-Needle-Retrieval-ZH-32000_naive_average: 100
Single-Needle-Retrieval(S-RT)-100000_naive_average: 100
Single-Needle-Retrieval-EN-100000_naive_average: 100
Single-Needle-Retrieval-ZH-100000_naive_average: 100
Single-Needle-Retrieval(S-RT)-200000_naive_average: 100
Single-Needle-Retrieval-EN-200000_naive_average: 100
Single-Needle-Retrieval-ZH-200000_naive_average: 100
longbench_naive_average: 46.19
longbench_zh_naive_average: 49.3
longbench_en_naive_average: 43.97
longbench_single-document-qa_score: 42.84
longbench_multi-document-qa_score: 41.25
longbench_summarization_score: 23.21
longbench_few-shot-learning_score: 61.67
longbench_synthetic-tasks_score: 60.05
longbench_code-completion_score: 52.09
internlm2_5-7b-chat-turbomind:
objective:
race-high_accuracy: 86.16
ARC-c_accuracy: 90.17
BoolQ_accuracy: 87.89
triviaqa_wiki_1shot_score: 64.91
nq_open_1shot_score: 22.69
mmmlu_lite_naive_average: 44.96
IFEval_Prompt-level-strict-accuracy: 58.04
drop_accuracy: 77.68
bbh_naive_average: 73.14
GPQA_diamond_accuracy: 31.06
hellaswag_accuracy: 94.79
TheoremQA_score: 22.25
musr_average_naive_average: 50.89
korbench_single_naive_average: 32.16
ARC_Prize_Public_Evaluation_accuracy: 0.02
gsm8k_accuracy: 86.73
GaokaoBench_weighted_average: 78.6
math_accuracy: 61
cmo_fib_accuracy: 11
aime2024_accuracy: 3.33
Mathbench_naive_average: 64.23
wikibench-wiki-single_choice_cncircular_perf_4: 31.32
cmmlu_naive_average: 74.3
mmlu_naive_average: 70.84
mmlu_pro_naive_average: 44.98
openai_humaneval_humaneval_pass@1: 69.8
sanitized_mbpp_score: 64.4
humanevalx_naive_average: 33.35
ds1000_naive_average: 14.15
lcb_code_generation_pass@1: 17.75
lcb_code_execution_pass@1: 32.57
lcb_test_output_pass@1: 26.13
bigcodebench_hard_instruct_pass@1: 3.38
bigcodebench_hard_complete_pass@1: 5.06
teval_naive_average: 80
SciCode_sub_accuracy: 5.56
qa_dingo_cn_score: 99.01
mmlu-stem_accuracy: 68.2
mmlu-social-science_accuracy: 75.8
mmlu-humanities_accuracy: 69.3
mmlu-other_accuracy: 71.3
cmmlu-stem_accuracy: 66.64
cmmlu-social-science_accuracy: 76
cmmlu-humanities_accuracy: 77.9
cmmlu-other_accuracy: 77.25
cmmlu-china-specific_accuracy: 73.6
mmlu_pro_biology_accuracy: 66.67
mmlu_pro_business_accuracy: 47.91
mmlu_pro_chemistry_accuracy: 35
mmlu_pro_computer_science_accuracy: 48.9
mmlu_pro_economics_accuracy: 55.87
mmlu_pro_engineering_accuracy: 29.62
mmlu_pro_health_accuracy: 45
mmlu_pro_history_accuracy: 40.8
mmlu_pro_law_accuracy: 25.79
mmlu_pro_math_accuracy: 53.48
mmlu_pro_philosophy_accuracy: 38.38
mmlu_pro_physics_accuracy: 37.79
mmlu_pro_psychology_accuracy: 58.39
mmlu_pro_other_accuracy: 46.27
humanevalx-python_pass@1: 53.66
humanevalx-cpp_pass@1: 22.56
humanevalx-go_pass@1: 0
humanevalx-js_pass@1: 54.88
ds1000_Pandas_accuracy: 10.65
ds1000_Numpy_accuracy: 3.63
ds1000_Tensorflow_accuracy: 13.33
ds1000_Scipy_accuracy: 8.96
ds1000_Sklearn_accuracy: 6.96
ds1000_Pytorch_accuracy: 6.62
ds1000_Matplotlib_accuracy: 49.35
openai_mmmlu_lite_AR-XY_accuracy: 17.19
openai_mmmlu_lite_BN-BD_accuracy: 26.78
openai_mmmlu_lite_DE-DE_accuracy: 51.27
openai_mmmlu_lite_ES-LA_accuracy: 56.94
openai_mmmlu_lite_FR-FR_accuracy: 58.22
openai_mmmlu_lite_HI-IN_accuracy: 30.75
openai_mmmlu_lite_ID-ID_accuracy: 50.6
openai_mmmlu_lite_IT-IT_accuracy: 50.6
openai_mmmlu_lite_JA-JP_accuracy: 51.13
openai_mmmlu_lite_KO-KR_accuracy: 45
openai_mmmlu_lite_PT-BR_accuracy: 57.68
openai_mmmlu_lite_SW-KE_accuracy: 32.56
openai_mmmlu_lite_YO-NG_accuracy: 32.42
openai_mmmlu_lite_ZH-CN_accuracy: 65.4
college_naive_average: 19.17
high_naive_average: 46.5
middle_naive_average: 61.34
primary_naive_average: 73.34
arithmetic_naive_average: 61.67
mathbench-a (average)_naive_average: 52.58
college_knowledge_naive_average: 67.1
high_knowledge_naive_average: 70
middle_knowledge_naive_average: 80
primary_knowledge_naive_average: 90.12
mathbench-t (average)_naive_average: 76
subjective:
alignment_bench_v1_1_总分: 5.68
alpaca_eval_total: 25.96
arenahard_score: 17.15
Followbench_naive_average: 0.81
CompassArena_naive_average: 39.49
FoFo_naive_average: 0.38
mtbench101_avg: 8.01
wildbench_average: -10.49
simpleqa_accuracy_given_attempted: 0.04
chinese_simpleqa_given_attempted_accuracy: 0.34
alignment_bench_v1_1_专业能力: 6.05
alignment_bench_v1_1_数学计算: 5.87
alignment_bench_v1_1_基本任务: 6.01
alignment_bench_v1_1_逻辑推理: 4.48
alignment_bench_v1_1_中文理解: 6.17
alignment_bench_v1_1_文本写作: 6.06
alignment_bench_v1_1_角色扮演: 6.3
alignment_bench_v1_1_综合问答: 6.45
alpaca_eval_helpful_base: 17.83
alpaca_eval_koala: 28.21
alpaca_eval_oasst: 23.4
alpaca_eval_selfinstruct: 30.95
alpaca_eval_vicuna: 25.00
compassarena_language_naive_average: 53.00
compassarena_knowledge_naive_average: 36
compassarena_reason_v2_naive_average: 35
compassarena_math_v2_naive_average: 16.07
compassarena_creationv2_zh_naive_average: 43.64
fofo_test_prompts_overall: 0.35
fofo_test_prompts_cn_overall: 0.41
followbench_llmeval_en_HSR_AVG: 0.73
followbench_llmeval_en_SSR_AVG: 0.88
followbench_llmeval_en_HSR_L1: 0.94
followbench_llmeval_en_HSR_L2: 0.77
followbench_llmeval_en_HSR_L3: 0.73
followbench_llmeval_en_HSR_L4: 0.68
followbench_llmeval_en_HSR_L5: 0.54
followbench_llmeval_en_SSR_L1: 0.94
followbench_llmeval_en_SSR_L2: 0.88
followbench_llmeval_en_SSR_L3: 0.87
followbench_llmeval_en_SSR_L4: 0.87
followbench_llmeval_en_SSR_L5: 0.85
simpleqa_f1: 0.04
internlm2_5-7b-chat-1m-turbomind:
long_context:
ruler_8k_naive_average: 88.53
ruler_32k_naive_average: 83.84
ruler_128k_naive_average: 70.94
NeedleBench-Overall-Score-8K_weighted_average: 91.89
NeedleBench-Overall-Score-32K_weighted_average: 91.42
NeedleBench-Overall-Score-128K_weighted_average: 88.57
longbench_naive_average: 46.44
longbench_zh_naive_average: 45.19
longbench_en_naive_average: 45.71
babilong_0k_naive_average: 79.3
babilong_4k_naive_average: 67
babilong_16k_naive_average: 52.7
babilong_32k_naive_average: 48.9
babilong_128k_naive_average: 40.8
babilong_256k_naive_average: 23.5
longbench_single-document-qa_score: 43.56
longbench_multi-document-qa_score: 46.24
longbench_summarization_score: 24.32
longbench_few-shot-learning_score: 51.67
longbench_synthetic-tasks_score: 66.83
longbench_code-completion_score: 45.99
qwen2.5-7b-instruct-turbomind:
objective:
race-high_accuracy: 84.99
ARC-c_accuracy: 92.2
BoolQ_accuracy: 86.7
triviaqa_wiki_1shot_score: 53.06
nq_open_1shot_score: 17.51
mmmlu_lite_naive_average: 54.96
IFEval_Prompt-level-strict-accuracy: 71.53
drop_accuracy: 80.07
bbh_naive_average: 68.81
GPQA_diamond_accuracy: 34.34
hellaswag_accuracy: 85.42
TheoremQA_score: 18.38
musr_average_naive_average: 43.44
korbench_single_naive_average: 39.44
ARC_Prize_Public_Evaluation_accuracy: 0
gsm8k_accuracy: 92.57
GaokaoBench_weighted_average: 80.14
math_accuracy: 73.58
cmo_fib_accuracy: 25
aime2024_accuracy: 16.67
Mathbench_naive_average: 77.33
wikibench-wiki-single_choice_cncircular_perf_4: 34.9
cmmlu_naive_average: 75.97
mmlu_naive_average: 76.01
mmlu_pro_naive_average: 56.12
openai_humaneval_humaneval_pass@1: 83.54
sanitized_mbpp_score: 74.71
humanevalx_naive_average: 48.29
ds1000_naive_average: 18.66
lcb_code_generation_pass@1: 39.5
lcb_code_execution_pass@1: 42.38
lcb_test_output_pass@1: 50.68
bigcodebench_hard_instruct_pass@1: 16.22
bigcodebench_hard_complete_pass@1: 11.49
teval_naive_average: 79.72
SciCode_sub_accuracy: 10.76
qa_dingo_cn_score: 99.01
mmlu_accuracy: 76.01
mmlu-stem_accuracy: 77.59
mmlu-social-science_accuracy: 79.02
mmlu-humanities_accuracy: 72.07
mmlu-other_accuracy: 74.86
cmmlu_accuracy: 75.97
cmmlu-stem_accuracy: 73.09
cmmlu-social-science_accuracy: 75.95
cmmlu-humanities_accuracy: 76.53
cmmlu-other_accuracy: 78.79
cmmlu-china-specific_accuracy: 73.17
mmlu_pro_accuracy: 56.12
mmlu_pro_biology_accuracy: 71.41
mmlu_pro_business_accuracy: 67.68
mmlu_pro_chemistry_accuracy: 54.59
mmlu_pro_computer_science_accuracy: 58.29
mmlu_pro_economics_accuracy: 66.82
mmlu_pro_engineering_accuracy: 42.41
mmlu_pro_health_accuracy: 55.87
mmlu_pro_history_accuracy: 46.46
mmlu_pro_law_accuracy: 28.97
mmlu_pro_math_accuracy: 73.13
mmlu_pro_philosophy_accuracy: 44.89
mmlu_pro_physics_accuracy: 58.43
mmlu_pro_psychology_accuracy: 63.16
mmlu_pro_other_accuracy: 53.57
humanevalx-python_pass@1: 50
humanevalx-cpp_pass@1: 42.07
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 53.05
humanevalx-js_pass@1: 75
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 8.18
ds1000_Tensorflow_accuracy: 17.78
ds1000_Scipy_accuracy: 15.09
ds1000_Sklearn_accuracy: 10.43
ds1000_Pytorch_accuracy: 4.41
ds1000_Matplotlib_accuracy: 60.65
mmmlu_lite_accuracy: 54.96
openai_mmmlu_lite_AR-XY_accuracy: 42.32
openai_mmmlu_lite_BN-BD_accuracy: 42.25
openai_mmmlu_lite_DE-DE_accuracy: 59.93
openai_mmmlu_lite_ES-LA_accuracy: 66.53
openai_mmmlu_lite_FR-FR_accuracy: 66.88
openai_mmmlu_lite_HI-IN_accuracy: 49.26
openai_mmmlu_lite_ID-ID_accuracy: 61.26
openai_mmmlu_lite_IT-IT_accuracy: 65.47
openai_mmmlu_lite_JA-JP_accuracy: 61.54
openai_mmmlu_lite_KO-KR_accuracy: 60.28
openai_mmmlu_lite_PT-BR_accuracy: 55.51
openai_mmmlu_lite_SW-KE_accuracy: 36.42
openai_mmmlu_lite_YO-NG_accuracy: 32.14
openai_mmmlu_lite_ZH-CN_accuracy: 69.61
college_naive_average: 44.33
high_naive_average: 59
middle_naive_average: 78
primary_naive_average: 85.67
arithmetic_naive_average: 75.67
mathbench-a (average)_naive_average: 69.27
college_knowledge_naive_average: 83.86
high_knowledge_naive_average: 80.29
middle_knowledge_naive_average: 84.26
primary_knowledge_naive_average: 93.16
mathbench-t (average)_naive_average: 85.39
internlm2_5-7b-chat-pytorch:
objective:
race-high_accuracy: 86.39
ARC-c_accuracy: 90.51
BoolQ_accuracy: 88.01
triviaqa_wiki_1shot_score: 64.77
nq_open_1shot_score: 22.71
mmmlu_lite_naive_average: 45.02
IFEval_Prompt-level-strict-accuracy: 56.56
drop_accuracy: 75.46
bbh_naive_average: 73.34
GPQA_diamond_accuracy: 32.83
hellaswag_accuracy: 94.81
TheoremQA_score: 23.88
musr_average_naive_average: 51.31
korbench_single_naive_average: 32
ARC_Prize_Public_Evaluation_accuracy: 0.01
gsm8k_accuracy: 86.96
GaokaoBench_weighted_average: 78.05
math_accuracy: 60.34
cmo_fib_accuracy: 12.98
aime2024_accuracy: 3.33
Mathbench_naive_average: 64.82
wikibench-wiki-single_choice_cncircular_perf_4: 31.7
cmmlu_naive_average: 74.24
mmlu_naive_average: 70.2
mmlu_pro_naive_average: 45.39
openai_humaneval_humaneval_pass@1: 70.12
sanitized_mbpp_score: 64.59
humanevalx_naive_average: 38.78
ds1000_naive_average: 14.19
lcb_code_generation_pass@1: 16.5
lcb_code_execution_pass@1: 33.82
lcb_test_output_pass@1: 22.62
bigcodebench_hard_instruct_pass@1: 6.08
bigcodebench_hard_complete_pass@1: 6.76
teval_naive_average: 79.73
SciCode_sub_accuracy: 3.47
qa_dingo_cn_score: 100
mmlu_accuracy: 70.2
mmlu-stem_accuracy: 67.73
mmlu-social-science_accuracy: 75.49
mmlu-humanities_accuracy: 68.56
mmlu-other_accuracy: 70.58
cmmlu_accuracy: 74.24
cmmlu-stem_accuracy: 66.7
cmmlu-social-science_accuracy: 75.88
cmmlu-humanities_accuracy: 77.56
cmmlu-other_accuracy: 77.52
cmmlu-china-specific_accuracy: 73.46
mmlu_pro_accuracy: 45.39
mmlu_pro_biology_accuracy: 65.83
mmlu_pro_business_accuracy: 51.96
mmlu_pro_chemistry_accuracy: 36.84
mmlu_pro_computer_science_accuracy: 48.29
mmlu_pro_economics_accuracy: 56.16
mmlu_pro_engineering_accuracy: 29.1
mmlu_pro_health_accuracy: 44.5
mmlu_pro_history_accuracy: 42.26
mmlu_pro_law_accuracy: 24.98
mmlu_pro_math_accuracy: 54.85
mmlu_pro_philosophy_accuracy: 39.28
mmlu_pro_physics_accuracy: 37.41
mmlu_pro_psychology_accuracy: 58.27
mmlu_pro_other_accuracy: 45.78
humanevalx-python_pass@1: 56.1
humanevalx-cpp_pass@1: 20.73
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 59.15
humanevalx-js_pass@1: 57.93
ds1000_Pandas_accuracy: 8.93
ds1000_Numpy_accuracy: 4.09
ds1000_Tensorflow_accuracy: 11.11
ds1000_Scipy_accuracy: 7.55
ds1000_Sklearn_accuracy: 7.83
ds1000_Pytorch_accuracy: 8.82
ds1000_Matplotlib_accuracy: 50.97
mmmlu_lite_accuracy: 45.02
openai_mmmlu_lite_AR-XY_accuracy: 18.6
openai_mmmlu_lite_BN-BD_accuracy: 27.58
openai_mmmlu_lite_DE-DE_accuracy: 51.23
openai_mmmlu_lite_ES-LA_accuracy: 56.63
openai_mmmlu_lite_FR-FR_accuracy: 58.11
openai_mmmlu_lite_HI-IN_accuracy: 33.82
openai_mmmlu_lite_ID-ID_accuracy: 50.39
openai_mmmlu_lite_IT-IT_accuracy: 50.39
openai_mmmlu_lite_JA-JP_accuracy: 50.95
openai_mmmlu_lite_KO-KR_accuracy: 45.05
openai_mmmlu_lite_PT-BR_accuracy: 57.89
openai_mmmlu_lite_SW-KE_accuracy: 32.14
openai_mmmlu_lite_YO-NG_accuracy: 32.14
openai_mmmlu_lite_ZH-CN_accuracy: 65.33
college_naive_average: 21
high_naive_average: 47
middle_naive_average: 59.67
primary_naive_average: 72.33
arithmetic_naive_average: 62
mathbench-a (average)_naive_average: 53.13
college_knowledge_naive_average: 68.99
high_knowledge_naive_average: 70.06
middle_knowledge_naive_average: 78.53
primary_knowledge_naive_average: 88.49
mathbench-t (average)_naive_average: 76.51
qwen2.5-7b-instruct-pytorch:
objective:
race-high_accuracy: 85.16
ARC-c_accuracy: 90.85
BoolQ_accuracy: 86.61
triviaqa_wiki_1shot_score: 52.96
nq_open_1shot_score: 17.62
mmmlu_lite_naive_average: 54.7
IFEval_Prompt-level-strict-accuracy: 71.35
drop_accuracy: 80.23
bbh_naive_average: 68.88
GPQA_diamond_accuracy: 36.36
hellaswag_accuracy: 85.49
TheoremQA_score: 18.38
musr_average_naive_average: 43.3
korbench_single_naive_average: 39.44
ARC_Prize_Public_Evaluation_accuracy: 0
gsm8k_accuracy: 91.66
GaokaoBench_weighted_average: 80.02
math_accuracy: 73.74
cmo_fib_accuracy: 22.60
aime2024_accuracy: 13.33
Mathbench_naive_average: 77.08
wikibench-wiki-single_choice_cncircular_perf_4: 34
cmmlu_naive_average: 75.9
mmlu_naive_average: 76.27
mmlu_pro_naive_average: 56.14
openai_humaneval_humaneval_pass@1: 84.76
sanitized_mbpp_score: 74.71
humanevalx_naive_average: 48.17
ds1000_naive_average: 18.57
lcb_code_generation_pass@1: 38.75
lcb_code_execution_pass@1: 42.38
lcb_test_output_pass@1: 50.45
bigcodebench_hard_instruct_pass@1: 16.89
bigcodebench_hard_complete_pass@1: 12.16
teval_naive_average: 79.46
SciCode_sub_accuracy: 10.42
qa_dingo_cn_score: 100
mmlu_accuracy: 76.27
mmlu-stem_accuracy: 77.75
mmlu-social-science_accuracy: 78.65
mmlu-humanities_accuracy: 73.12
mmlu-other_accuracy: 75.05
cmmlu_accuracy: 75.9
cmmlu-stem_accuracy: 73.41
cmmlu-social-science_accuracy: 75.97
cmmlu-humanities_accuracy: 76.42
cmmlu-other_accuracy: 78.15
cmmlu-china-specific_accuracy: 73.27
mmlu_pro_accuracy: 56.14
mmlu_pro_biology_accuracy: 72.25
mmlu_pro_business_accuracy: 66.16
mmlu_pro_chemistry_accuracy: 55.65
mmlu_pro_computer_science_accuracy: 60.24
mmlu_pro_economics_accuracy: 66.82
mmlu_pro_engineering_accuracy: 41.38
mmlu_pro_health_accuracy: 54.89
mmlu_pro_history_accuracy: 46.46
mmlu_pro_law_accuracy: 29.06
mmlu_pro_math_accuracy: 73.58
mmlu_pro_philosophy_accuracy: 44.89
mmlu_pro_physics_accuracy: 60.05
mmlu_pro_psychology_accuracy: 61.9
mmlu_pro_other_accuracy: 52.6
humanevalx-python_pass@1: 51.83
humanevalx-cpp_pass@1: 42.68
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 73.78
humanevalx-js_pass@1: 72.56
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 8.64
ds1000_Tensorflow_accuracy: 17.78
ds1000_Scipy_accuracy: 15.09
ds1000_Sklearn_accuracy: 8.7
ds1000_Pytorch_accuracy: 4.41
ds1000_Matplotlib_accuracy: 61.29
mmmlu_lite_accuracy: 54.7
openai_mmmlu_lite_AR-XY_accuracy: 42.32
openai_mmmlu_lite_BN-BD_accuracy: 42.18
openai_mmmlu_lite_DE-DE_accuracy: 60
openai_mmmlu_lite_ES-LA_accuracy: 66.18
openai_mmmlu_lite_FR-FR_accuracy: 66.88
openai_mmmlu_lite_HI-IN_accuracy: 48.63
openai_mmmlu_lite_ID-ID_accuracy: 61.26
openai_mmmlu_lite_IT-IT_accuracy: 65.26
openai_mmmlu_lite_JA-JP_accuracy: 60.7
openai_mmmlu_lite_KO-KR_accuracy: 60.63
openai_mmmlu_lite_PT-BR_accuracy: 54.46
openai_mmmlu_lite_SW-KE_accuracy: 36
openai_mmmlu_lite_YO-NG_accuracy: 31.86
openai_mmmlu_lite_ZH-CN_accuracy: 69.4
college_naive_average: 48.33
high_naive_average: 59.33
middle_naive_average: 76.67
primary_naive_average: 86.67
arithmetic_naive_average: 74.33
mathbench-a (average)_naive_average: 69.07
college_knowledge_naive_average: 83.54
high_knowledge_naive_average: 80.82
middle_knowledge_naive_average: 83.79
primary_knowledge_naive_average: 92.22
mathbench-t (average)_naive_average: 85.1
internlm3-8b-instruct-turbomind:
objective:
race-high_accuracy: 89.22
ARC-c_accuracy: 92.54
BoolQ_accuracy: 86.45
triviaqa_wiki_1shot_score: 60.72
nq_open_1shot_score: 20.25
mmmlu_lite_naive_average: 41.82
IFEval_Prompt-level-strict-accuracy: 77.45
drop_accuracy: 83.27
bbh_naive_average: 55.22
GPQA_diamond_accuracy: 37.88
hellaswag_accuracy: 91.28
TheoremQA_score: 20.12
musr_average_naive_average: 36.86
korbench_single_naive_average: 41.2
ARC_Prize_Public_Evaluation_accuracy: 0.06
gsm8k_accuracy: 91.28
GaokaoBench_weighted_average: 86.59
math_accuracy: 76.96
cmo_fib_accuracy: 38.46
aime2024_accuracy: 13.33
Mathbench_naive_average: 78.96
wikibench-wiki-single_choice_cncircular_perf_4: 37.45
cmmlu_naive_average: 83.33
mmlu_naive_average: 76.21
mmlu_pro_naive_average: 57.96
openai_humaneval_humaneval_pass@1: 81.71
sanitized_mbpp_score: 69.65
humanevalx_naive_average: 40.73
ds1000_naive_average: 27.23
lcb_code_generation_pass@1: 34.75
lcb_code_execution_pass@1: 49.9
lcb_test_output_pass@1: 48.19
bigcodebench_hard_instruct_pass@1: 13.51
bigcodebench_hard_complete_pass@1: 15.54
teval_naive_average: 82.86
SciCode_sub_accuracy: 11.11
qa_dingo_cn_score: 100
mmlu_accuracy: 76.21
mmlu-stem_accuracy: 77.7
mmlu-social-science_accuracy: 80.98
mmlu-humanities_accuracy: 70.83
mmlu-other_accuracy: 75.01
cmmlu_accuracy: 83.33
cmmlu-stem_accuracy: 79.66
cmmlu-social-science_accuracy: 83.39
cmmlu-humanities_accuracy: 84.73
cmmlu-other_accuracy: 86.2
cmmlu-china-specific_accuracy: 81.77
mmlu_pro_accuracy: 57.96
mmlu_pro_biology_accuracy: 75.45
mmlu_pro_business_accuracy: 64.64
mmlu_pro_chemistry_accuracy: 59.81
mmlu_pro_computer_science_accuracy: 60.24
mmlu_pro_economics_accuracy: 68.6
mmlu_pro_engineering_accuracy: 44.79
mmlu_pro_health_accuracy: 58.31
mmlu_pro_history_accuracy: 49.87
mmlu_pro_law_accuracy: 32.43
mmlu_pro_math_accuracy: 70.17
mmlu_pro_philosophy_accuracy: 46.89
mmlu_pro_physics_accuracy: 59.58
mmlu_pro_psychology_accuracy: 66.29
mmlu_pro_other_accuracy: 54.33
humanevalx-python_pass@1: 43.9
humanevalx-cpp_pass@1: 20.12
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 40.85
humanevalx-js_pass@1: 65.24
ds1000_Pandas_accuracy: 16.49
ds1000_Numpy_accuracy: 34.09
ds1000_Tensorflow_accuracy: 26.67
ds1000_Scipy_accuracy: 17.92
ds1000_Sklearn_accuracy: 20.87
ds1000_Pytorch_accuracy: 19.12
ds1000_Matplotlib_accuracy: 55.48
mmmlu_lite_accuracy: 41.82
openai_mmmlu_lite_AR-XY_accuracy: 32.56
openai_mmmlu_lite_BN-BD_accuracy: 4.56
openai_mmmlu_lite_DE-DE_accuracy: 24.91
openai_mmmlu_lite_ES-LA_accuracy: 51.09
openai_mmmlu_lite_FR-FR_accuracy: 61.68
openai_mmmlu_lite_HI-IN_accuracy: 24.98
openai_mmmlu_lite_ID-ID_accuracy: 44.56
openai_mmmlu_lite_IT-IT_accuracy: 52.35
openai_mmmlu_lite_JA-JP_accuracy: 51.02
openai_mmmlu_lite_KO-KR_accuracy: 47.93
openai_mmmlu_lite_PT-BR_accuracy: 53.89
openai_mmmlu_lite_SW-KE_accuracy: 33.47
openai_mmmlu_lite_YO-NG_accuracy: 33.47
openai_mmmlu_lite_ZH-CN_accuracy: 69.05
college_naive_average: 45.67
high_naive_average: 64.67
middle_naive_average: 82.33
primary_naive_average: 90.33
arithmetic_naive_average: 74
mathbench-a (average)_naive_average: 71.4
college_knowledge_naive_average: 85.28
high_knowledge_naive_average: 79.43
middle_knowledge_naive_average: 87.9
primary_knowledge_naive_average: 93.42
mathbench-t (average)_naive_average: 86.51
internlm3-8b-instruct-pytorch:
objective:
race-high_accuracy: 89.02
ARC-c_accuracy: 93.56
BoolQ_accuracy: 86.67
triviaqa_wiki_1shot_score: 60.54
nq_open_1shot_score: 20.3
mmmlu_lite_naive_average: 42.6
IFEval_Prompt-level-strict-accuracy: 79.11
drop_accuracy: 83.32
bbh_naive_average: 54.76
GPQA_diamond_accuracy: 33.84
hellaswag_accuracy: 91.31
TheoremQA_score: 18
musr_average_naive_average: 36.62
korbench_single_naive_average: 41.84
ARC_Prize_Public_Evaluation_accuracy: 0.06
gsm8k_accuracy: 90.67
GaokaoBench_weighted_average: 86.27
math_accuracy: 76.68
cmo_fib_accuracy: 33.65
aime2024_accuracy: 10
Mathbench_naive_average: 78.92
wikibench-wiki-single_choice_cncircular_perf_4: 37.35
cmmlu_naive_average: 83.11
mmlu_naive_average: 76.23
mmlu_pro_naive_average: 58.16
openai_humaneval_humaneval_pass@1: 82.32
sanitized_mbpp_score: 70.04
humanevalx_naive_average: 25.49
ds1000_naive_average: 27.84
lcb_code_generation_pass@1: 34.5
lcb_code_execution_pass@1: 48.02
lcb_test_output_pass@1: 47.74
bigcodebench_hard_instruct_pass@1: 12.84
bigcodebench_hard_complete_pass@1: 15.54
teval_naive_average: 82.86
SciCode_sub_accuracy: 9.38
qa_dingo_cn_score: 100
mmlu_accuracy: 76.23
mmlu-stem_accuracy: 78.08
mmlu-social-science_accuracy: 80.31
mmlu-humanities_accuracy: 71.38
mmlu-other_accuracy: 74.63
cmmlu_accuracy: 83.11
cmmlu-stem_accuracy: 79.42
cmmlu-social-science_accuracy: 83.34
cmmlu-humanities_accuracy: 83.95
cmmlu-other_accuracy: 86.22
cmmlu-china-specific_accuracy: 81.5
mmlu_pro_accuracy: 58.16
mmlu_pro_biology_accuracy: 74.62
mmlu_pro_business_accuracy: 65.02
mmlu_pro_chemistry_accuracy: 60.69
mmlu_pro_computer_science_accuracy: 61.46
mmlu_pro_economics_accuracy: 68.25
mmlu_pro_engineering_accuracy: 45.3
mmlu_pro_health_accuracy: 60.15
mmlu_pro_history_accuracy: 50.66
mmlu_pro_law_accuracy: 31.7
mmlu_pro_math_accuracy: 70.32
mmlu_pro_philosophy_accuracy: 47.7
mmlu_pro_physics_accuracy: 59.51
mmlu_pro_psychology_accuracy: 65.41
mmlu_pro_other_accuracy: 53.46
humanevalx-python_pass@1: 42.68
humanevalx-cpp_pass@1: 19.51
humanevalx-go_pass@1: 0
humanevalx-java_pass@1: 0.00
humanevalx-js_pass@1: 64.02
ds1000_Pandas_accuracy: 14.09
ds1000_Numpy_accuracy: 35
ds1000_Tensorflow_accuracy: 24.44
ds1000_Scipy_accuracy: 20.75
ds1000_Sklearn_accuracy: 21.74
ds1000_Pytorch_accuracy: 22.06
ds1000_Matplotlib_accuracy: 56.77
mmmlu_lite_accuracy: 42.6
openai_mmmlu_lite_AR-XY_accuracy: 32.84
openai_mmmlu_lite_BN-BD_accuracy: 10.46
openai_mmmlu_lite_DE-DE_accuracy: 24.56
openai_mmmlu_lite_ES-LA_accuracy: 50.95
openai_mmmlu_lite_FR-FR_accuracy: 61.05
openai_mmmlu_lite_HI-IN_accuracy: 30.6
openai_mmmlu_lite_ID-ID_accuracy: 45.89
openai_mmmlu_lite_IT-IT_accuracy: 51.79
openai_mmmlu_lite_JA-JP_accuracy: 51.65
openai_mmmlu_lite_KO-KR_accuracy: 48.77
openai_mmmlu_lite_PT-BR_accuracy: 52.7
openai_mmmlu_lite_SW-KE_accuracy: 32.91
openai_mmmlu_lite_YO-NG_accuracy: 32.84
openai_mmmlu_lite_ZH-CN_accuracy: 69.33
college_naive_average: 47
high_naive_average: 66.67
middle_naive_average: 81.67
primary_naive_average: 89.33
arithmetic_naive_average: 73.67
mathbench-a (average)_naive_average: 71.67
college_knowledge_naive_average: 82.91
high_knowledge_naive_average: 79.86
middle_knowledge_naive_average: 88.92
primary_knowledge_naive_average: 92.96
mathbench-t (average)_naive_average: 86.16

View File

@ -0,0 +1,432 @@
chat:
glm-4-9b-chat-hf:
gsm8k_accuracy: 56.25
race-high_accuracy: 84.38
glm-4-9b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 90.62
glm-4-9b-chat-vllm:
gsm8k_accuracy: 71.88
race-high_accuracy: 90.62
deepseek-7b-chat-hf:
gsm8k_accuracy: 46.88
race-high_accuracy: 81.25
deepseek-r1-distill-llama-8b-turbomind:
gsm8k_accuracy: 34.38
race-high_accuracy: 81.25
deepseek-r1-distill-qwen-1_5b-turbomind:
gsm8k_accuracy: 28.12
race-high_accuracy: 53.12
deepseek-7b-chat-vllm:
gsm8k_accuracy: 56.25
race-high_accuracy: 78.12
gemma2-2b-it-hf:
gsm8k_accuracy: 50
race-high_accuracy: 75
gemma2-9b-it-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 84.38
gemma-2b-it-hf:
gsm8k_accuracy: 3.12
race-high_accuracy: 40.62
gemma-7b-it-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 68.75
gemma-2-9b-it-turbomind:
gsm8k_accuracy: 68.75
race-high_accuracy: 84.38
gemma-2-27b-it-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 93.75
gemma-7b-it-vllm:
gsm8k_accuracy: 28.12
race-high_accuracy: 68.75
internlm2_5-7b-chat-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
internlm3-8b-instruct-hf:
gsm8k_accuracy: 65.62
race-high_accuracy: 87.5
internlm2_5-7b-chat-turbomind:
gsm8k_accuracy: 81.25
race-high_accuracy: 90.62
internlm2-chat-1.8b-turbomind:
gsm8k_accuracy: 25.00
race-high_accuracy: 84.38
internlm2-chat-1.8b-sft-turbomind:
gsm8k_accuracy: 34.38
race-high_accuracy: 84.38
internlm2-chat-7b-lmdeploy:
gsm8k_accuracy: 59.38
race-high_accuracy: 87.50
internlm2-chat-7b-sft-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 87.50
internlm3-8b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 87.5
internlm2-chat-7b-vllm:
gsm8k_accuracy: 53.12
race-high_accuracy: 87.50
llama-3_1-8b-instruct-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
llama-3_2-3b-instruct-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 81.25
llama-3-8b-instruct-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 87.5
llama-2-7b-chat-turbomind:
gsm8k_accuracy: 18.75
race-high_accuracy: 46.88
llama-3_1-8b-instruct-turbomind:
gsm8k_accuracy: 84.38
race-high_accuracy: 90.62
llama-3_2-3b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 81.25
llama-3-8b-instruct-turbomind:
gsm8k_accuracy: 65.62
race-high_accuracy: 84.38
mistral-7b-instruct-v0.2-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 75
mistral-7b-instruct-v0.3-hf:
gsm8k_accuracy: 40.62
race-high_accuracy: 75
mistral-nemo-instruct-2407-hf:
gsm8k_accuracy: 75
race-high_accuracy: 81.25
mistral-nemo-instruct-2407-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 75
mistral-7b-instruct-v0.1-vllm:
gsm8k_accuracy: 34.38
race-high_accuracy: 65.62
mistral-7b-instruct-v0.2-vllm:
gsm8k_accuracy: 28.12
race-high_accuracy: 78.12
qwen2.5-0.5b-instruct-hf:
gsm8k_accuracy: 34.38
race-high_accuracy: 46.88
qwen2.5-3b-instruct-hf :
gsm8k_accuracy: 53.12
race-high_accuracy: 90.62
qwen2.5-0.5b-instruct-turbomind:
gsm8k_accuracy: 28.12
race-high_accuracy: 43.75
qwen2.5-3b-instruct-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 90.62
qwen1.5-0.5b-chat-hf:
gsm8k_accuracy: 0
race-high_accuracy: 53.12
qwen2-1.5b-instruct-hf:
gsm8k_accuracy: 62.5
race-high_accuracy: 84.38
qwen2-7b-instruct-hf:
gsm8k_accuracy: 68.75
race-high_accuracy: 90.62
qwen2-1.5b-instruct-turbomind:
gsm8k_accuracy: 56.25
race-high_accuracy: 84.38
qwen2-7b-instruct-turbomind:
gsm8k_accuracy: 75.00
race-high_accuracy: 87.50
qwen1.5-0.5b-chat-vllm:
gsm8k_accuracy: 6.25
race-high_accuracy: 53.12
yi-1.5-6b-chat-hf:
gsm8k_accuracy: 65.62
race-high_accuracy: 84.38
yi-1.5-9b-chat-hf:
gsm8k_accuracy: 75
race-high_accuracy: 93.75
yi-1.5-6b-chat-turbomind:
gsm8k_accuracy: 59.38
race-high_accuracy: 84.38
yi-1.5-9b-chat-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 93.75
deepseek-v2_lite-chat-turbomind:
gsm8k_accuracy: 43.75
race-high_accuracy: 71.88
gemma2-27b-it-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 93.75
internlm2_5-20b-chat-hf:
gsm8k_accuracy: 84.38
race-high_accuracy: 87.5
internlm2_5-20b-chat-turbomind:
gsm8k_accuracy: 87.50
race-high_accuracy: 87.5
mistral-small-instruct-2409-hf:
gsm8k_accuracy: 81.25
race-high_accuracy: 87.50
mistral-small-instruct-2409-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 87.50
phi-4:
gsm8k_accuracy: 81.25
race-high_accuracy: 87.50
qwen2.5-14b-instruct-hf:
gsm8k_accuracy: 71.88
race-high_accuracy: 96.88
qwen2.5-14b-instruct-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 96.88
yi-1.5-34b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 93.75
deepseek-67b-chat-turbomind:
gsm8k_accuracy: 71.88
race-high_accuracy: 75.00
deepseek-r1-distill-qwen-32b-turbomind:
gsm8k_accuracy: 31.25
race-high_accuracy: 90.62
llama-3_3-70b-instruct-turbomind:
gsm8k_accuracy: 93.75
race-high_accuracy: 87.5
mixtral-large-instruct-2411-turbomind:
gsm8k_accuracy: 87.50
race-high_accuracy: 93.75
nvidia-3_1-Nemotron-70b-instruct-HF-turbomind:
gsm8k_accuracy: 90.62
race-high_accuracy: 53.12
qwen2.5-72b-instruct-turbomind:
gsm8k_accuracy: 78.12
race-high_accuracy: 90.62
deepseek-r1-distill-llama-70b-turbomind:
gsm8k_accuracy: 50.00
race-high_accuracy: 87.50
deepseek-v2_5-1210-turbomind:
gsm8k_accuracy: 90.62
race-high_accuracy: 84.38
mixtral-8x22b-instruct-v0.1-turbomind:
gsm8k_accuracy: 75.00
race-high_accuracy: 78.12
mixtral-8x22b-instruct-v0.1-vllm:
gsm8k_accuracy: 78.12
race-high_accuracy: 78.12
base:
glm-4-9b-turbomind:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 28.12
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
deepseek-7b-base-hf:
gsm8k_accuracy: 25
GPQA_diamond_accuracy: 0
race-high_accuracy: 46.88
winogrande_accuracy: 71.88
deepseek-7b-base-turbomind:
gsm8k_accuracy: 18.75
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 50.00
winogrande_accuracy: 84.38
deepseek-moe-16b-base-vllm:
gsm8k_accuracy: 25.00
GPQA_diamond_accuracy: 0
race-high_accuracy: 25
winogrande_accuracy: 68.75
gemma2-2b-hf:
gsm8k_accuracy: 31.25
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 56.25
winogrande_accuracy: 75.00
gemma2-9b-hf:
gsm8k_accuracy: 75.00
GPQA_diamond_accuracy: 0
race-high_accuracy: 84.38
winogrande_accuracy: 81.25
gemma-2b-hf:
gsm8k_accuracy: 21.88
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 21.88
winogrande_accuracy: 53.12
gemma-7b-hf:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 65.62
winogrande_accuracy: 71.88
gemma-2-9b-turbomind:
gsm8k_accuracy: 68.75
GPQA_diamond_accuracy: 0
race-high_accuracy: 84.38
winogrande_accuracy: 81.25
gemma-2b-vllm:
gsm8k_accuracy: 15.62
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 28.12
winogrande_accuracy: 68.75
gemma-7b-vllm:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 81.25
winogrande_accuracy: 81.25
internlm2_5-7b-hf:
gsm8k_accuracy: 37.5
GPQA_diamond_accuracy: 25
race-high_accuracy: 93.75
winogrande_accuracy: 71.88
internlm2-7b-hf:
gsm8k_accuracy: 53.12
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 62.5
winogrande_accuracy: 78.12
internlm2-1.8b-turbomind:
gsm8k_accuracy: 12.50
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 71.88
winogrande_accuracy: 75
internlm2_5-7b-turbomind:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 87.5
internlm2-7b-turbomind:
gsm8k_accuracy: 53.12
GPQA_diamond_accuracy: 25.00
race-high_accuracy: 78.12
winogrande_accuracy: 71.88
internlm2-base-7b-turbomind:
gsm8k_accuracy: 25.00
GPQA_diamond_accuracy: 34.38
race-high_accuracy: 71.88
winogrande_accuracy: 62.50
llama-2-7b-hf:
gsm8k_accuracy: 21.88
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 40.62
winogrande_accuracy: 71.88
llama-3_1-8b-hf:
gsm8k_accuracy: 78.12
GPQA_diamond_accuracy: 25
race-high_accuracy: 90.62
winogrande_accuracy: 62.5
llama-3-8b-hf:
gsm8k_accuracy: 46.88
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 65.62
winogrande_accuracy: 65.62
llama-3.1-8b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 78.12
winogrande_accuracy: 78.12
llama-3-8b-turbomind:
gsm8k_accuracy: 46.88
GPQA_diamond_accuracy: 12.50
race-high_accuracy: 65.62
winogrande_accuracy: 81.25
mistral-7b-v0.3-hf:
gsm8k_accuracy: 31.25
GPQA_diamond_accuracy: 6.25
race-high_accuracy: 62.5
winogrande_accuracy: 59.38
qwen2.5-7b-hf:
gsm8k_accuracy: 81.25
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 87.5
winogrande_accuracy: 71.88
qwen2.5-1.5b-turbomind:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 78.12
winogrande_accuracy: 71.88
qwen2.5-7b-turbomind:
gsm8k_accuracy: 78.12
GPQA_diamond_accuracy: 21.88
race-high_accuracy: 87.5
winogrande_accuracy: 75.00
qwen1.5-moe-a2.7b-hf:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 84.38
winogrande_accuracy: 75
qwen2-0.5b-hf:
gsm8k_accuracy: 25
GPQA_diamond_accuracy: 0
race-high_accuracy: 40.62
winogrande_accuracy: 62.5
qwen2-1.5b-hf:
gsm8k_accuracy: 59.38
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 81.25
winogrande_accuracy: 62.5
qwen2-7b-hf:
gsm8k_accuracy: 68.75
GPQA_diamond_accuracy: 9.38
race-high_accuracy: 87.5
winogrande_accuracy: 68.75
qwen2-1.5b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 12.50
race-high_accuracy: 81.25
winogrande_accuracy: 75
qwen2-7b-turbomind:
gsm8k_accuracy: 65.62
GPQA_diamond_accuracy: 12.5
race-high_accuracy: 87.5
winogrande_accuracy: 75
qwen1.5-0.5b-vllm:
gsm8k_accuracy: 9.38
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 56.25
winogrande_accuracy: 59.38
yi-1.5-6b-hf:
gsm8k_accuracy: 62.5
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 87.5
winogrande_accuracy: 62.5
yi-1.5-9b-hf:
gsm8k_accuracy: 75
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 87.5
winogrande_accuracy: 59.38
yi-1.5-9b-turbomind:
gsm8k_accuracy: 75.00
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 87.5
winogrande_accuracy: 65.62
internlm2-20b-turbomind:
gsm8k_accuracy: 71.88
GPQA_diamond_accuracy: 18.75
race-high_accuracy: 68.75
winogrande_accuracy: 81.25
qwen2.5-14b-hf:
gsm8k_accuracy: 75
GPQA_diamond_accuracy: 37.5
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
qwen2.5-32b-hf:
gsm8k_accuracy: 87.5
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 78.12
qwen2.5-32b-turbomind:
gsm8k_accuracy: 90.62
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 93.75
winogrande_accuracy: 81.25
deepseek-67b-base-turbomind:
gsm8k_accuracy: 62.50
GPQA_diamond_accuracy: 31.25
race-high_accuracy: 78.12
winogrande_accuracy: 81.25
llama-3-70b-turbomind:
gsm8k_accuracy: 56.25
GPQA_diamond_accuracy: 15.62
race-high_accuracy: 93.75
winogrande_accuracy: 84.38
qwen2.5-72b-turbomind:
gsm8k_accuracy: 84.38
GPQA_diamond_accuracy: 40.62
race-high_accuracy: 93.75
winogrande_accuracy: 87.5
deepseek-v2-turbomind:
gsm8k_accuracy: 65.62
GPQA_diamond_accuracy: 3.12
race-high_accuracy: 93.75
winogrande_accuracy: 81.25

View File

@ -13,32 +13,57 @@ on:
description: 'Set branch or tag or commit id. Default is "main"'
type: string
default: 'main'
regression_func:
build_lmdeploy:
required: false
description: 'whether to build lmdeploy'
type: boolean
default: false
repo_org_lmdeploy:
required: false
description: 'Tested repository organization name. Default is internlm/lmdeploy'
type: string
default: 'InternLM/lmdeploy'
repo_ref_lmdeploy:
required: false
description: 'Set branch or tag or commit id. Default is "main"'
type: string
default: 'main'
regression_func_volc:
required: true
description: 'regression functions'
type: string
default: "['chat','base','cmd']"
default: "['chat_models','base_models', 'chat_obj_fullbench', 'base_fullbench']"
regression_func_local:
required: true
description: 'regression functions'
type: string
default: "['cmd', 'api', 'chat_sub_fullbench']"
fullbench_eval:
required: true
description: 'fullbench volc functions'
type: string
default: "['base_objective','chat_objective','chat_subjective','base_long_context','chat_long_context']"
schedule:
- cron: '56 16 * * *'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
- cron: '15 14 * * 0,3'
env:
CONDA_ENV: opencompass_regression
PIP_CACHE_PATH: /cpfs01/user/qa-llm-cicd/.cache/pip
HF_CACHE_PATH: /cpfs01/shared/public/public_hdd/llmeval/model_weights/hf_hub
HUGGINGFACE_HUB_CACHE: /cpfs01/shared/public/public_hdd/llmeval/model_weights/hf_hub
HF_HUB_CACHE: /cpfs01/shared/public/public_hdd/llmeval/model_weights/hf_hub
DATEASET_CACHE_PATH: /cpfs01/shared/public/public_hdd/llmeval/llm-evaluation-datasets
HF_DATASETS_OFFLINE: 1
HF_EVALUATE_OFFLINE: 1
TRANSFORMERS_OFFLINE: 1
VLLM_USE_MODELSCOPE: false
LMDEPLOY_USE_MODELSCOPE: false
HF_HUB_OFFLINE: 1
TRITON_PTXAS_PATH: /usr/local/cuda/bin/ptxas
OUTPUT_FOLDER: cuda12.1_dist_${{ github.run_id }}
CONDA_PATH: ${{ secrets.WORKSPACE_PREFIX }}/miniconda3
PIP_CACHE_PATH: ${{ secrets.WORKSPACE_PREFIX }}/.cache/pip
REPORT_ROOT: ${{ secrets.WORKSPACE_PREFIX }}/eval_report/regression
COMPASS_DATA_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/compass_data_cache
HUGGINGFACE_HUB_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/models/opencompass_hf_hub
HF_HUB_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/models/opencompass_hf_hub
HF_DATASETS_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/hf_datasets_cache
HF_ENDPOINT: https://hf-mirror.com
CONDA_ENV: regression_test
export VLLM_WORKER_MULTIPROC_METHOD: spawn
jobs:
build-pypi:
@ -48,10 +73,10 @@ jobs:
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: Set up Python 3.x
uses: actions/setup-python@v2
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: 3.x
python-version: '3.10'
- name: Build lagent
run: |
pip install wheel setuptools
@ -64,16 +89,46 @@ jobs:
retention-days: 1
name: my-artifact-${{ github.run_id }}
daily_run_test:
if: ${{!cancelled()}}
needs: build-pypi
build-pypi-lmdeploy:
if: ${{!cancelled() && (github.event_name == 'schedule' || inputs.build_lmdeploy)}}
strategy:
fail-fast: false
matrix:
cuda_env: [dsw_cu11, dsw_cu12]
runs-on: ${{ matrix.cuda_env }}
environment: 'prod'
timeout-minutes: 600 #10hours
pyver: [py310]
runs-on: ubuntu-latest
env:
PYTHON_VERSION: ${{ matrix.pyver }}
PLAT_NAME: manylinux2014_x86_64
DOCKER_TAG: cuda12.1
steps:
- name: Checkout repository
uses: actions/checkout@v3
with:
repository: ${{ github.event.inputs.repo_org_lmdeploy || 'InternLM/lmdeploy' }}
ref: ${{github.event.inputs.repo_ref_lmdeploy || 'main'}}
- name: Build
run: |
echo ${PYTHON_VERSION}
echo ${PLAT_NAME}
echo ${DOCKER_TAG}
echo ${OUTPUT_FOLDER}
echo ${GITHUB_RUN_ID}
# remove -it
sed -i 's/docker run --rm -it/docker run --rm/g' builder/manywheel/build_wheel.sh
bash builder/manywheel/build_wheel.sh ${PYTHON_VERSION} ${PLAT_NAME} ${DOCKER_TAG} ${OUTPUT_FOLDER}
- name: Upload Artifacts
uses: actions/upload-artifact@v4
with:
if-no-files-found: error
path: builder/manywheel/${{ env.OUTPUT_FOLDER }}
retention-days: 1
name: my-artifact-${{ github.run_id }}-${{ matrix.pyver }}
prepare_env:
if: ${{!cancelled()}}
needs: ['build-pypi', 'build-pypi-lmdeploy']
runs-on: volc_cu12
timeout-minutes: 120 #2hours
steps:
- name: Clone repository
uses: actions/checkout@v2
@ -84,94 +139,210 @@ jobs:
uses: actions/download-artifact@v4
with:
name: my-artifact-${{ github.run_id }}
- name: Prepare - create conda env and install torch - cu11
if: ${{matrix.cuda_env == 'dsw_cu11'}}
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda create -y --name ${{env.CONDA_ENV}}_${{ matrix.cuda_env }} python=3.10
conda activate ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
pip install -r /cpfs01/shared/public/qa-llm-cicd/requirements-cu11.txt --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass*.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip install /cpfs01/user/qa-llm-cicd/packages/lmdeploy-0.6.1+cu118-cp310-cp310-manylinux2014_x86_64.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip install /cpfs01/user/qa-llm-cicd/packages/vllm-0.6.1.post1+cu118-cp310-cp310-manylinux1_x86_64.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip uninstall torch torchvision torchaudio -y
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --cache-dir ${{env.PIP_CACHE_PATH}} --index-url https://download.pytorch.org/whl/cu118
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install /cpfs01/user/qa-llm-cicd/packages/flash_attn-2.6.3+cu118torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install /cpfs01/user/qa-llm-cicd/packages/xformers-0.0.27.post2+cu118-cp310-cp310-manylinux2014_x86_64.whl --cache-dir ${{env.PIP_CACHE_PATH}}
conda info --envs
pip list
- name: Prepare - create conda env and install torch - cu12
if: ${{matrix.cuda_env == 'dsw_cu12'}}
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda create -y --name ${{env.CONDA_ENV}}_${{ matrix.cuda_env }} python=3.10
conda activate ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
pip install -r /cpfs01/shared/public/qa-llm-cicd/requirements-cu12.txt --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass*.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[lmdeploy] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[vllm] --cache-dir ${{env.PIP_CACHE_PATH}}
pip uninstall torch torchvision torchaudio -y
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --cache-dir ${{env.PIP_CACHE_PATH}}
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install /cpfs01/user/qa-llm-cicd/packages/flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install /cpfs01/user/qa-llm-cicd/packages/xformers-0.0.27.post2-cp310-cp310-manylinux2014_x86_64.whl --cache-dir ${{env.PIP_CACHE_PATH}}
conda info --envs
pip list
- name: Prepare - prepare data and hf model
run: |
ln -s ${{env.DATEASET_CACHE_PATH}} data
rm -rf ~/.cache/huggingface/hub -f && mkdir ~/.cache -p && mkdir ~/.cache/huggingface -p
ln -s ${{env.HF_CACHE_PATH}} ~/.cache/huggingface/hub
- name: Run command testcase
if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'cmd')
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
conda info --envs
export from_tf=TRUE
python tools/list_configs.py internlm2_5 mmlu
opencompass --models hf_internlm2_5_7b --datasets race_ppl --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd1_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd1_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m case1 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --models hf_internlm2_5_7b_chat --datasets race_gen -a lmdeploy --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd2_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd2_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m case2 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_ppl --hf-type base --hf-path internlm/internlm2_5-7b --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd3_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd3_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m case3 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_gen --hf-type chat --hf-path internlm/internlm2_5-7b-chat --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd4_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/cmd4_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m case4 -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run chat model test
if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'chat')
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
conda info --envs
sed -i 's/judgemodel/'$(tail -n 1 /cpfs01/shared/public/llmeval/share_info/compassjuder_ip.txt)'/g' .github/scripts/eval_regression_chat.py
opencompass .github/scripts/eval_regression_chat.py --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/chat_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/chat_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m chat -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run base model test
if: github.event_name == 'schedule' || contains(fromJSON(github.event.inputs.regression_func), 'base')
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
conda info --envs
opencompass .github/scripts/eval_regression_base.py --work-dir /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/base_${{ matrix.cuda_env }} --reuse --max-num-workers 2
rm regression_result_daily -f && ln -s /cpfs01/user/qa-llm-cicd/report/${{ github.run_id }}/base_${{ matrix.cuda_env }}/*/summary regression_result_daily
python -m pytest -m base -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Remove Conda Env
if: always()
run: |
rm -rf regression_result_daily
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda env remove -y --name ${{env.CONDA_ENV}}_${{ matrix.cuda_env }}
. ${{ secrets.WORKSPACE_PREFIX }}/miniconda3/bin/activate
conda env remove -y --name ${{env.CONDA_ENV}}
conda info --envs
- name: Prepare - create conda env and install torch - cu12
uses: nick-fields/retry@v3
with:
max_attempts: 3
timeout_minutes: 120
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda create -y --name ${{env.CONDA_ENV}} python=3.10
conda activate ${{env.CONDA_ENV}}
pip install -r ${{ secrets.WORKSPACE_PREFIX }}/config/requirements.txt --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass*.whl --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[lmdeploy] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[vllm] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[full] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install opencompass[api] --cache-dir ${{env.PIP_CACHE_PATH}}
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --cache-dir ${{env.PIP_CACHE_PATH}}
FLASH_ATTENTION_FORCE_BUILD=TRUE pip install ${{ secrets.WORKSPACE_PREFIX }}/packages/flash_attn-2.7.0.post2+cu12torch2.5cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install xformers --index-url https://download.pytorch.org/whl/cu121 --cache-dir ${{env.PIP_CACHE_PATH}}
cp -r /root/nltk_data ${{env.CONDA_PATH}}/envs/${{env.CONDA_ENV}}/nltk_data
- name: Prepare - reinstall lmdeploy - cu12
if: ${{github.event_name == 'schedule' || inputs.build_lmdeploy}}
uses: actions/download-artifact@v4
with:
name: my-artifact-${{ github.run_id }}-py310
- name: Prepare - reinstall lmdeploy - cu12
if: ${{github.event_name == 'schedule' || inputs.build_lmdeploy}}
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
pip uninstall -y lmdeploy
pip install lmdeploy-*.whl --no-deps
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
daily_run_test_volc:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
regression_func: ${{fromJSON(github.event.inputs.regression_func_volc || '["chat_models","base_models","chat_obj_fullbench","base_fullbench"]')}}
runs-on: volc_cu12_daily
timeout-minutes: 180 #3hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: modify config
if: matrix.regression_func != 'chat_sub_fullbench'
run: |
cp -r ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/configs_cluster/volc.py .
cat ${{ secrets.WORKSPACE_PREFIX }}/config/test_config.txt >> .github/scripts/eval_regression_${{matrix.regression_func}}.py
- name: Run test
uses: nick-fields/retry@v3
with:
max_attempts: 1
timeout_minutes: 180
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
opencompass .github/scripts/eval_regression_${{matrix.regression_func}}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}} --reuse --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}}/*/summary regression_result_daily
python -m pytest -m ${{matrix.regression_func}} -s -v --color=yes .github/scripts/oc_score_assert.py
daily_run_test_local:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
regression_func: ${{fromJSON(github.event.inputs.regression_func_local || '["cmd","api","chat_sub_fullbench"]')}}
runs-on: volc_cu12_local
timeout-minutes: 480 #6hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: modify config
if: matrix.regression_func == 'chat_sub_fullbench'
run: |
cp -r ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/configs_cluster/volc.py .
cat ${{ secrets.WORKSPACE_PREFIX }}/config/test_config_sub.txt >> .github/scripts/eval_regression_${{matrix.regression_func}}.py
- name: Run command testcase
if: matrix.regression_func == 'cmd'
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
python tools/list_configs.py internlm2_5 mmlu
opencompass --models hf_internlm2_5_7b --datasets race_ppl demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd1 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd1/*/summary regression_result_daily
python -m pytest -m case1 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --models hf_internlm2_5_7b_chat hf_internlm3_8b_instruct --datasets race_gen demo_gsm8k_chat_gen -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd2 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd2/*/summary regression_result_daily
python -m pytest -m case2 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_ppl demo_gsm8k_chat_gen --hf-type base --hf-path internlm/internlm2_5-7b --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd3 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd3/*/summary regression_result_daily
python -m pytest -m case3 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_gen demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm3-8b-instruct -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd4 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd4/*/summary regression_result_daily
python -m pytest -m case4 -s -v --color=yes .github/scripts/oc_score_assert.py
opencompass --datasets race_gen demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm3-8b-instruct -a vllm --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd5 --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/cmd5/*/summary regression_result_daily
python -m pytest -m case5 -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run model test - api
if: matrix.regression_func == 'api'
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
lmdeploy serve api_server internlm/internlm3-8b-instruct --max-batch-size 256 --model-name internlm3 > ${{env.REPORT_ROOT}}/${{ github.run_id }}/restful.log 2>&1 &
echo "restful_pid=$!" >> "$GITHUB_ENV"
sleep 180s
env | grep PROXY
env | grep proxy
unset HTTP_PROXY;unset HTTPS_PROXY;unset http_proxy;unset https_proxy;
opencompass .github/scripts/eval_regression_api.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/api --reuse --max-num-workers 2 --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/api/*/summary regression_result_daily
python -m pytest -m api -s -v --color=yes .github/scripts/oc_score_assert.py
- name: Run model test - api kill
if: always() && matrix.regression_func == 'api'
run: |
kill -15 "$restful_pid"
- name: Run testcase
if: matrix.regression_func == 'chat_sub_fullbench'
env:
COMPASS_DATA_CACHE: ${{ secrets.SHARESPACE_PREFIX }}/datasets/compass_data_cache_subset
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
opencompass .github/scripts/eval_regression_${{matrix.regression_func}}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}} --reuse --dump-eval-details
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{matrix.regression_func}}/*/summary regression_result_daily
python -m pytest -m ${{matrix.regression_func}} -s -v --color=yes .github/scripts/oc_score_assert.py
fullbench_run_test:
if: ${{!cancelled() && contains(needs.prepare_env.result, 'success')}}
needs: prepare_env
strategy:
fail-fast: false
matrix:
function_type: ${{fromJSON(github.event.inputs.fullbench_eval || '["base_objective","chat_objective","chat_subjective","base_long_context","chat_long_context"]')}}
runs-on: volc_cu12
timeout-minutes: 480 #6hours
steps:
- name: Clone repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: conda env
run: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
- name: Run testcase
uses: nick-fields/retry@v3
with:
max_attempts: 1
timeout_minutes: 480
command: |
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
export from_tf=TRUE
opencompass ${{ secrets.WORKSPACE_PREFIX }}/ocplayground/template/regression/eval_${{ matrix.function_type }}.py --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{ matrix.function_type }} --reuse
rm regression_result_daily -f && ln -s ${{env.REPORT_ROOT}}/${{ github.run_id }}/${{ matrix.function_type }}/*/summary regression_result_daily
python -m pytest -m ${{ matrix.function_type }} -s -v --color=yes .github/scripts/oc_score_assert.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'develop' || github.ref_name == 'main') }}
needs: [daily_run_test]
environment: 'prod'
if: ${{ always() && github.event_name == 'schedule' && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'develop' || github.ref_name == 'main') }}
needs: [daily_run_test_volc, daily_run_test_local, fullbench_run_test]
timeout-minutes: 5
runs-on: self-hosted
steps:

View File

@ -17,7 +17,7 @@ jobs:
python-version: '3.10'
- name: Install pre-commit hook
run: |
pip install pre-commit==3.8.0 mmengine
pip install pre-commit==3.8.0 mmengine==0.10.5
pre-commit install
- name: Linting
run: pre-commit run --all-files

View File

@ -8,106 +8,98 @@ on:
- 'docs/**'
- 'configs/**'
- 'tools/**'
workflow_dispatch:
inputs:
repo_org:
required: false
description: 'Tested repository organization name. Default is open-compass/opencompass'
type: string
default: 'open-compass/opencompass'
repo_ref:
required: false
description: 'Set branch or tag or commit id. Default is "main"'
type: string
default: 'main'
schedule:
- cron: '56 22 * * *'
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
env:
CONDA_ENV: opencompass_
USERSPACE_PREFIX: /cpfs01/user/qa-llm-cicd
HF_CACHE_PATH: /cpfs01/shared/public/public_hdd/llmeval/model_weights/hf_hub
CONDA_ENV: pr_test
HF_DATASETS_OFFLINE: 1
HF_EVALUATE_OFFLINE: 1
TRANSFORMERS_OFFLINE: 1
HF_HUB_OFFLINE: 1
VLLM_USE_MODELSCOPE: false
LMDEPLOY_USE_MODELSCOPE: false
HF_HUB_OFFLINE: 1
CONDA_PATH: /fs-computility/llm/qa-llm-cicd/miniconda3
PIP_CACHE_PATH: /fs-computility/llm/qa-llm-cicd/.cache/pip
REPORT_ROOT: /fs-computility/llm/qa-llm-cicd/eval_report/prtest
COMPASS_DATA_CACHE: /fs-computility/llm/shared/llmeval/datasets/compass_data_cache
HUGGINGFACE_HUB_CACHE: /fs-computility/llm/shared/llmeval/models/opencompass_hf_hub
HF_HUB_CACHE: /fs-computility/llm/shared/llmeval/models/opencompass_hf_hub
jobs:
pr_run_test:
runs-on: self-hosted
runs-on: volc_cu12_local
environment: 'prod'
timeout-minutes: 30
steps:
- name: Checkout repository
uses: actions/checkout@v2
with:
repository: ${{ github.event.inputs.repo_org || 'open-compass/opencompass' }}
ref: ${{github.event.inputs.repo_ref || 'main'}}
- name: Prepare - Install opencompass
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}${{ runner.name }}
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
python3 -m pip uninstall opencompass -y
python3 -m pip install -e . --cache-dir ${{env.USERSPACE_PREFIX}}/.cache/pip
python3 -m pip install -e ".[full]" --cache-dir ${{env.PIP_CACHE_PATH}}
conda info --envs
- name: Prepare - prepare data and hf model
- name: conda env
run: |
cp -r ${{env.USERSPACE_PREFIX}}/data .
rm -rf ~/.cache/huggingface/hub -f && mkdir ~/.cache -p && mkdir ~/.cache/huggingface -p
ln -s ${{env.HF_CACHE_PATH}} ~/.cache/huggingface/hub
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
pip list
lmdeploy check_env
- name: Run test
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}${{ runner.name }}
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
conda info --envs
rm -rf regression_result
opencompass --models hf_internlm2_5_20b_chat --datasets demo_gsm8k_chat_gen --work-dir regression_result1 --debug
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen --work-dir regression_result2 --debug --max-num-workers 2
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy --work-dir regression_result3 --debug --max-num-workers 2
opencompass --models hf_internlm2_5_20b_chat --datasets demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result1 --debug
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result2 --debug --max-num-workers 2
opencompass --models hf_internlm2_5_7b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy --work-dir ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result3 --debug --max-num-workers 2
- name: Get result
run: |
score=$(sed -n '$p' regression_result1/*/summary/*.csv | awk -F ',' '{print $NF}')
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result1/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 88 && ${score%.*} <= 89 )); then
echo "score is $score between 88 and 89"
else
echo "score is $score not between 88 and 89"
exit 1
fi
score=$(sed -n '$p' regression_result2/*/summary/*.csv | awk -F ',' '{print $NF}')
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result2/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 87 && ${score%.*} <= 88 )); then
echo "score is $score between 87 and 88"
else
echo "score is $score not between 87 and 88"
exit 1
fi
score=$(sed -n '$p' regression_result3/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 84 && ${score%.*} <= 87 )); then
echo "score is $score between 84 and 87"
score=$(sed -n '$p' ${{env.REPORT_ROOT}}/${{ github.run_id }}/regression_result3/*/summary/*.csv | awk -F ',' '{print $NF}')
if (( ${score%.*} >= 87 && ${score%.*} <= 91 )); then
echo "score is $score between 87 and 91"
else
echo "score is $score not between 84 and 87"
echo "score is $score not between 87 and 91"
exit 1
fi
rm -rf regression_result1 & rm -rf regression_result2 & rm -rf regression_result3
- name: Uninstall opencompass
if: always()
run: |
. /cpfs01/shared/public/qa-llm-cicd/miniconda3/bin/activate
conda activate ${{env.CONDA_ENV}}${{ runner.name }}
. ${{env.CONDA_PATH}}/bin/activate
conda activate ${{env.CONDA_ENV}}
python3 -m pip uninstall opencompass -y
conda info --envs
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'develop' || github.ref_name == 'main') }}
needs: [pr_run_test]
environment: 'prod'
timeout-minutes: 5
runs-on: self-hosted
environment: 'prod'
steps:
- name: notify
run: |

View File

@ -20,7 +20,7 @@ jobs:
matrix:
python-version: ['3.10']
include:
- torch: 2.0.0
- torch: 2.5.1
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
@ -30,7 +30,7 @@ jobs:
- name: Upgrade pip
run: python -m pip install --upgrade pip
- name: Install PyTorch
run: pip install torch==${{matrix.torch}}+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
run: pip install torch==${{matrix.torch}} -f https://download.pytorch.org/whl/cpu/torch_stable.html
- name: Install system dependencies
run: |
sudo sed -i '$ a deb http://th.archive.ubuntu.com/ubuntu jammy main' /etc/apt/sources.list
@ -106,7 +106,7 @@ jobs:
- name: Upgrade pip
run: python -m pip install pip --upgrade
- name: Install PyTorch
run: pip install torch==2.0.0+${{matrix.platform}} -f https://download.pytorch.org/whl/${{matrix.platform}}/torch_stable.html
run: pip install torch==2.5.1 -f https://download.pytorch.org/whl/cpu/torch_stable.html
- name: Install opencompass dependencies
run: |
pip install -r requirements.txt

View File

@ -1,21 +1,26 @@
name: deploy
on: push
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
on:
push:
workflow_dispatch:
inputs:
confirm_publish:
description: 'Type YES to confirm publishing to PyPI'
required: true
type: string
jobs:
build-n-publish:
runs-on: ubuntu-latest
if: startsWith(github.event.ref, 'refs/tags')
if: |
github.event_name == 'push' && startsWith(github.event.ref, 'refs/tags') ||
(github.event_name == 'workflow_dispatch' && inputs.confirm_publish == 'YES')
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
uses: actions/setup-python@v1
- name: Set up Python 3.10
uses: actions/setup-python@v4
with:
python-version: 3.7
python-version: '3.10'
- name: Build lagent
run: |
pip install wheel

View File

@ -1,6 +1,7 @@
exclude: |
(?x)^(
tests/data/|
tests/dataset/|
opencompass/models/internal/|
opencompass/utils/internal/|
opencompass/openicl/icl_evaluator/hf_metrics/|
@ -10,12 +11,9 @@ exclude: |
opencompass/datasets/teval/|
opencompass/datasets/NPHardEval/|
opencompass/datasets/TheoremQA|
opencompass/datasets/subjective/mtbench101.py|
docs/zh_cn/advanced_guides/compassbench_intro.md |
docs/zh_cn/advanced_guides/compassbench_v2_0.md |
opencompass/configs/datasets/ |
opencompass/configs/models/|
opencompass/configs/summarizers/|
opencompass/configs/dataset_collections/ |
opencompass/utils/datasets.py |
opencompass/utils/datasets_info.py
)
@ -26,8 +24,8 @@ repos:
- id: flake8
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-isort
rev: 5.11.5
@ -35,8 +33,8 @@ repos:
- id: isort
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-yapf
rev: v0.32.0
@ -44,8 +42,8 @@ repos:
- id: yapf
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-codespell
rev: v2.2.1
@ -55,9 +53,8 @@ repos:
(?x)^(
.*\.jsonl|
.*\.md.template|
configs/ |
opencompass/configs/ |
example_scripts/
examples/
)
- repo: https://gitee.com/openmmlab/mirrors-pre-commit-hooks
rev: v4.3.0
@ -67,7 +64,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/.*?/.*\.txt
)
- id: check-yaml
- id: end-of-file-fixer
@ -75,7 +71,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/.*?/.*\.txt
)
- id: requirements-txt-fixer
- id: double-quote-string-fixer
@ -107,7 +102,7 @@ repos:
language: script
pass_filenames: true
require_serial: true
files: ^configs/datasets
files: ^opencompass/configs/datasets
- repo: local
hooks:
- id: update-dataset-suffix-pacakge
@ -120,44 +115,15 @@ repos:
args:
- --root_folder
- opencompass/configs/datasets
- repo: local
- repo: https://gitee.com/mirrors/gitleaks
rev: v8.23.1
hooks:
- id: compare-configs-datasets
name: compare configs datasets
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/datasets
- opencompass/configs/datasets
- repo: local
hooks:
- id: compare-configs-models
name: compare configs models
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/models
- opencompass/configs/models
- --ignore
- llama
- repo: local
hooks:
- id: compare-configs-summarizers
name: compare configs summarizers
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/summarizers
- opencompass/configs/summarizers
- id: gitleaks
entry: "gitleaks dir"
args: ["--verbose", "--redact=50"]
# - repo: https://github.com/open-mmlab/pre-commit-hooks
# rev: v0.2.0 # Use the ref you want to point at
# hooks:
# - id: check-algo-readme
# - id: check-copyright
# args: ["mmocr", "tests", "tools"] # these directories will be checked
# args: ["mmocr", "tests", "tools"] # these directories will be checked

View File

@ -8,16 +8,13 @@ exclude: |
opencompass/datasets/lawbench/utils|
opencompass/datasets/lawbench/evaluation_functions/|
opencompass/datasets/medbench/|
opencompass/datasets/matbench/|
opencompass/datasets/teval/|
opencompass/datasets/NPHardEval/|
opencompass/datasets/TheoremQA|
opencompass/datasets/subjective/mtbench101.py|
docs/zh_cn/advanced_guides/compassbench_intro.md |
docs/zh_cn/advanced_guides/compassbench_v2_0.md |
opencompass/configs/datasets/ |
opencompass/configs/models/|
opencompass/configs/summarizers/ |
opencompass/configs/dataset_collections/ |
opencompass/utils/datasets.py |
opencompass/utils/datasets_info.py
)
@ -28,8 +25,8 @@ repos:
- id: flake8
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://github.com/PyCQA/isort
rev: 5.11.5
@ -37,8 +34,8 @@ repos:
- id: isort
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://github.com/pre-commit/mirrors-yapf
rev: v0.32.0
@ -46,8 +43,8 @@ repos:
- id: yapf
exclude: |
(?x)^(
configs/ |
example_scripts/
opencompass/configs/|
examples/
)
- repo: https://github.com/codespell-project/codespell
rev: v2.2.1
@ -57,9 +54,8 @@ repos:
(?x)^(
.*\.jsonl|
.*\.md.template|
configs/ |
opencompass/configs/ |
example_scripts/
examples/
)
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
@ -69,7 +65,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/.*?/.*\.txt
)
- id: check-yaml
- id: end-of-file-fixer
@ -77,7 +72,6 @@ repos:
(?x)^(
dicts/|
projects/.*?/dicts/|
configs/.*?/.*\.txt
)
- id: requirements-txt-fixer
- id: double-quote-string-fixer
@ -109,7 +103,7 @@ repos:
language: script
pass_filenames: true
require_serial: true
files: ^configs/datasets
files: ^opencompass/configs/datasets
- repo: local
hooks:
- id: update-dataset-suffix-pacakge
@ -122,45 +116,15 @@ repos:
args:
- --root_folder
- opencompass/configs/datasets
- repo: local
- repo: https://github.com/gitleaks/gitleaks
rev: v8.23.1
hooks:
- id: compare-configs-datasets
name: compare configs datasets
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/datasets
- opencompass/configs/datasets
- repo: local
hooks:
- id: compare-configs-models
name: compare configs models
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/models
- opencompass/configs/models
- --ignore
- llama
- repo: local
hooks:
- id: compare-configs-summarizers
name: compare configs summarizers
entry: ./tools/compare_configs.py
language: script
pass_filenames: false
# require_serial: true
args:
- configs/summarizers
- opencompass/configs/summarizers
- id: gitleaks
entry: "gitleaks dir"
args: ["--verbose", "--redact=50"]
# - repo: https://github.com/open-mmlab/pre-commit-hooks
# rev: v0.2.0 # Use the ref you want to point at
# hooks:
# - id: check-algo-readme
# - id: check-copyright
# args: ["mmocr", "tests", "tools"] # these directories will be checked
# args: ["mmocr", "tests", "tools"] # these directories will be checked

View File

@ -1,2 +1,3 @@
recursive-include opencompass/configs *.py *.yml *.json *.txt *.md
recursive-include opencompass/openicl/icl_evaluator/hf_metrics *.py
recursive-include opencompass/datasets *.py *.yml *.json *.txt *.md *.yaml

368
README.md
View File

@ -57,6 +57,14 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
- **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
- **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
- **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
- **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
- **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
- **\[2024.11.14\]** OpenCompass now supports the brand new long-context language model evaluation benchmark — [BABILong](https://arxiv.org/pdf/2406.10149). Have a look at the [demo](examples/eval_babilong.py) and give it a try! 🔥🔥🔥
- **\[2024.10.14\]** We now support the OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU). Feel free to give it a try! 🔥🔥🔥
- **\[2024.09.19\]** We now support [Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) with multiple backend(huggingface/vllm/lmdeploy). Feel free to give them a try! 🔥🔥🔥
- **\[2024.09.17\]** We now support OpenAI o1(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`). Feel free to give them a try! 🔥🔥🔥
@ -76,6 +84,8 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
We provide [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) for the community to rank all public models and API models. If you would like to join the evaluation, please provide the model repository URL or a standard API interface to the email address `opencompass@pjlab.org.cn`.
You can also refer to [CompassAcademic](configs/eval_academic_leaderboard_202412.py) to quickly reproduce the leaderboard results. The currently selected datasets include Knowledge Reasoning (MMLU-Pro/GPQA Diamond), Logical Reasoning (BBH), Mathematical Reasoning (MATH-500, AIME), Code Generation (LiveCodeBench, HumanEval), and Instruction Following (IFEval)."
<p align="right"><a href="#top">🔝Back to top</a></p>
## 🛠️ Installation
@ -167,69 +177,83 @@ Some third-party features, like Humaneval and Llama, may require additional step
After ensuring that OpenCompass is installed correctly according to the above steps and the datasets are prepared. Now you can start your first evaluation using OpenCompass!
- Your first evaluation with OpenCompass!
### Your first evaluation with OpenCompass!
OpenCompass support setting your configs via CLI or a python script. For simple evaluation settings we recommend using CLI, for more complex evaluation, it is suggested using the script way. You can find more example scripts under the configs folder.
OpenCompass support setting your configs via CLI or a python script. For simple evaluation settings we recommend using CLI, for more complex evaluation, it is suggested using the script way. You can find more example scripts under the configs folder.
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass ./configs/eval_chat_demo.py
```
# Python scripts
opencompass examples/eval_chat_demo.py
```
You can find more script examples under [configs](./configs) folder.
You can find more script examples under [examples](./examples) folder.
- API evaluation
### API evaluation
OpenCompass, by its design, does not really discriminate between open-source models and API models. You can evaluate both model types in the same way or even in one settings.
OpenCompass, by its design, does not really discriminate between open-source models and API models. You can evaluate both model types in the same way or even in one settings.
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
```bash
export OPENAI_API_KEY="YOUR_OPEN_API_KEY"
# CLI
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python scripts
opencompass ./configs/eval_api_demo.py
# Python scripts
opencompass examples/eval_api_demo.py
# You can use o1_mini_2024_09_12/o1_preview_2024_09_12 for o1 models, we set max_completion_tokens=8192 as default.
```
# You can use o1_mini_2024_09_12/o1_preview_2024_09_12 for o1 models, we set max_completion_tokens=8192 as default.
```
- Accelerated Evaluation
### Accelerated Evaluation
Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:
Additionally, if you want to use an inference backend other than HuggingFace for accelerated evaluation, such as LMDeploy or vLLM, you can do so with the command below. Please ensure that you have installed the necessary packages for the chosen backend and that your model supports accelerated inference with it. For more information, see the documentation on inference acceleration backends [here](docs/en/advanced_guides/accelerator_intro.md). Below is an example using LMDeploy:
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```bash
# CLI
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
# Python scripts
opencompass ./configs/eval_lmdeploy_demo.py
```
# Python scripts
opencompass examples/eval_lmdeploy_demo.py
```
- Supported Models
### Supported Models and Datasets
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
OpenCompass has predefined configurations for many models and datasets. You can list all available model and dataset configurations using the [tools](./docs/en/tools.md#list-configs).
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
```bash
# List all configurations
python tools/list_configs.py
# List all configurations related to llama and mmlu
python tools/list_configs.py llama mmlu
```
If the model is not on the list but supported by Huggingface AutoModel class, you can also evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
#### Supported Models
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
If the model is not on the list but supported by Huggingface AutoModel class or encapsulation of inference engine based on OpenAI interface (see [docs](https://opencompass.readthedocs.io/en/latest/advanced_guides/new_model.html) for details), you can also evaluate it with OpenCompass. You are welcome to contribute to the maintenance of the OpenCompass supported model and dataset lists.
If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
#### Supported Datasets
Currently, OpenCompass have provided standard recommended configurations for datasets. Generally, config files ending with `_gen.py` or `_llm_judge_gen.py` will point to the recommended config we provide for this dataset. You can refer to [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for more details.
```bash
# Recommended Evaluation Config based on Rules
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# Recommended Evaluation Config based on LLM Judge
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
```
> \[!TIP\]
>
@ -273,263 +297,15 @@ OpenCompass is a one-stop platform for large model evaluation, aiming to provide
## 📖 Dataset Support
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Language</b>
</td>
<td>
<b>Knowledge</b>
</td>
<td>
<b>Reasoning</b>
</td>
<td>
<b>Examination</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Word Definition</b></summary>
We have supported a statistical list of all datasets that can be used on this platform in the documentation on the OpenCompass website.
- WiC
- SummEdits
You can quickly find the dataset you need from the list through sorting, filtering, and searching functions.
</details>
In addition, we provide a recommended configuration for each dataset, and some datasets also support LLM Judge-based configurations.
<details open>
<summary><b>Idiom Learning</b></summary>
Please refer to the dataset statistics chapter of [docs](https://opencompass.readthedocs.io/en/latest/dataset_statistics.html) for details.
- CHID
</details>
<details open>
<summary><b>Semantic Similarity</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>Coreference Resolution</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>Translation</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>Multi-language Question Answering</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>Multi-language Summary</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>Knowledge Question Answering</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>Textual Entailment</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>Commonsense Reasoning</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>Mathematical Reasoning</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>Theorem Application</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>Comprehensive Reasoning</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>Junior High, High School, University, Professional Examinations</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>Medical Examinations</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>Understanding</b>
</td>
<td>
<b>Long Context</b>
</td>
<td>
<b>Safety</b>
</td>
<td>
<b>Code</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>Reading Comprehension</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>Content Summary</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>Content Analysis</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>Long Context Understanding</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>Safety</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>Robustness</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>Code</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
<p align="right"><a href="#top">🔝Back to top</a></p>
## 📖 Model Support

View File

@ -57,6 +57,12 @@
## 🚀 最新进展 <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2025.04.01\]** OpenCompass 现已支持 `CascadeEvaluator`,允许多个评估器按顺序工作,可以为更复杂的评估场景创建自定义评估流程,查看[文档](docs/zh_cn/advanced_guides/llm_judge.md)了解具体用法!🔥🔥🔥
- **\[2025.03.11\]** 现已支持 `SuperGPQA` 覆盖285 个研究生学科的知识能力评测,欢迎尝试!🔥🔥🔥
- **\[2025.02.28\]** 我们为 `DeepSeek-R1` 系列模型添加了教程,请查看 [评估推理模型](docs/zh_cn/user_guides/deepseek_r1.md) 了解更多详情!🔥🔥🔥
- **\[2025.02.15\]** 我们新增了两个实用的评测工具用于LLM作为评判器的`GenericLLMEvaluator`和用于数学推理评估的`MATHVerifyEvaluator`。查看[LLM评判器](docs/zh_cn/advanced_guides/llm_judge.md)和[数学能力评测](docs/zh_cn/advanced_guides/general_math.md)文档了解更多详情!🔥🔥🔥
- **\[2025.01.16\]** 我们现已支持 [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) 模型,该模型在推理、知识类任务上取得同量级最优性能,欢迎尝试。
- **\[2024.12.17\]** 我们提供了12月CompassAcademic学术榜单评估脚本 [CompassAcademic](configs/eval_academic_leaderboard_202412.py),你可以通过简单地配置复现官方评测结果。
- **\[2024.10.14\]** 现已支持OpenAI多语言问答数据集[MMMLU](https://huggingface.co/datasets/openai/MMMLU),欢迎尝试! 🔥🔥🔥
- **\[2024.09.19\]** 现已支持[Qwen2.5](https://huggingface.co/Qwen)(0.5B to 72B) ,可以使用多种推理后端(huggingface/vllm/lmdeploy), 欢迎尝试! 🔥🔥🔥
- **\[2024.09.05\]** 现已支持OpenAI o1 模型(`o1-mini-2024-09-12` and `o1-preview-2024-09-12`), 欢迎尝试! 🔥🔥🔥
@ -76,6 +82,8 @@
我们将陆续提供开源模型和 API 模型的具体性能榜单,请见 [OpenCompass Leaderboard](https://rank.opencompass.org.cn/home) 。如需加入评测,请提供模型仓库地址或标准的 API 接口至邮箱 `opencompass@pjlab.org.cn`.
你也可以参考[CompassAcademic](configs/eval_academic_leaderboard_202412.py),快速地复现榜单的结果,目前选取的数据集包括 综合知识推理 (MMLU-Pro/GPQA Diamond) ,逻辑推理 (BBH) ,数学推理 (MATH-500, AIME) ,代码生成 (LiveCodeBench, HumanEval) ,指令跟随 (IFEval) 。
<p align="right"><a href="#top">🔝返回顶部</a></p>
## 🛠️ 安装指南
@ -165,17 +173,17 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
- ### 首次评测
OpenCompass 支持通过命令行界面 (CLI) 或 Python 脚本来设置配置。对于简单的评估设置,我们推荐使用 CLI而对于更复杂的评估则建议使用脚本方式。你可以在configs文件夹下找到更多脚本示例。
OpenCompass 支持通过命令行界面 (CLI) 或 Python 脚本来设置配置。对于简单的评估设置,我们推荐使用 CLI而对于更复杂的评估则建议使用脚本方式。你可以在examples文件夹下找到更多脚本示例。
```bash
# 命令行界面 (CLI)
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass ./configs/eval_chat_demo.py
opencompass examples/eval_chat_demo.py
```
你可以在[configs](./configs) 文件夹下找到更多的脚本示例。
你可以在[examples](./examples) 文件夹下找到更多的脚本示例。
- ### API评测
@ -187,7 +195,7 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
opencompass --models gpt_4o_2024_05_13 --datasets demo_gsm8k_chat_gen
# Python 脚本
opencompass ./configs/eval_api_demo.py
opencompass examples/eval_api_demo.py
# 现已支持 o1_mini_2024_09_12/o1_preview_2024_09_12 模型, 默认情况下 max_completion_tokens=8192.
@ -201,9 +209,9 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
opencompass --models hf_internlm2_5_1_8b_chat --datasets demo_gsm8k_chat_gen -a lmdeploy
```
OpenCompass 预定义了许多模型和数据集的配置,你可以通过 [工具](./docs/zh_cn/tools.md#ListConfigs) 列出所有可用的模型和数据集配置。
- ### 支持的模型与数据集
- ### 支持的模型
OpenCompass 预定义了许多模型和数据集的配置,你可以通过 [工具](./docs/zh_cn/tools.md#ListConfigs) 列出所有可用的模型和数据集配置。
```bash
# 列出所有配置
@ -212,13 +220,27 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
python tools/list_configs.py llama mmlu
```
如果模型不在列表中但支持 Huggingface AutoModel 类,您仍然可以使用 OpenCompass 对其进行评估。欢迎您贡献维护 OpenCompass 支持的模型和数据集列表。
#### 支持的模型
如果模型不在列表中,但支持 Huggingface AutoModel 类或支持针对 OpenAI 接口的推理引擎封装(详见[官方文档](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/new_model.html)),您仍然可以使用 OpenCompass 对其进行评估。欢迎您贡献维护 OpenCompass 支持的模型和数据集列表。
```bash
opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat
```
如果你想在多块 GPU 上使用模型进行推理,您可以使用 `--max-num-worker` 参数。
#### 支持的数据集
目前OpenCompass针对数据集给出了标准的推荐配置。通常`_gen.py`或`_llm_judge_gen.py`为结尾的配置文件将指向我们为该数据集提供的推荐配置。您可以参阅[官方文档](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) 的数据集统计章节来获取详细信息。
```bash
# 基于规则的推荐配置
opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat
# 基于LLM Judge的推荐配置
opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
```
此外,如果你想在多块 GPU 上使用模型进行推理,您可以使用 `--max-num-worker` 参数。
```bash
CUDA_VISIBLE_DEVICES=0,1 opencompass --datasets demo_gsm8k_chat_gen --hf-type chat --hf-path internlm/internlm2_5-1_8b-chat --max-num-worker 2
@ -270,263 +292,11 @@ OpenCompass 是面向大模型评测的一站式平台。其主要特点如下
## 📖 数据集支持
<table align="center">
<tbody>
<tr align="center" valign="bottom">
<td>
<b>语言</b>
</td>
<td>
<b>知识</b>
</td>
<td>
<b>推理</b>
</td>
<td>
<b>考试</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>字词释义</b></summary>
我们已经在OpenCompass官网的文档中支持了所有可在本平台上使用的数据集的统计列表。
- WiC
- SummEdits
您可以通过排序、筛选和搜索等功能从列表中快速找到您需要的数据集。
</details>
<details open>
<summary><b>成语习语</b></summary>
- CHID
</details>
<details open>
<summary><b>语义相似度</b></summary>
- AFQMC
- BUSTM
</details>
<details open>
<summary><b>指代消解</b></summary>
- CLUEWSC
- WSC
- WinoGrande
</details>
<details open>
<summary><b>翻译</b></summary>
- Flores
- IWSLT2017
</details>
<details open>
<summary><b>多语种问答</b></summary>
- TyDi-QA
- XCOPA
</details>
<details open>
<summary><b>多语种总结</b></summary>
- XLSum
</details>
</td>
<td>
<details open>
<summary><b>知识问答</b></summary>
- BoolQ
- CommonSenseQA
- NaturalQuestions
- TriviaQA
</details>
</td>
<td>
<details open>
<summary><b>文本蕴含</b></summary>
- CMNLI
- OCNLI
- OCNLI_FC
- AX-b
- AX-g
- CB
- RTE
- ANLI
</details>
<details open>
<summary><b>常识推理</b></summary>
- StoryCloze
- COPA
- ReCoRD
- HellaSwag
- PIQA
- SIQA
</details>
<details open>
<summary><b>数学推理</b></summary>
- MATH
- GSM8K
</details>
<details open>
<summary><b>定理应用</b></summary>
- TheoremQA
- StrategyQA
- SciBench
</details>
<details open>
<summary><b>综合推理</b></summary>
- BBH
</details>
</td>
<td>
<details open>
<summary><b>初中/高中/大学/职业考试</b></summary>
- C-Eval
- AGIEval
- MMLU
- GAOKAO-Bench
- CMMLU
- ARC
- Xiezhi
</details>
<details open>
<summary><b>医学考试</b></summary>
- CMB
</details>
</td>
</tr>
</td>
</tr>
</tbody>
<tbody>
<tr align="center" valign="bottom">
<td>
<b>理解</b>
</td>
<td>
<b>长文本</b>
</td>
<td>
<b>安全</b>
</td>
<td>
<b>代码</b>
</td>
</tr>
<tr valign="top">
<td>
<details open>
<summary><b>阅读理解</b></summary>
- C3
- CMRC
- DRCD
- MultiRC
- RACE
- DROP
- OpenBookQA
- SQuAD2.0
</details>
<details open>
<summary><b>内容总结</b></summary>
- CSL
- LCSTS
- XSum
- SummScreen
</details>
<details open>
<summary><b>内容分析</b></summary>
- EPRSTMT
- LAMBADA
- TNEWS
</details>
</td>
<td>
<details open>
<summary><b>长文本理解</b></summary>
- LEval
- LongBench
- GovReports
- NarrativeQA
- Qasper
</details>
</td>
<td>
<details open>
<summary><b>安全</b></summary>
- CivilComments
- CrowsPairs
- CValues
- JigsawMultilingual
- TruthfulQA
</details>
<details open>
<summary><b>健壮性</b></summary>
- AdvGLUE
</details>
</td>
<td>
<details open>
<summary><b>代码</b></summary>
- HumanEval
- HumanEvalX
- MBPP
- APPs
- DS1000
</details>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
详情请参阅 [官方文档](https://opencompass.readthedocs.io/zh-cn/latest/dataset_statistics.html) 的数据集统计章节。
<p align="right"><a href="#top">🔝返回顶部</a></p>

View File

@ -1,43 +0,0 @@
from mmengine.config import read_base
from opencompass.models import AI360GPT
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='360GPT_S2_V9',
type=AI360GPT,
path='360GPT_S2_V9',
key='xxxxxxxxxxxx',
generation_kwargs={
'temperature': 0.9,
'max_tokens': 2048,
'top_p': 0.5,
'tok_k': 0,
'repetition_penalty': 1.05,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir ='./output/api_360GPT_S2_V9'

View File

@ -1,44 +0,0 @@
from mmengine.config import read_base
from opencompass.models import BaiChuan
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='Baichuan2-53B',
type=BaiChuan,
path='Baichuan2-53B',
api_key='xxxxxx',
secret_key='xxxxx',
url='xxxxx',
generation_kwargs={
'temperature': 0.3,
'top_p': 0.85,
'top_k': 5,
'with_search_enhance': False,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_baichuan53b/'

View File

@ -1,42 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ERNIEBot
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='erniebot',
type=ERNIEBot,
path='erniebot',
key='xxxxxx', # please give you key
secretkey='xxxxxxxxx', # please give your group_id
url='xxxxxxxxx',
generation_kwargs = {
'temperature': 0.8,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_erniebot/'

View File

@ -1,38 +0,0 @@
from mmengine.config import read_base
from opencompass.models import BailingAPI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
from opencompass.configs.summarizers.medium import summarizer
datasets = [
*ceval_datasets,
]
models = [
dict(
path='Bailing-Lite-0830',
token='xxxxxx', # set your key here or in environment variable BAILING_API_KEY
url='https://bailingchat.alipay.com/chat/completions',
type=BailingAPI,
generation_kwargs={},
query_per_second=1,
max_seq_len=4096,
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask),
),
)
work_dir = 'outputs/api_bailing/'

View File

@ -1,44 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ByteDance
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='skylark-pro-public',
type=ByteDance,
path='skylark-pro-public',
accesskey='xxxxxxx',
secretkey='xxxxxxx',
url='xxxxxx',
generation_kwargs={
'temperature': 0.7,
'top_p': 0.9,
'top_k': 0,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_bytedance/'

View File

@ -1,40 +0,0 @@
from mmengine.config import read_base
from opencompass.models import Doubao
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='Doubao-pro-128k',
type=Doubao,
path='ep-xxxxxx',
accesskey='Your_AK',
secretkey='Your_SK',
generation_kwargs={
'temperature': 0.1,
'top_p': 0.9,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)), )
work_dir = 'outputs/api_doubao/'

View File

@ -1,37 +0,0 @@
from mmengine.config import read_base
from opencompass.models import MiniMax
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='minimax_abab5.5-chat',
type=MiniMax,
path='abab5.5-chat',
key='xxxxxxx', # please give you key
group_id='xxxxxxxx', # please give your group_id
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=4,
concurrent_users=4,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_minimax/'

View File

@ -1,40 +0,0 @@
from mmengine.config import read_base
from opencompass.models import MoonShot
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='moonshot-v1-32k',
type=MoonShot,
path='moonshot-v1-32k',
key='xxxxxxx',
url= 'xxxxxxxx',
system_prompt= '你是 Kimi由 Moonshot AI 提供的人工智能助手,你更擅长中文和英文的对话。'
'你会为用户提供安全,有帮助,准确的回答。同时,你会拒绝一些涉及恐怖主义,种族歧视,'
'黄色暴力等问题的回答。Moonshot AI 为专有名词,不可翻译成其他语言。',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=4,
concurrent_users=4,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_moonshot/'

View File

@ -1,36 +0,0 @@
from mmengine.config import read_base
from opencompass.models import Nanbeige
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='nanbeige-plus',
type=Nanbeige,
path='nanbeige-plus',
key='xxxxxx',
query_per_second=1,
max_out_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir ='./output/nanbeige-plus'

View File

@ -1,42 +0,0 @@
from mmengine.config import read_base
from opencompass.models import PanGu
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='pangu',
type=PanGu,
path='pangu',
access_key='xxxxxx',
secret_key='xxxxxx',
url = 'xxxxxx',
# url of token sever, used for generate token, like "https://xxxxxx.myhuaweicloud.com/v3/auth/tokens",
token_url = 'xxxxxx',
# scope-project-name, used for generate token
project_name = 'xxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_pangu/'

View File

@ -1,40 +0,0 @@
from mmengine.config import read_base
from opencompass.models import Qwen
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='qwen-max',
type=Qwen,
path='qwen-max',
key='xxxxxxxxxxxxxxxx', # please give you key
generation_kwargs={
'enable_search': False,
},
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8
),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=1,
concurrent_users=1,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_qwen/'

View File

@ -1,39 +0,0 @@
from mmengine.config import read_base
from opencompass.models import Rendu
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from ..summarizers.medium import summarizer
from ..datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets
]
models = [
dict(
abbr='Rendu',
type=Rendu,
path='rendu',
key='xxxxxx',
url='xxxxxx',
generation_kwargs={
'temperature': 0.1,
'top_p': 0.9,
},
query_per_second=10,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=1,
concurrent_users=1,
task=dict(type=OpenICLInferTask)), )
work_dir = 'outputs/api_rendu/'

View File

@ -1,52 +0,0 @@
from mmengine.config import read_base
from opencompass.models import SenseTime
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='nova-ptc-xl-v1',
type=SenseTime,
path='nova-ptc-xl-v1',
key='xxxxxxxxxxxxxx',
url='xxxxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8,
parameters={
'temperature': 0.8,
'top_p': 0.7,
'max_new_tokens': 1024,
'repetition_penalty': 1.05,
'know_ids': [],
'stream': True,
'user': '#*#***TestUser***#*#',
'knowledge_config': {
'control_level': 'normal',
'knowledge_base_result': False,
'online_search_result': False
}
}
)
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_sensetime/'

View File

@ -1,51 +0,0 @@
from mmengine.config import read_base
from opencompass.models.xunfei_api import XunFei
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
models = [
dict(
abbr='Spark-v1-1',
type=XunFei,
appid='xxxx',
path='ws://spark-api.xf-yun.com/v1.1/chat',
api_secret = 'xxxxxxx',
api_key = 'xxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
dict(
abbr='Spark-v3-1',
type=XunFei,
appid='xxxx',
domain='generalv3',
path='ws://spark-api.xf-yun.com/v3.1/chat',
api_secret = 'xxxxxxxx',
api_key = 'xxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_xunfei/'

View File

@ -1,48 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ZhiPuAI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
# needs a special postprocessor for all
# except 'gsm8k' and 'strategyqa'
from opencompass.utils import general_eval_wrapper_postprocess
for _dataset in datasets:
if _dataset['abbr'] not in ['gsm8k', 'strategyqa']:
if hasattr(_dataset['eval_cfg'], 'pred_postprocessor'):
_dataset['eval_cfg']['pred_postprocessor']['postprocess'] = _dataset['eval_cfg']['pred_postprocessor']['type']
_dataset['eval_cfg']['pred_postprocessor']['type'] = general_eval_wrapper_postprocess
else:
_dataset['eval_cfg']['pred_postprocessor'] = {'type': general_eval_wrapper_postprocess}
models = [
dict(
abbr='chatglm_pro',
type=ZhiPuAI,
path='chatglm_pro',
key='xxxxxxxxxxxx',
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8),
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_zhipu/'

View File

@ -1,67 +0,0 @@
from mmengine.config import read_base
from opencompass.models import ZhiPuV2AI
from opencompass.partitioners import NaivePartitioner
from opencompass.runners.local_api import LocalAPIRunner
from opencompass.tasks import OpenICLInferTask
with read_base():
# from opencompass.configs.datasets.collections.chat_medium import datasets
from opencompass.configs.summarizers.medium import summarizer
from opencompass.configs.datasets.ceval.ceval_gen import ceval_datasets
datasets = [
*ceval_datasets,
]
# needs a special postprocessor for all
# except 'gsm8k' and 'strategyqa'
from opencompass.utils import general_eval_wrapper_postprocess
for _dataset in datasets:
if _dataset['abbr'] not in ['gsm8k', 'strategyqa']:
if hasattr(_dataset['eval_cfg'], 'pred_postprocessor'):
_dataset['eval_cfg']['pred_postprocessor']['postprocess'] = _dataset['eval_cfg']['pred_postprocessor']['type']
_dataset['eval_cfg']['pred_postprocessor']['type'] = general_eval_wrapper_postprocess
else:
_dataset['eval_cfg']['pred_postprocessor'] = {'type': general_eval_wrapper_postprocess}
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
],
)
models = [
dict(
abbr='glm4_notools',
type=ZhiPuV2AI,
path='glm-4',
key='xxxxxx',
generation_kwargs={
'tools': [
{
'type': 'web_search',
'web_search': {
'enable': False # turn off the search
}
}
]
},
meta_template=api_meta_template,
query_per_second=1,
max_out_len=2048,
max_seq_len=2048,
batch_size=8)
]
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(
type=LocalAPIRunner,
max_num_workers=2,
concurrent_users=2,
task=dict(type=OpenICLInferTask)),
)
work_dir = 'outputs/api_zhipu_v2/'

View File

@ -1,22 +0,0 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.mmlu.mmlu_gen_4d595a import mmlu_datasets
from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets
from opencompass.configs.datasets.ceval.ceval_gen_5f30c7 import ceval_datasets
from opencompass.configs.datasets.GaokaoBench.GaokaoBench_no_subjective_gen_4c31db import GaokaoBench_datasets
from opencompass.configs.datasets.triviaqa.triviaqa_wiki_1shot_gen_bc5f21 import triviaqa_datasets
from opencompass.configs.datasets.nq.nq_open_1shot_gen_2e45e5 import nq_datasets
from opencompass.configs.datasets.race.race_gen_69ee4f import race_datasets
from opencompass.configs.datasets.winogrande.winogrande_5shot_gen_b36770 import winogrande_datasets
from opencompass.configs.datasets.hellaswag.hellaswag_10shot_gen_e42710 import hellaswag_datasets
from opencompass.configs.datasets.bbh.bbh_gen_2879b0 import bbh_datasets
from opencompass.configs.datasets.gsm8k.gsm8k_gen_1d7fe4 import gsm8k_datasets
from opencompass.configs.datasets.math.math_0shot_gen_393424 import math_datasets
from opencompass.configs.datasets.TheoremQA.TheoremQA_5shot_gen_6f0af8 import TheoremQA_datasets
from opencompass.configs.datasets.humaneval.humaneval_gen_8e312c import humaneval_datasets
from opencompass.configs.datasets.mbpp.sanitized_mbpp_gen_830460 import sanitized_mbpp_datasets
from opencompass.configs.datasets.gpqa.gpqa_gen_4baadb import gpqa_datasets
from opencompass.configs.datasets.IFEval.IFEval_gen_3321a3 import ifeval_datasets
datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

View File

@ -1,55 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccContaminationEvaluator
from opencompass.datasets import ARCDatasetClean as ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccContaminationEvaluator),
analyze_contamination=True)
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c-test',
path='opencompass/ai2_arc-test',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]

View File

@ -1,53 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess, match_answer_pattern
QUERY_TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.
{question}
A. {textA}
B. {textB}
C. {textC}
D. {textD}
""".strip()
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=QUERY_TEMPLATE)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]

View File

@ -1,48 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever, FixKRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_capital_postprocess
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey',
)
ARC_c_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template=dict(
begin='</E>',
round=[
dict(
role='HUMAN',
prompt='Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:',
),
dict(role='BOT', prompt='{answerKey}'),
],
),
ice_token='</E>',
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 2, 4, 6, 8]),
inferencer=dict(type=GenInferencer, max_out_len=50),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_capital_postprocess),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]

View File

@ -1,63 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever, FixKRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey',
)
ARC_c_infer_cfg = dict(
ice_template=dict(
type=PromptTemplate,
template={
'A': dict(
begin='</E>',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}'),
],
),
'B': dict(
begin='</E>',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}'),
],
),
'C': dict(
begin='</E>',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}'),
],
),
'D': dict(
begin='</E>',
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}'),
],
),
},
ice_token='</E>',
),
retriever=dict(type=FixKRetriever, fix_id_list=[0, 2, 4, 6, 8]),
inferencer=dict(type=PPLInferencer),
)
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .ARC_c_gen_1e0de5 import ARC_c_datasets # noqa: F401, F403

View File

@ -1,44 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=
'Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:'
)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_c_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_c_datasets = [
dict(
abbr='ARC-c',
type=ARCDataset,
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg,
)
]

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .ARC_c_ppl_a450bd import ARC_c_datasets # noqa: F401, F403

View File

@ -1,37 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
opt: dict(
round=[
dict(role='HUMAN', prompt=f'{{question}}\nA. {{textA}}\nB. {{textB}}\nC. {{textC}}\nD. {{textD}}'),
dict(role='BOT', prompt=f'Answer: {opt}'),
]
) for opt in ['A', 'B', 'C', 'D']
},
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]

View File

@ -1,54 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]

View File

@ -1,36 +0,0 @@
from mmengine.config import read_base
# with read_base():
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_c_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_c_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A': 'Question: {question}\nAnswer: {textA}',
'B': 'Question: {question}\nAnswer: {textB}',
'C': 'Question: {question}\nAnswer: {textC}',
'D': 'Question: {question}\nAnswer: {textD}'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_c_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_c_datasets = [
dict(
type=ARCDataset,
abbr='ARC-c',
path='opencompass/ai2_arc-dev',
name='ARC-Challenge',
reader_cfg=ARC_c_reader_cfg,
infer_cfg=ARC_c_infer_cfg,
eval_cfg=ARC_c_eval_cfg)
]

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .ARC_e_gen_1e0de5 import ARC_e_datasets # noqa: F401, F403

View File

@ -1,44 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
from opencompass.utils.text_postprocessors import first_option_postprocess
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(
role='HUMAN',
prompt=
'Question: {question}\nA. {textA}\nB. {textB}\nC. {textC}\nD. {textD}\nAnswer:'
)
], ),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
ARC_e_eval_cfg = dict(
evaluator=dict(type=AccEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=first_option_postprocess, options='ABCD'),
)
ARC_e_datasets = [
dict(
abbr='ARC-e',
type=ARCDataset,
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg,
)
]

View File

@ -1,4 +0,0 @@
from mmengine.config import read_base
with read_base():
from .ARC_e_ppl_a450bd import ARC_e_datasets # noqa: F401, F403

View File

@ -1,37 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
opt: dict(
round=[
dict(role='HUMAN', prompt=f'{{question}}\nA. {{textA}}\nB. {{textB}}\nC. {{textC}}\nD. {{textD}}'),
dict(role='BOT', prompt=f'Answer: {opt}'),
]
) for opt in ['A', 'B', 'C', 'D']
},
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]

View File

@ -1,54 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textA}')
], ),
'B':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textB}')
], ),
'C':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textC}')
], ),
'D':
dict(
round=[
dict(role='HUMAN', prompt='Question: {question}\nAnswer: '),
dict(role='BOT', prompt='{textD}')
], ),
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]

View File

@ -1,34 +0,0 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccEvaluator
from opencompass.datasets import ARCDataset
ARC_e_reader_cfg = dict(
input_columns=['question', 'textA', 'textB', 'textC', 'textD'],
output_column='answerKey')
ARC_e_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
'A': 'Question: {question}\nAnswer: {textA}',
'B': 'Question: {question}\nAnswer: {textB}',
'C': 'Question: {question}\nAnswer: {textC}',
'D': 'Question: {question}\nAnswer: {textD}'
}),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer))
ARC_e_eval_cfg = dict(evaluator=dict(type=AccEvaluator))
ARC_e_datasets = [
dict(
type=ARCDataset,
abbr='ARC-e',
path='opencompass/ai2_arc-easy-dev',
name='ARC-Easy',
reader_cfg=ARC_e_reader_cfg,
infer_cfg=ARC_e_infer_cfg,
eval_cfg=ARC_e_eval_cfg)
]

View File

@ -1,164 +0,0 @@
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
<div align="center">
📃[Paper](https://arxiv.org/abs/2403.14112)
🏰[Project Page](https://opendatalab.github.io/CHARM/)
🏆[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html)
✨[Findings](https://opendatalab.github.io/CHARM/findings.html)
</div>
<div align="center">
📖 <a href="./README_ZH.md"> 中文</a> | <a href="./README.md">English</a>
</div>
## Dataset Description
**CHARM** is the first benchmark for comprehensively and in-depth evaluating the commonsense reasoning ability of large language models (LLMs) in Chinese, which covers both globally known and Chinese-specific commonsense. In addition, the CHARM can evaluate the LLMs' memorization-independent reasoning abilities and analyze the typical errors.
## Comparison of commonsense reasoning benchmarks
<html lang="en">
<table align="center">
<thead class="fixed-header">
<tr>
<th>Benchmarks</th>
<th>CN-Lang</th>
<th>CSR</th>
<th>CN-specifics</th>
<th>Dual-Domain</th>
<th>Rea-Mem</th>
</tr>
</thead>
<tr>
<td>Most benchmarks in <a href="https://arxiv.org/abs/2302.04752"> davis2023benchmarks</a></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/1809.05053"> XNLI</a>, <a
href="https://arxiv.org/abs/2005.00333">XCOPA</a>,<a
href="https://arxiv.org/abs/2112.10668">XStoryCloze</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2007.08124">LogiQA</a>, <a
href="https://arxiv.org/abs/2004.05986">CLUE</a>, <a
href="https://arxiv.org/abs/2306.09212">CMMLU</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2312.12853">CORECODE</a> </td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><strong><a href="https://arxiv.org/abs/2403.14112">CHARM (ours)</a> </strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
</tr>
</table>
"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on <strong>C</strong>ommon<strong>S</strong>ense <strong>R</strong>easoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected <strong>rea</strong>soning and <strong>mem</strong>orization tasks.
## 🛠️ How to Use
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.
### 1. Download CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
cd ${path_to_opencompass}
mkdir data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
### 2. Run Inference and Evaluation
```bash
cd ${path_to_opencompass}
# modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details
# modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details
```
The inference and evaluation results would be in `${path_to_opencompass}/outputs`, like this:
```bash
outputs
├── CHARM_mem
│ └── chat
│ └── 20240605_151442
│ ├── predictions
│ │ ├── internlm2-chat-1.8b-turbomind
│ │ ├── llama-3-8b-instruct-lmdeploy
│ │ └── qwen1.5-1.8b-chat-hf
│ ├── results
│ │ ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│ │ ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│ │ └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│   └── summary
│   └── 20240605_205020 # MEMORY_SUMMARY_DIR
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│   └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV
└── CHARM_rea
└── chat
└── 20240605_152359
├── predictions
│ ├── internlm2-chat-1.8b-turbomind
│ ├── llama-3-8b-instruct-lmdeploy
│ └── qwen1.5-1.8b-chat-hf
├── results # REASON_RESULTS_DIR
│ ├── internlm2-chat-1.8b-turbomind
│ ├── llama-3-8b-instruct-lmdeploy
│ └── qwen1.5-1.8b-chat-hf
└── summary
├── summary_20240605_205328.csv # REASON_SUMMARY_CSV
└── summary_20240605_205328.txt
```
### 3. Generate Analysis Results
```bash
cd ${path_to_CHARM_repo}
# generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}
# generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}
# generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```
## 🖊️ Citation
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

View File

@ -1,162 +0,0 @@
# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations [ACL2024]
[![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)
<div align="center">
📃[Paper](https://arxiv.org/abs/2403.14112)
🏰[Project Page](https://opendatalab.github.io/CHARM/)
🏆[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html)
✨[Findings](https://opendatalab.github.io/CHARM/findings.html)
</div>
<div align="center">
📖 <a href="./README_ZH.md"> 中文</a> | <a href="./README.md">English</a>
</div>
## 数据集介绍
**CHARM** 是首个全面深入评估大型语言模型LLMs在中文常识推理能力的基准测试它覆盖了国际普遍认知的常识以及独特的中国文化常识。此外CHARM 还可以评估 LLMs 独立于记忆的推理能力,并分析其典型错误。
## 与其他常识推理评测基准的比较
<html lang="en">
<table align="center">
<thead class="fixed-header">
<tr>
<th>基准</th>
<th>汉语</th>
<th>常识推理</th>
<th>中国特有知识</th>
<th>中国和世界知识域</th>
<th>推理和记忆的关系</th>
</tr>
</thead>
<tr>
<td><a href="https://arxiv.org/abs/2302.04752"> davis2023benchmarks</a> 中提到的基准</td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/1809.05053"> XNLI</a>, <a
href="https://arxiv.org/abs/2005.00333">XCOPA</a>,<a
href="https://arxiv.org/abs/2112.10668">XStoryCloze</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2007.08124">LogiQA</a>,<a
href="https://arxiv.org/abs/2004.05986">CLUE</a>, <a
href="https://arxiv.org/abs/2306.09212">CMMLU</a></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2312.12853">CORECODE</a> </td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
<td><strong><span style="color: red;">&#x2718;</span></strong></td>
</tr>
<tr>
<td><strong><a href="https://arxiv.org/abs/2403.14112">CHARM (ours)</a> </strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
<td><strong><span style="color: green;">&#x2714;</span></strong></td>
</tr>
</table>
## 🛠️ 如何使用
以下是快速下载 CHARM 并在 OpenCompass 上进行评估的步骤。
### 1. 下载 CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}
cd ${path_to_opencompass}
mkdir data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
### 2. 推理和评测
```bash
cd ${path_to_opencompass}
# 修改配置文件`configs/eval_charm_rea.py`: 将现有的模型取消注释,或者添加你想评测的模型
python run.py configs/eval_charm_rea.py -r --dump-eval-details
# 修改配置文件`configs/eval_charm_mem.py`: 将现有的模型取消注释,或者添加你想评测的模型
python run.py configs/eval_charm_mem.py -r --dump-eval-details
```
推理和评测的结果位于路径`${path_to_opencompass}/outputs`, 如下所示:
```bash
outputs
├── CHARM_mem
│ └── chat
│ └── 20240605_151442
│ ├── predictions
│ │ ├── internlm2-chat-1.8b-turbomind
│ │ ├── llama-3-8b-instruct-lmdeploy
│ │ └── qwen1.5-1.8b-chat-hf
│ ├── results
│ │ ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│ │ ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│ │ └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│   └── summary
│   └── 20240605_205020 # MEMORY_SUMMARY_DIR
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│   └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV
└── CHARM_rea
└── chat
└── 20240605_152359
├── predictions
│ ├── internlm2-chat-1.8b-turbomind
│ ├── llama-3-8b-instruct-lmdeploy
│ └── qwen1.5-1.8b-chat-hf
├── results # REASON_RESULTS_DIR
│ ├── internlm2-chat-1.8b-turbomind
│ ├── llama-3-8b-instruct-lmdeploy
│ └── qwen1.5-1.8b-chat-hf
└── summary
├── summary_20240605_205328.csv # REASON_SUMMARY_CSV
└── summary_20240605_205328.txt
```
### 3. 生成分析结果
```bash
cd ${path_to_CHARM_repo}
# 生成论文中的Table5, Table6, Table9 and Table10详见https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}
# 生成论文中的Figure3 and Figure9详见https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}
# 生成论文中的Table7, Table12, Table13 and Figure11详见https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```
## 🖊️ 引用
```bibtex
@misc{sun2024benchmarking,
title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
year={2024},
eprint={2403.14112},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```

View File

@ -1,63 +0,0 @@
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, CharmMemoryEvaluator, LMEvaluator
with read_base():
from .charm_memory_settings import charm_memory_tasks, judge_system_prompts, dataset_path
charm_memory_datasets = []
for _task in charm_memory_tasks:
charm_memory_reader_cfg = dict(input_columns=['input'],
output_column='target')
charm_memory_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(role='HUMAN', prompt='请尽可能简短地回答下述问题。\n问题:{input}\n答:')
]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)
if _task == 'Chinese_Movie_and_Music_Recommendation':
charm_memory_eval_cfg = dict(
evaluator=dict(type=CharmMemoryEvaluator),
pred_role='BOT',
)
else:
judge_system_prompt = judge_system_prompts[_task]
charm_memory_eval_cfg = dict(
evaluator=dict(
type=LMEvaluator,
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[
dict(
role='HUMAN',
prompt=judge_system_prompt +
"\n\n[Question]\n{input}\n[The Start of Reference Answer]\n{target}\n[The End of Reference Answer]\n\n[The Start of Assistant's Answer]\n{prediction}\n[The End of Assistant's Answer]" # noqa
),
]),
),
),
pred_role='BOT',
)
charm_memory_datasets.append(
dict(
type=CharmDataset,
path=dataset_path,
name=_task,
abbr='charm-memory-' + _task,
reader_cfg=charm_memory_reader_cfg,
infer_cfg=charm_memory_infer_cfg.copy(),
eval_cfg=charm_memory_eval_cfg.copy(),
))

View File

@ -1,31 +0,0 @@
import os
charm_memory_tasks = [
'Chinese_Anachronisms_Judgment',
'Chinese_Movie_and_Music_Recommendation',
'Chinese_Sport_Understanding',
'Chinese_Time_Understanding',
]
dataset_path = 'data/CHARM/memorization'
system_prompt_template = """Please act as an impartial judge, comparing the responses of the AI assistants to the reference answer and determining if the answers are correct.
You will receive the reference answer provided by a human and the responses of the AI assistants.
Your task is to judge whether the AI assistant's answers is correct.
{task_specific_prompt}
After providing your explanation, strictly output your final judgment in the following format: [正确] if the AI assistant's response is correct, “[错误]” if the AI assistant's response is incorrect.
"""
task_specific_prompts = {
'Chinese_Anachronisms_Judgment':
"If the provided reference answer is a list, the model's prediction is considered correct if it matches any item in the list.",
'Chinese_Time_Understanding':
"When evaluating the AI assistant's response regarding Chinese solar terms, as long as the AI assistant's response falls within the time frame provided in the reference answer, consider it correct.",
'Chinese_Sport_Understanding':
"If the provided reference answer is a list, the model's prediction is considered correct if it matches any item in the list."
}
judge_system_prompts = {
k: system_prompt_template.format(task_specific_prompt=v)
for k, v in task_specific_prompts.items()
}

View File

@ -1,50 +0,0 @@
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, charm_reason_postprocess, CharmReasonEvaluator
with read_base():
from .charm_reason_settings import charm_tasks, settings
settings = [s for s in settings if s[0] in ['ZH-CoT', 'EN-CoT']]
charm_reason_datasets = []
for _cot, _cot_prefix, dataset_path, fewshot_example_path, prompt_template in settings:
for _task in charm_tasks:
_fewshot_example_file = os.path.join(fewshot_example_path, f'{_task}_{_cot}.txt')
with open(_fewshot_example_file, 'r') as f:
_hint = f.read()
charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
charm_reason_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=prompt_template.format(_hint=_hint) + _cot_prefix)]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)
charm_reason_eval_cfg = dict(
evaluator=dict(type=CharmReasonEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=charm_reason_postprocess),
dataset_postprocessor=dict(type=charm_reason_postprocess),
)
charm_reason_datasets.append(
dict(
type=CharmDataset,
path=dataset_path,
name=_task,
abbr='charm-reason-' + _task + '_' + _cot,
reader_cfg=charm_reason_reader_cfg,
infer_cfg=charm_reason_infer_cfg.copy(),
eval_cfg=charm_reason_eval_cfg.copy(),
)
)

View File

@ -1,49 +0,0 @@
import os
from mmengine.config import read_base
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import CharmDataset, charm_reason_postprocess, CharmReasonEvaluator
with read_base():
from .charm_reason_settings import charm_tasks, settings
charm_reason_datasets = []
for _cot, _cot_prefix, dataset_path, fewshot_example_path, prompt_template in settings:
for _task in charm_tasks:
_fewshot_example_file = os.path.join(fewshot_example_path, f'{_task}_{_cot}.txt')
with open(_fewshot_example_file, 'r') as f:
_hint = f.read()
charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
charm_reason_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(round=[dict(role='HUMAN', prompt=prompt_template.format(_hint=_hint) + _cot_prefix)]),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer, max_out_len=512),
)
charm_reason_eval_cfg = dict(
evaluator=dict(type=CharmReasonEvaluator),
pred_role='BOT',
pred_postprocessor=dict(type=charm_reason_postprocess),
dataset_postprocessor=dict(type=charm_reason_postprocess),
)
charm_reason_datasets.append(
dict(
type=CharmDataset,
path=dataset_path,
name=_task,
abbr='charm-reason-' + _task + '_' + _cot,
reader_cfg=charm_reason_reader_cfg,
infer_cfg=charm_reason_infer_cfg.copy(),
eval_cfg=charm_reason_eval_cfg.copy(),
)
)

View File

@ -1,57 +0,0 @@
import os
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.datasets import CharmDataset
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
charm_tasks = [
['Chinese_Anachronisms_Judgment', 'AB'],
['Chinese_Movie_and_Music_Recommendation', 'ABCD'],
['Chinese_Natural_Language_Inference', 'ABC'],
['Chinese_Reading_Comprehension', 'ABCD'],
['Chinese_Sequence_Understanding', 'ABCD'],
['Chinese_Sport_Understanding', 'AB'],
['Chinese_Time_Understanding', 'ABCD'],
['Global_Anachronisms_Judgment', 'AB'],
['Global_Movie_and_Music_Recommendation', 'ABCD'],
['Global_Natural_Language_Inference', 'ABC'],
['Global_Reading_Comprehension', 'ABCD'],
['Global_Sequence_Understanding', 'ABCD'],
['Global_Sport_Understanding', 'AB'],
['Global_Time_Understanding', 'ABCDEF'],
]
charm_reason_datasets = []
for task_name, options in charm_tasks:
with open(os.path.join(os.path.dirname(__file__), 'few-shot-examples', f'{task_name}_Direct.txt'), 'r') as f:
few_shot_example = f.read()
charm_reason_reader_cfg = dict(input_columns=['input'], output_column='target')
charm_reason_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template={
f'({opt})': f'{few_shot_example}\n{{input}}\nA: {opt}' for opt in options
},
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=PPLInferencer),
)
charm_reason_eval_cfg = dict(evaluator=dict(type=AccwithDetailsEvaluator))
charm_reason_datasets.append(
dict(
type=CharmDataset,
abbr=f'charm-reason-{task_name}_Direct',
path=f'data/CHARM/reasoning',
name=task_name,
reader_cfg=charm_reason_reader_cfg,
infer_cfg=charm_reason_infer_cfg,
eval_cfg=charm_reason_eval_cfg,
)
)

View File

@ -1,36 +0,0 @@
import os
charm_tasks = [
'Chinese_Anachronisms_Judgment',
'Chinese_Movie_and_Music_Recommendation',
'Chinese_Natural_Language_Inference',
'Chinese_Reading_Comprehension',
'Chinese_Sequence_Understanding',
'Chinese_Sport_Understanding',
'Chinese_Time_Understanding',
'Global_Anachronisms_Judgment',
'Global_Movie_and_Music_Recommendation',
'Global_Natural_Language_Inference',
'Global_Reading_Comprehension',
'Global_Sequence_Understanding',
'Global_Sport_Understanding',
'Global_Time_Understanding',
]
XLT_template = 'Follow the given examples and answer the question.\n{_hint}\n\n I want you to act as an commonsense reasoning expert for Chinese. \n Request: {{input}}\n'
Translate_EN_template = 'Follow the given examples and answer the question.\n{_hint}\n\nQ: {{input}}\nA: '
Other_template = '请按照给定的例子回答问题。\n{_hint}\n\nQ{{input}}\nA'
data_dir = 'data/CHARM'
dataset_path_ZH = f'{data_dir}/reasoning'
dataset_path_TransEn = f'{data_dir}/reasoning_Translate-EN'
fewshot_example_path_ZH = os.path.join(os.path.dirname(__file__), 'few-shot-examples')
fewshot_example_path_TransEn = os.path.join(os.path.dirname(__file__), 'few-shot-examples_Translate-EN')
settings = [
('Direct', '', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('ZH-CoT', '让我们一步一步来思考。', dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('EN-CoT', "Let's think step by step.", dataset_path_ZH, fewshot_example_path_ZH, Other_template),
('XLT', """You should retell the request in English.\nYou should do the answer step by step to choose the right answer.\nYou should step-by-step answer the request.\nYou should tell me the answer in this format 'So the answer is'.""", dataset_path_ZH, fewshot_example_path_ZH, XLT_template),
('Translate-EN', "Let's think step by step.", dataset_path_TransEn, fewshot_example_path_TransEn, Translate_EN_template),
]

View File

@ -1,22 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
李白用钢笔写诗。
选项:
(A) 是
(B) 否
A:(A)
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
A:(A)
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
1998年10月1日人们手举五星红旗在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
A:(B)

View File

@ -1,25 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
李白用钢笔写诗。
选项:
(A) 是
(B) 否
ALet's think step by step.
This statement mentions "Li Bai", a poet from the Tang Dynasty in China. The "pen" mentioned in the statement is a modern device, so it is impossible for Li Bai to write poetry with a pen. This statement contains errors from the times. So the answer is (A).
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
ALet's think step by step.
This statement mentions "Dream of the Red Chamber", which was written by Qing Dynasty writer Cao Xueqin. There was no "Dream of the Red Chamber" during the Tang Dynasty, so this statement contains historical errors. So the answer is (A).
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
1998年10月1日人们手举五星红旗在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
ALet's think step by step.
This statement mentions that in 1998, New China was established in 1949, and the five-star red flag was designated as the national flag of China. Therefore, October 1, 1998 is National Day, and it is reasonable for people to celebrate National Day at Tiananmen Square, excluding historical errors. So the answer is (B).

View File

@ -1,63 +0,0 @@
识别给定陈述是否包含时代错误。
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
李白用钢笔写诗。
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: How would a typical person answer each of the following statement whether contains an anachronism?
Li Bai writes poetry with a pen.
Option:
(A) Yes
(B) No
Step-by-step answer:
1.This statement mentions "Li Bai", a poet from the Tang Dynasty in China.
2.The pen mentioned in the statement is a modern device.
3.so it is impossible for Li Bai to write poetry with a pen. This statement contains errors from the times.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: How would a typical person answer each of the following statement whether contains an anachronism?
During the Tang Dynasty, people could buy "Dream of the Red Chamber" on the streets.
Option:
(A) Yes
(B) No
Step-by-step answer:
1.This statement mentions "Dream of the Red Chamber", which was written by Qing Dynasty writer Cao Xueqin
2.During the Tang Dynasty, there was no "Dream of the Red Chamber", so this statement contains historical errors.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
1998年10月1日人们手举五星红旗在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: How would a typical person answer each of the following statement whether contains an anachronism?
On October 1, 1998, people held five-star red flags and celebrated National Day on Tiananmen Square
Option:
(A) Yes
(B) No
Step-by-step answer:
1.This statement mentions that in 1998, New China was established in 1949
2.The Five Star Red Flag was designated as the national flag of China in 1949.
3.So October 1st, 1998 is National Day, and it is reasonable for people to celebrate National Day at Tiananmen Square, so the statement does not contain any historical errors.
So the answer is (B).

View File

@ -1,25 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
李白用钢笔写诗。
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了“李白”,他是中国唐朝时期的诗人。而陈述中提到的“钢笔”是现代设备,因此李白不可能使用钢笔写诗,该陈述包含时代错误。所以答案是(A)。
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
唐朝时期,人们可以在街头买到《红楼梦》。
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了《红楼梦》,《红楼梦》是清代作家曹雪芹所写,唐朝时还没有《红楼梦》,因此该陈述包含时代错误。所以答案是(A)。
Q以下陈述是否包含时代错误请选择正确选项。一个接受了义务教育、具备基本常识的人会如何选择
1998年10月1日人们手举五星红旗在天安门广场上庆祝国庆
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了1998年新中国是1949年成立的五星红旗在1949年被确定为中国国旗因此1998年10月1日是国庆节人们在天安门庆祝国庆是合理的因此陈述不包含时代错误。所以答案是(B)。

View File

@ -1,25 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 和这些电影《疯狂的外星人》、《斗牛》、《杀生》、《疯狂的石头》有共同点的电影是:
选项:
(A)《泰囧》
(B)《少年派》
(C)《江湖儿女》
(D)《湄公河行动》
A: (A)
Q: 和这些电影《红高梁》、《活着》、《大红灯笼高高挂》、《英雄》有共同点的电影是:
选项:
(A)《一个都不能少》
(B)《让子弹飞》
(C)《阿飞正传》
(D)《东邪西毒》
A: (A)
Q: 和这些歌曲《夜曲》、《本草纲目》、《听妈妈的话》、《七里香》有共同点的歌曲是:
选项:
(A)《双节棍》
(B)《年少有为》
(C)《浮夸》
(D)《三人游》
A: (A)

View File

@ -1,40 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 和这些电影《疯狂的外星人》、《斗牛》、《杀生》、《疯狂的石头》有共同点的电影是:
选项:
(A)《泰囧》
(B)《少年派》
(C)《江湖儿女》
(D)《湄公河行动》
ALet's think step by step.
"Crazy Alien" is a comedy science fiction film directed by Ning Hao, written by Liu Cixin and Sun Xiaohang, and starring Huang Bo, Shen Teng, and Xu Zheng. It was released in 2019.
"Cow" is a dark comedy film directed by Guan Hu, starring Huang Bo and Yan Ni. It was released in 2009.
"Design of Death" is an absurd and suspenseful comedy film directed by Guan Hu, featuring Huang Bo, Simon Yam, Su Youpeng, and Yu Nan. It was released in 2012.
"Crazy Stone" is a dark comedy film directed by Ning Hao, featuring Guo Tao, Liu Hua, Lian Jin, Liu Gang, Xu Zheng, and Huang Bo. It was released in 2006.
These are all famous classic Chinese comedy films featuring Huang Bo. The only film among the options that seems to have something in common with these films is "Lost in Thailand" (directed by Xu Zheng, starring Huang Bo, Xu Zheng, and Wang Baoqiang), a comedy film released in 2012. So the answer is (A).
Q: 和这些电影《红高梁》、《活着》、《大红灯笼高高挂》、《英雄》有共同点的电影是:
选项:
(A)《一个都不能少》
(B)《让子弹飞》
(C)《阿飞正传》
(D)《东邪西毒》
A:Let's think step by step.
"Red Sorghum," directed by Zhang Yimou and starring Jiang Wen, Gong Li, and Teng Rujun, is a war drama film that was released in China in 1987.
"To Live," directed by Zhang Yimou and starring Ge You and Gong Li, is a drama film that was released in China in 1994.
"Raise the Red Lantern," directed by Zhang Yimou and starring Gong Li, He Saifei, Ma Jingwu, Cao Cuifen, Kong Lin, and Jin Shuyuan, is a drama film that was released in China in 1991.
"Hero," directed by Zhang Yimou and starring Jet Li, Tony Leung, Maggie Cheung, Chen Daoming, Zhang Ziyi, and Donnie Yen, is a wuxia film that was released in China in 2002.
These are all famous classic Chinese films directed by Zhang Yimou. The only film among the options that seems to have something in common with these films is "Not One Less" (directed by Zhang Yimou, starring Wei Minzhi and Zhang Huike), a drama film released in 1999. So the answer is (A).
Q: 和这些歌曲《夜曲》、《本草纲目》、《听妈妈的话》、《七里香》有共同点的歌曲是:
选项:
(A)《双节棍》
(B)《年少有为》
(C)《浮夸》
(D)《三人游》
ALet's think step by step.
"Nocturne" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Michael Lin. It is included in Jay Chou's 2005 album "November's Chopin."
"Herbalist's Manual" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Michael Lin. It is included in Jay Chou's 2006 album "Still Fantasy."
"Listen to Your Mother" is a song performed by Jay Chou, with lyrics and music by Jay Chou, arrangement by Michael Lin and Hong Jingyao. It is included in Jay Chou's 2006 album "Still Fantasy."
"Common Jasmine Orange" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Chung Hsin-min. It is included in Jay Chou's self-titled album "Common Jasmine Orange" released in 2004.
These are all famous pop songs performed by Jay Chou. The only song among the options that seems to have something in common with these songs is "Nunchucks" (performed by Jay Chou, composed by Jay Chou, lyrics by Vincent Fang, arrangement by Chung Hsin-min, included in Jay Chou's 2001 album "Fantasy"). So the answer is (A).

View File

@ -1,76 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
I want you to act as a commonsense reasoning expert for Chinese.
Request和这些电影《疯狂的外星人》、《斗牛》、《杀生》、《疯狂的石头》有共同点的电影是
选项:
(A)《泰囧》
(B)《少年派》
(C)《江湖儿女》
(D)《湄公河行动》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestThe movie that has something in common with these movies Crazy Aliens, Bullitt, Killjoys and Crazy Stone is:
Options.
(A)Lost in Thailand
(B)The Young and the Restless
(C)The Children of the River and the Lake
(D)The Mekong Operation
Step-by-step answer:
1."Crazy Alien" is a comedy science fiction film directed by Ning Hao, written by Liu Cixin and Sun Xiaohang, and starring Huang Bo, Shen Teng, and Xu Zheng. It was released in 2019.
2."Cow" is a dark comedy film directed by Guan Hu, starring Huang Bo and Yan Ni. It was released in 2009.
3."Design of Death" is an absurd and suspenseful comedy film directed by Guan Hu, featuring Huang Bo, Simon Yam, Su Youpeng, and Yu Nan. It was released in 2012.
4."Crazy Stone" is a dark comedy film directed by Ning Hao, featuring Guo Tao, Liu Hua, Lian Jin, Liu Gang, Xu Zheng, and Huang Bo. It was released in 2006.
5.These are all famous classic Chinese comedy films featuring Huang Bo. The only film among the options that seems to have something in common with these films is "Lost in Thailand" (directed by Xu Zheng, starring Huang Bo, Xu Zheng, and Wang Baoqiang), a comedy film released in 2012.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request和这些电影《红高梁》、《活着》、《大红灯笼高高挂》、《英雄》有共同点的电影是
选项:
(A)《一个都不能少》
(B)《让子弹飞》
(C)《阿飞正传》
(D)《东邪西毒》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestThe movie that has something in common with these movies 'Red High Beam', 'Alive', 'Big Red Lantern' and 'Hero' is:
Option.
(A) 'Not One Less'
(B)Let the Bullets Fly
(C)The Legend of Ah Fei
(D)East is East, West is West
Step-by-step answer:
1."Red Sorghum," directed by Zhang Yimou and starring Jiang Wen, Gong Li, and Teng Rujun, is a war drama film that was released in China in 1987.
2."To Live," directed by Zhang Yimou and starring Ge You and Gong Li, is a drama film that was released in China in 1994.
3."Raise the Red Lantern," directed by Zhang Yimou and starring Gong Li, He Saifei, Ma Jingwu, Cao Cuifen, Kong Lin, and Jin Shuyuan, is a drama film that was released in China in 1991.
4."Hero," directed by Zhang Yimou and starring Jet Li, Tony Leung, Maggie Cheung, Chen Daoming, Zhang Ziyi, and Donnie Yen, is a wuxia film that was released in China in 2002.
5.These are all famous classic Chinese films directed by Zhang Yimou. The only film among the options that seems to have something in common with these films is "Not One Less" (directed by Zhang Yimou, starring Wei Minzhi and Zhang Huike), a drama film released in 1999.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request和这些歌曲《夜曲》、《本草纲目》、《听妈妈的话》、《七里香》有共同点的歌曲是
选项:
(A)《双节棍》
(B)《年少有为》
(C)《浮夸》
(D)《三人游》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestThe songs that have something in common with these songs "Nocturne", "Materia Medica", "Listen to Your Mother" and "Seven Miles" are:
Option.
(A) 'Nunchaku'
(B) 'The Young and the Restless'
(C) 'Pomp and Circumstance'
(D) "Three's a Crowd
Step-by-step answer:
1."Nocturne" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Michael Lin. It is included in Jay Chou's 2005 album "November's Chopin."
2."Herbalist's Manual" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Michael Lin. It is included in Jay Chou's 2006 album "Still Fantasy."
3."Listen to Your Mother" is a song performed by Jay Chou, with lyrics and music by Jay Chou, arrangement by Michael Lin and Hong Jingyao. It is included in Jay Chou's 2006 album "Still Fantasy."
4."Common Jasmine Orange" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Chung Hsin-min. It is included in Jay Chou's self-titled album "Common Jasmine Orange" released in 2004.
5.These are all famous pop songs performed by Jay Chou. The only song among the options that seems to have something in common with these songs is "Nunchucks" (performed by Jay Chou, composed by Jay Chou, lyrics by Vincent Fang, arrangement by Chung Hsin-min, included in Jay Chou's 2001 album "Fantasy").
So the answer is (A).

View File

@ -1,40 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 和这些电影《疯狂的外星人》、《斗牛》、《杀生》、《疯狂的石头》有共同点的电影是:
选项:
(A)《泰囧》
(B)《少年派》
(C)《江湖儿女》
(D)《湄公河行动》
A: 让我们一步一步来思考。
《疯狂的外星人》是由宁浩执导刘慈欣、孙小杭编剧黄渤、沈腾、徐峥等主演的喜剧科幻片2019年上映。
《斗牛》是由管虎执导黄渤、闫妮等主演的黑色喜剧电影2009年上映。
《杀生》是由管虎执导黄渤、任达华、苏有朋、余男等联袂主演的荒诞悬疑喜剧片2012年上映。
《疯狂的石头》是宁浩执导郭涛、刘桦、连晋、刘刚、徐峥、黄渤等出演的黑色喜剧片2006年上映。
这些都是有黄渤出演的著名经典中国喜剧电影在所有选项中唯一与这些电影有相同点的电影似乎是《泰囧》徐峥执导黄渤、徐峥、王宝强主演的喜剧片2012年上映。所以答案是(A)。
Q: 和这些电影《红高梁》、《活着》、《大红灯笼高高挂》、《英雄》有共同点的电影是:
选项:
(A)《一个都不能少》
(B)《让子弹飞》
(C)《阿飞正传》
(D)《东邪西毒》
A: 让我们一步一步来思考。
《红高粱》由张艺谋执导姜文、巩俐、滕汝骏等主演的战争文艺片1987年在中国上映。
《活着》是由张艺谋执导葛优、巩俐等主演的剧情片1994年在中国上映。
《大红灯笼高高挂》是由张艺谋执导巩俐、何赛飞、马精武、曹翠芬、孔琳、金淑媛等主演的剧情片1991年在中国上映。
《英雄》是张艺谋执导由李连杰、梁朝伟、张曼玉、陈道明、章子怡及甄子丹主演的的武侠电影2002年在中国上映。
这些都是由张艺谋执导的著名经典中国电影在所有选项中唯一与这些电影有相同点的电影似乎是《一个都不能少》张艺谋执导魏敏芝、张慧科主演的剧情电影1999年上映。所以答案是(A)。
Q: 和这些歌曲《夜曲》、《本草纲目》、《听妈妈的话》、《七里香》有共同点的歌曲是:
选项:
(A)《双节棍》
(B)《年少有为》
(C)《浮夸》
(D)《三人游》
A: 让我们一步一步来思考。
《夜曲》是周杰伦演唱的一首歌曲由方文山作词周杰伦作曲林迈可编曲收录在周杰伦2005年发行的专辑《11月的萧邦》中
《本草纲目》是周杰伦演唱的一首歌曲由方文山作词周杰伦作曲林迈可编曲收录在周杰伦2006年发行的专辑《依然范特西》中。
《听妈妈的话》是周杰伦演唱的一首歌曲由周杰伦作词、作曲林迈可、洪敬尧编曲收录在周杰伦2006年发行的专辑《依然范特西》中。
《七里香》是周杰伦演唱的一首歌曲由方文山作词周杰伦谱曲钟兴民编曲收录在周杰伦2004年发行的同名专辑《七里香》中。
这些都是由周杰伦演唱的中国著名流行音乐歌曲在所有选项中唯一与这些歌曲有相同点的歌曲似乎是《双节棍》由周杰伦演唱由周杰伦作曲方文山作词钟兴民编曲收录于周杰伦2001年发行的专辑《范特西》中。所以答案是(A)。

View File

@ -1,25 +0,0 @@
请根据题目中两句话的关系选择正确答案。
Q:语句一:小明刚刚去什刹海滑冰
语句二:北京现在是冬季
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(A)
Q:语句一:下周,一股强降水将影响整个河北省
语句二:下周,上海天气很好
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(C)
Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好,只是近黄昏"
语句二:昨天下雨,小明没有出门
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(B)

View File

@ -1,28 +0,0 @@
请根据题目中两句话的关系选择正确答案。
Q:语句一:小明刚刚去什刹海滑冰
语句二:北京现在是冬季
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence mentions that Xiaoming goes to Shichahai for ice skating, which usually takes place in winter. Moreover, Shichahai is located in Beijing, which contains the message from the second sentence that it is currently winter. So the answer is (A).
Q:语句一:下周,一股强降水将影响整个河北省
语句二:下周,上海天气很好
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
These two sentences describe the weather conditions in two geographical locations, one in Hebei Province and the other in Shanghai. Hebei Province and Shanghai are geographically far apart, so the weather conditions in these two places may not necessarily be directly related. So, the relationship between these two sentences is irrelevant. So the answer is (C).
Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好,只是近黄昏"
语句二:昨天下雨,小明没有出门
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence states that Xiaoming saw the sunset on the mountain yesterday, while the second sentence states that it rained yesterday and Xiaoming did not go out. There is a contradiction between these two sentences, because if Xiaoming had not gone out, he could not have seen the sunset on the mountain. So, the relationship between these two sentences is contradictory. So the answer is (B).

View File

@ -1,67 +0,0 @@
请根据题目中两句话的关系选择正确答案。
I want you to act as a commonsense reasoning expert for Chinese.
Request语句一小明刚刚去什刹海滑冰
语句二:北京现在是冬季
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestSentence 1: Xiaoming has just gone ice-skating in Shichahai
Sentence 2: It's winter in Beijing
What is the relationship between these two statements?
(A) Implicit
(B) Contradictory
(C) Irrelevant
Step-by-step answer:
1.The first sentence mentions that Xiaoming goes to Shichahai for ice skating, which usually takes place in winter.
2.Moreover, Shichahai is located in Beijing, which contains the message from the second sentence that it is currently winter.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request语句一下周一股强降水将影响整个河北省
语句二:下周,上海天气很好
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestSentence 1Next week, a heavy rainfall will affect the whole Hebei province
Sentence 2: Next week, the weather in Shanghai will be fine.
What is the relationship between these two statements?
(A) Implied
(B) Contradictory
(C) Irrelevant
Step-by-step answer:
1.These two sentences describe the weather conditions in two geographical locations, one in Hebei Province and the other in Shanghai.
2.Hebei Province and Shanghai are geographically far apart, so the weather conditions in these two places may not necessarily be directly related. So, the relationship between these two sentences is irrelevant.
So the answer is (C).
I want you to act as a commonsense reasoning expert for Chinese.
Request语句一昨天小明在山上看落日感叹道"夕阳无限好,只是近黄昏"
语句二:昨天下雨,小明没有出门
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestSentence 1: Yesterday, Xiao Ming watched the sunset on a hill and exclaimed, "The sunset is infinite, but it's just near dusk"
Sentence 2: Yesterday it rained and Ming didn't go out
What is the relationship between these two statements?
(A) implied
(B) contradictory
(C) Irrelevant
Step-by-step answer:
1.The first sentence states that Xiaoming saw the sunset on the mountain yesterday, while the second sentence states that it rained yesterday and Xiaoming did not go out.
2.There is a contradiction between these two sentences, because if Xiaoming had not gone out, he could not have seen the sunset on the mountain. So, the relationship between these two sentences is contradictory.
So the answer is (B).

View File

@ -1,28 +0,0 @@
请根据题目中两句话的关系选择正确答案。
Q:语句一:小明刚刚去什刹海滑冰
语句二:北京现在是冬季
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:让我们一步一步来思考。
第一句话提到小明去什刹海滑冰,而滑冰通常在冬季进行,而且什刹海位于北京,这蕴含了第二句话的信息,即当前是冬季。所以答案是(A)。
Q:语句一:下周,一股强降水将影响整个河北省
语句二:下周,上海天气很好
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:让我们一步一步来思考。
这两句话描述的是两个地理位置的天气情况,一个是河北省,一个是上海。河北省和上海在地理位置上相距较远,因此,这两个地方的天气情况并不一定有直接关联。所以,这两句话之间的关系是无关的。所以答案是(C)。
Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好,只是近黄昏"
语句二:昨天下雨,小明没有出门
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:让我们一步一步来思考。
第一句话说小明昨天在山上看到了落日,而第二句话说昨天下雨,小明没有出门。这两句话之间存在矛盾,因为如果小明没有出门,那么他就不可能在山上看到落日。所以,这两句话之间的关系是矛盾的。所以答案是(B)。

View File

@ -1,23 +0,0 @@
请理解题目含义并选择正确答案。
Q:有些广东人不爱吃辣椒.因此,有些南方人不爱吃辣椒. 以下哪项能保证上述论证的成立?
(A) 有些广东人爱吃辣椒
(B) 爱吃辣椒的有些是南方人
(C) 所有的广东人都是南方人
(D) 有些广东人不爱吃辣椒也不爱吃甜食
A(C)
Q:唐卡是极富藏族文化特色的一种绘画形式,自吐蕃王朝兴起至今已有1300多年的历史,是雪域高原的文化瑰宝.它的题材除宗教外,还有历史和民俗内容,故又被称为了解西藏的“百科全书”.所以,想要了解西藏的历史,除了正襟危坐地阅读严谨但略显呆板的史书外,你还可以选择一种惬意和愉悦的方式--欣赏唐卡,与众多的古人对话,想象曾经的历史事件,体味藏族人丰富的精神世界,了解独特的藏族民俗,这是一个让历史变得立体可感的过程. 这段文字意在说明:
(A) 唐卡可以给大家提供一种惬意轻松的了解西藏的方式
(B) 唐卡中记录了独特的藏族民俗和曾经的历史事件
(C) 唐卡是了解西藏文化和历史的“百科全书”式的绘画形式
(D) 唐卡是极富藏族文化特色且历史悠久的一种绘画形式
A(A)
Q:“知人论世”作为一种文学批评的原则和方法,最早由战国时期的思想家孟子提出.孟子认为,后人要交结古人,只是读其诗书是不行的,还必须了解他们的为人行事以及他们的生活的时代,这样,才能读懂古人的诗书,才能和古人心契神交,成为知音. 对这段话的理解,不正确的是?
(A) 人的心灵是可以互通和共鸣的
(B) “知人论世”作为一种文学评论发沿用至今并显现了强大的生命力
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A: (B)

View File

@ -1,25 +0,0 @@
请理解题目含义并选择正确答案。
Q:有些广东人不爱吃辣椒.因此,有些南方人不爱吃辣椒. 以下哪项能保证上述论证的成立?
(A) 有些广东人爱吃辣椒
(B) 爱吃辣椒的有些是南方人
(C) 所有的广东人都是南方人
(D) 有些广东人不爱吃辣椒也不爱吃甜食
A: Let's think step by step.
In this argument, we infer from "some Cantonese people do not like to eat chili peppers" that "some southerners do not like to eat chili peppers". The establishment of this reasoning depends on the relationship between Cantonese and Southerners. In order for this reasoning to be valid, we need to ensure that at least a portion of Cantonese people are from the south. Therefore, option (C) "All Cantonese are southerners" can ensure the validity of this argument. So the answer is (C).
Q:唐卡是极富藏族文化特色的一种绘画形式,自吐蕃王朝兴起至今已有1300多年的历史,是雪域高原的文化瑰宝.它的题材除宗教外,还有历史和民俗内容,故又被称为了解西藏的“百科全书”.所以,想要了解西藏的历史,除了正襟危坐地阅读严谨但略显呆板的史书外,你还可以选择一种惬意和愉悦的方式--欣赏唐卡,与众多的古人对话,想象曾经的历史事件,体味藏族人丰富的精神世界,了解独特的藏族民俗,这是一个让历史变得立体可感的过程. 这段文字意在说明:
(A) 唐卡可以给大家提供一种惬意轻松的了解西藏的方式
(B) 唐卡中记录了独特的藏族民俗和曾经的历史事件
(C) 唐卡是了解西藏文化和历史的“百科全书”式的绘画形式
(D) 唐卡是极富藏族文化特色且历史悠久的一种绘画形式
A: Let's think step by step.
It is explicitly mentioned in the article that besides reading rigorous but somewhat rigid historical books, appreciating thangkas is a comfortable and enjoyable way for people to converse with numerous ancient people, imagine past historical events, appreciate the rich spiritual world of Tibetans, and understand unique Tibetan customs. So the main purpose of this passage is (A) "Thangka can provide a comfortable and easy way for everyone to understand Xizang". So the answer is (A).
Q:“知人论世”作为一种文学批评的原则和方法,最早由战国时期的思想家孟子提出.孟子认为,后人要交结古人,只是读其诗书是不行的,还必须了解他们的为人行事以及他们的生活的时代,这样,才能读懂古人的诗书,才能和古人心契神交,成为知音. 对这段话的理解,不正确的是?
(A) 人的心灵是可以互通和共鸣的
(B) “知人论世”作为一种文学评论发沿用至今并显现了强大的生命力
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A: Let's think step by step.
From this passage, we cannot see (B) that "understanding people and discussing the world" as a literary criticism has been used to this day and has shown strong vitality. Although "knowing people and discussing the world" was indeed proposed by the philosopher Mencius during the Warring States period as a principle and method of literary criticism, this passage does not mention that "knowing people and discussing the world" is still in use today, or that it has shown strong vitality. Therefore, option (B) is an incorrect understanding. So the answer is (B).

View File

@ -1,62 +0,0 @@
请理解题目含义并选择正确答案。
I want you to act as a commonsense reasoning expert for Chinese.
Request有些广东人不爱吃辣椒.因此,有些南方人不爱吃辣椒. 以下哪项能保证上述论证的成立?
(A) 有些广东人爱吃辣椒
(B) 爱吃辣椒的有些是南方人
(C) 所有的广东人都是南方人
(D) 有些广东人不爱吃辣椒也不爱吃甜食
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: Some Cantonese people don't like chili peppers. Therefore, some southerners don't like chili peppers. Which of the following ensures the validity of the above argument?
(A) Some Cantonese people love chili peppers
(B) Some Southerners love chili peppers.
(C) All Cantonese are Southerners.
(D) Some Cantonese people do not love chili or sweets.
Step-by-step answer:
1.In this argument, we infer from "some Cantonese people do not like to eat chili peppers" that "some southerners do not like to eat chili peppers".
2.The establishment of this reasoning depends on the relationship between Cantonese and Southerners. In order for this reasoning to be valid, we need to ensure that at least a portion of Cantonese people are from the south. Therefore, option (C) "All Cantonese are southerners" can ensure the validity of this argument.
So the answer is (C).
I want you to act as a commonsense reasoning expert for Chinese.
Request唐卡是极富藏族文化特色的一种绘画形式,自吐蕃王朝兴起至今已有1300多年的历史,是雪域高原的文化瑰宝.它的题材除宗教外,还有历史和民俗内容,故又被称为了解西藏的“百科全书”.所以,想要了解西藏的历史,除了正襟危坐地阅读严谨但略显呆板的史书外,你还可以选择一种惬意和愉悦的方式--欣赏唐卡,与众多的古人对话,想象曾经的历史事件,体味藏族人丰富的精神世界,了解独特的藏族民俗,这是一个让历史变得立体可感的过程. 这段文字意在说明:
(A) 唐卡可以给大家提供一种惬意轻松的了解西藏的方式
(B) 唐卡中记录了独特的藏族民俗和曾经的历史事件
(C) 唐卡是了解西藏文化和历史的“百科全书”式的绘画形式
(D) 唐卡是极富藏族文化特色且历史悠久的一种绘画形式
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: Thangka is a form of painting rich in Tibetan cultural characteristics, which has a history of more than 1300 years since the rise of the Tubo Dynasty and is a cultural treasure of the Snowy Plateau. It is a cultural treasure of the Snowy Plateau. Its subject matter is not only religious, but also historical and folklore content, so it is also known as the "encyclopedia" to understand Tibet. Therefore, if you want to understand the history of Tibet, in addition to sitting down and reading the strict but slightly dull history books, you can also choose a pleasant and enjoyable way - enjoying the thangka, conversing with many ancient people, imagining the historical events, savoring the rich spiritual world of the Tibetans, and understanding the unique folklore of the Tibetans, which is a process to make the history become a three-dimensional and palpable. This is a process of making history three-dimensional and palpable.
(A) Thangkas can provide a cozy and relaxing way to learn about Tibet.
(B) The thangkas are a unique record of Tibetan folklore and historical events.
(C) The thangka is an "encyclopedic" form of painting for understanding Tibetan culture and history.
(D) The thangka is a form of painting that is rich in Tibetan cultural characteristics and has a long history.
Step-by-step answer:
1.It is explicitly mentioned in the article that besides reading rigorous but somewhat rigid historical books, appreciating thangkas is a comfortable and enjoyable way for people to converse with numerous ancient people, imagine past historical events, appreciate the rich spiritual world of Tibetans, and understand unique Tibetan customs.
2.So the main purpose of this passage is (A) "Thangka can provide a comfortable and easy way for everyone to understand Xizang".
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request“知人论世”作为一种文学批评的原则和方法,最早由战国时期的思想家孟子提出.孟子认为,后人要交结古人,只是读其诗书是不行的,还必须了解他们的为人行事以及他们的生活的时代,这样,才能读懂古人的诗书,才能和古人心契神交,成为知音. 对这段话的理解,不正确的是?
(A) 人的心灵是可以互通和共鸣的
(B) “知人论世”作为一种文学评论发沿用至今并显现了强大的生命力
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: As a principle and method of literary criticism, the concept of "knowing people and discussing the world" was first proposed by Mencius, a thinker of the Warring States period. According to Mencius, in order to make friends with the ancients, it is not enough just to read their poems and books, but also to understand their behavior and the times in which they lived, so as to read and understand their poems and books, and to make friends with them and become their soulmates. What is the incorrect understanding of this passage?
(A) People's hearts can communicate and resonate with each other.
(B) "Knowing people and discussing the world" has been used as a kind of literary criticism and has shown a strong vitality.
(C) "Knowing people and discussing the world" can help the descendants to make friends with the ancients and the ancients to become their soulmates.
(D) Knowing about the ancients and their times helps to understand their works.
Step-by-step answer:
1.From this passage, we cannot see (B) that "understanding people and discussing the world" as a literary criticism has been used to this day and has shown strong vitality.
2.Although "knowing people and discussing the world" was indeed proposed by the philosopher Mencius during the Warring States period as a principle and method of literary criticism, this passage does not mention that "knowing people and discussing the world" is still in use today, or that it has shown strong vitality.
3.Therefore, option (B) is an incorrect understanding.
So the answer is (B).

View File

@ -1,26 +0,0 @@
请理解题目含义并选择正确答案。
Q:有些广东人不爱吃辣椒.因此,有些南方人不爱吃辣椒. 以下哪项能保证上述论证的成立?
(A) 有些广东人爱吃辣椒
(B) 爱吃辣椒的有些是南方人
(C) 所有的广东人都是南方人
(D) 有些广东人不爱吃辣椒也不爱吃甜食
A让我们一步一步来思考。
在这个论证中,我们从"有些广东人不爱吃辣椒"推断出"有些南方人不爱吃辣椒"。这个推理的成立依赖于广东人和南方人的关系。为了使这个推理有效,我们需要保证至少一部分广东人是南方人。因此,选项 (C) "所有的广东人都是南方人" 是能保证这个论证成立的。所以答案是(C)。
Q:唐卡是极富藏族文化特色的一种绘画形式,自吐蕃王朝兴起至今已有1300多年的历史,是雪域高原的文化瑰宝.它的题材除宗教外,还有历史和民俗内容,故又被称为了解西藏的“百科全书”.所以,想要了解西藏的历史,除了正襟危坐地阅读严谨但略显呆板的史书外,你还可以选择一种惬意和愉悦的方式--欣赏唐卡,与众多的古人对话,想象曾经的历史事件,体味藏族人丰富的精神世界,了解独特的藏族民俗,这是一个让历史变得立体可感的过程. 这段文字意在说明:
(A) 唐卡可以给大家提供一种惬意轻松的了解西藏的方式
(B) 唐卡中记录了独特的藏族民俗和曾经的历史事件
(C) 唐卡是了解西藏文化和历史的“百科全书”式的绘画形式
(D) 唐卡是极富藏族文化特色且历史悠久的一种绘画形式
A让我们一步一步来思考。
文中明确提到了,除了阅读严谨但略显呆板的史书外,欣赏唐卡是一种惬意和愉悦的方式,可以让人与众多的古人对话,想象曾经的历史事件,体味藏族人丰富的精神世界,了解独特的藏族民俗。所以这段文字的主要意图是 (A) "唐卡可以给大家提供一种惬意轻松的了解西藏的方式"。所以答案是(A)。
Q:“知人论世”作为一种文学批评的原则和方法,最早由战国时期的思想家孟子提出.孟子认为,后人要交结古人,只是读其诗书是不行的,还必须了解他们的为人行事以及他们的生活的时代,这样,才能读懂古人的诗书,才能和古人心契神交,成为知音. 对这段话的理解,不正确的是?
(A) 人的心灵是可以互通和共鸣的
(B) “知人论世”作为一种文学评论发沿用至今并显现了强大的生命力
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A:让我们一步一步来思考。
从这段话中我们看不到B“知人论世”作为一种文学批评已经沿用至今并显示出强大的生命力。虽然“知人论世”确实是战国时期哲学家孟子提出的一种文学批评的原则和方法但这段话并没有提到“知人论世”在今天仍在使用也没有提到它已经显示出强大的生命力。因此选项B是一种错误的理解。所以答案是B

View File

@ -1,22 +0,0 @@
根据上下文选择正确答案
Q: 下列人物按时间先后顺序排序正确的是?选项:
(A) 秦始皇、诸葛亮、刘邦、白居易
(B) 诸葛亮、秦始皇、刘邦、白居易
(C) 秦始皇、刘邦、诸葛亮、白居易
(D) 白居易、诸葛亮、刘邦、秦始皇
A(C)
Q:下列描述年龄的词语按照年龄从小到大的顺序排序正确的是?选项:
(A) 不惑、而立、知天命、花甲
(B) 而立、不惑、知天命、花甲
(C) 花甲、知天命、而立、不惑
(D) 而立、花甲、不惑、知天命
A(B)
Q:下列制作老式棒棒糖的步骤正确的是?选项:
(A) 准备材料、将糖浆倒入模具、制作糖浆、冷却定型
(B) 准备材料、制作糖浆、将糖浆倒入模具、冷却定型
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A(B)

View File

@ -1,25 +0,0 @@
根据上下文选择正确答案
Q: 下列人物按时间先后顺序排序正确的是?选项:
(A) 秦始皇、诸葛亮、刘邦、白居易
(B) 诸葛亮、秦始皇、刘邦、白居易
(C) 秦始皇、刘邦、诸葛亮、白居易
(D) 白居易、诸葛亮、刘邦、秦始皇
A: Let's think step by step.
There are four characters mentioned in the options, among which Qin Shi Huang is from the Qin Dynasty, Zhuge Liang is from the Three Kingdoms period, Liu Bang is from the Han Dynasty period, and Bai Juyi is from the Tang Dynasty period. They are sorted in chronological order as Qin Dynasty, Han Dynasty, Three Kingdoms period, and Tang Dynasty. Therefore, the characters are sorted in chronological order as Qin Shi Huang, Liu Bang, Zhuge Liang, and Bai Juyi. So the answer is (C).
Q:下列描述年龄的词语按照年龄从小到大的顺序排序正确的是?选项:
(A) 不惑、而立、知天命、花甲
(B) 而立、不惑、知天命、花甲
(C) 花甲、知天命、而立、不惑
(D) 而立、花甲、不惑、知天命
A: Let's think step by step.
The options mention four words that describe age, among which "Erli" refers to 30 years old, "Bu Fu" refers to 40 years old, "Zhi Tian Ming" refers to 50 years old, and "Hua Jia" refers to 60 years old. Therefore, in order of age, they are Erli, Bu Fu, Zhi Tian Ming, and Hua Jia. So the answer is (B).
Q:下列制作老式棒棒糖的步骤正确的是?选项:
(A) 准备材料、将糖浆倒入模具、制作糖浆、冷却定型
(B) 准备材料、制作糖浆、将糖浆倒入模具、冷却定型
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A: Let's think step by step.
The title mentions the steps to make old-fashioned lollipops, and the options include "preparing materials", "pouring syrup into the mold", "making syrup", and "cooling and shaping". According to the steps to make old-fashioned lollipops, the first step should be to prepare the materials, then make syrup, pour syrup into the mold, and finally cool and shape. So the answer is (B).

View File

@ -1,62 +0,0 @@
根据上下文选择正确答案
I want you to act as a commonsense reasoning expert for Chinese.
Request: 下列人物按时间先后顺序排序正确的是?选项:
(A) 秦始皇、诸葛亮、刘邦、白居易
(B) 诸葛亮、秦始皇、刘邦、白居易
(C) 秦始皇、刘邦、诸葛亮、白居易
(D) 白居易、诸葛亮、刘邦、秦始皇
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: The following characters are correctly ordered in chronological order? Options:
(A) Qin Shi Huang, Zhuge Liang, Liu Bang, Bai Juyi
(B) Zhuge Liang, Qin Shi Huang, Liu Bang, Bai Ju Yi
(C) Qin Shi Huang, Liu Bang, Zhu Geliang, Bai Juyi
(D) Bai Juyi, Zhu Geliang, Liu Bang, Qin Shi Huang
Step-by-step answer:
1.There are four characters mentioned in the options, among which Qin Shi Huang is from the Qin Dynasty, Zhuge Liang is from the Three Kingdoms period, Liu Bang is from the Han Dynasty period, and Bai Juyi is from the Tang Dynasty period.
2.They are sorted in chronological order as Qin Dynasty, Han Dynasty, Three Kingdoms period, and Tang Dynasty.
3.Therefore, the characters are sorted in chronological order as Qin Shi Huang, Liu Bang, Zhuge Liang, and Bai Juyi.
So the answer is (C).
I want you to act as a commonsense reasoning expert for Chinese.
Request: 下列描述年龄的词语按照年龄从小到大的顺序排序正确的是?选项:
(A) 不惑、而立、知天命、花甲
(B) 而立、不惑、知天命、花甲
(C) 花甲、知天命、而立、不惑
(D) 而立、花甲、不惑、知天命
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: The following words describing age are correctly ordered from youngest to oldest age? Options:
(A) Unconfused, Established, Knowledge of life, Flowering age
(B) To stand, not to be confused, to know one's destiny, and to be in the prime of life.
(C) Being in the prime of life, knowing one's destiny, being in the prime of life, not being confused.
(D) to stand up, to grow old, to be unperturbed, to know one's destiny
Step-by-step answer:
1.The options mention four words that describe age, among which "Erli" refers to 30 years old, "Bu Fu" refers to 40 years old, "Zhi Tian Ming" refers to 50 years old, and "Hua Jia" refers to 60 years old.
2.Therefore, in order of age, they are Erli, Bu Fu, Zhi Tian Ming, and Hua Jia.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request: 下列制作老式棒棒糖的步骤正确的是?选项:
(A) 准备材料、将糖浆倒入模具、制作糖浆、冷却定型
(B) 准备材料、制作糖浆、将糖浆倒入模具、冷却定型
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: Which of the following steps is correct for making old-fashioned lollipops? Options:
(A) Preparing the ingredients, pouring the syrup into the molds, making the syrup, cooling to set the shape
(B) Prepare ingredients, make syrup, pour syrup into molds, cool to set
(C) Prepare ingredients, pour syrup into mold, cool and set, make syrup
(D) Prepare ingredients, cool and set, make syrup, pour syrup into molds
Step-by-step answer:
1.The title mentions the steps to make old-fashioned lollipops, and the options include "preparing materials", "pouring syrup into the mold", "making syrup", and "cooling and shaping".
2.According to the steps to make old-fashioned lollipops, the first step should be to prepare the materials, then make syrup, pour syrup into the mold, and finally cool and shape.
So the answer is (B).

View File

@ -1,25 +0,0 @@
根据上下文选择正确答案
Q: 下列人物按时间先后顺序排序正确的是?选项:
(A) 秦始皇、诸葛亮、刘邦、白居易
(B) 诸葛亮、秦始皇、刘邦、白居易
(C) 秦始皇、刘邦、诸葛亮、白居易
(D) 白居易、诸葛亮、刘邦、秦始皇
A让我们一步一步来思考。
选项中提到了4个人物其中秦始皇是秦朝的诸葛亮是三国时期的刘邦是汉朝时期的白居易是唐朝时期的按照时间先后排序为秦朝汉朝三国时期唐朝因此人物先后排序为秦始皇、刘邦、诸葛亮、白居易。所以答案是(C)。
Q:下列描述年龄的词语按照年龄从小到大的顺序排序正确的是?选项:
(A) 不惑、而立、知天命、花甲
(B) 而立、不惑、知天命、花甲
(C) 花甲、知天命、而立、不惑
(D) 而立、花甲、不惑、知天命
A让我们一步一步来思考。
选项中提到了四个描述年龄的词语其中“而立”指30岁“不惑”指40岁“知天命”指50岁“花甲”指60岁因此按年龄从小到大排序是而立、不惑、知天命、花甲。所以答案是(B)。
Q:下列制作老式棒棒糖的步骤正确的是?选项:
(A) 准备材料、将糖浆倒入模具、制作糖浆、冷却定型
(B) 准备材料、制作糖浆、将糖浆倒入模具、冷却定型
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A让我们一步一步来思考。
题目中提到了制作老式棒棒糖的步骤,选项中包含“准备材料”、“将糖浆倒入模具”、“制作糖浆”、“冷却定型”四个步骤,按照制作老式棒棒糖的步骤第一步应该是准备材料,然后需要制作糖浆,再把糖浆倒入模具,最后冷却定型。所以答案是(B)。

View File

@ -1,19 +0,0 @@
判断一句人为构造的关于体育的句子是否可信。
Q: 下面的句子可信吗? "运动员张培萌一传稳定如山,使得进攻端占据着主动"
选项:
(A)可信
(B)不可信
A: (B)
Q: 下面的句子可信吗? "运动员郭艾伦打入任意球"
选项:
(A)可信
(B)不可信
A: (B)
Q: 下面的句子可信吗? "运动员范智毅射门得分"
选项:
(A)可信
(B)不可信
A: (A)

View File

@ -1,22 +0,0 @@
判断一句人为构造的关于体育的句子是否可信。
Q: 下面的句子可信吗? "运动员张培萌一传稳定如山,使得进攻端占据着主动"
选项:
(A)可信
(B)不可信
A: Let's think step by step.
Zhang Peimeng is an athlete from China who competes in track and field. First pass is a term in volleyball. Since he is not a volleyball player, he won't be involved in the passing part of a volleyball game. So the answer is (B).
Q: 下面的句子可信吗? "运动员郭艾伦打入任意球"
选项:
(A)可信
(B)不可信
A: Let's think step by step.
Guo Ailun is a Chinese basketball player, and "free kick" is a term used in soccer. Since he is not a soccer player, he won't be involved in the free kick phase of a soccer match. Therefore, So the answer is (B).
Q: 下面的句子可信吗? "运动员范智毅射门得分"
选项:
(A)可信
(B)不可信
A: Let's think step by step.
Fan Zhiyi is a Chinese football player, and "scoring a goal" is a term used in football. Since he is a professional football player, he is very likely to score a goal. So the answer is (A).

View File

@ -1,56 +0,0 @@
判断一句人为构造的关于体育的句子是否可信。
I want you to act as a commonsense reasoning expert for Chinese.
Request下面的句子可信吗? "运动员张培萌一传稳定如山,使得进攻端占据着主动"
选项:
(A)可信
(B)不可信
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestIs the following sentence credible? "Zhang Peimeng's pass was as stable as a mountain, allowing the attacking end to take the initiative."
Option:
(A) Credible
(B) Not credible
Step-by-step answer:
1.Zhang Peimeng is an athlete from China who competes in track and field. First pass is a term in volleyball.
2.Since he is not a volleyball player, he won't be involved in the passing part of a volleyball game. So the answer is (B).
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request下面的句子可信吗? "运动员郭艾伦打入任意球"
选项:
(A)可信
(B)不可信
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestIs the following sentence credible? "Guo Ailun scored a free kick"
Option:
(A) Credible
(B) Not credible
Step-by-step answer:
1.Guo Ailun is a Chinese basketball player, and "free kick" is a term used in soccer.
2.Since he is not a soccer player, he won't be involved in the free kick phase of a soccer match.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request下面的句子可信吗? "运动员范智毅射门得分"
选项:
(A)可信
(B)不可信
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestIs the following sentence credible? "Fan Zhiyi's shooting score"
Option:
(A) Credible
(B) Not credible
Step-by-step answer:
1.Fan Zhiyi is a Chinese football player, and "scoring a goal" is a term used in football.
2.Since he is a professional football player, he is very likely to score a goal.
So the answer is (A).

View File

@ -1,22 +0,0 @@
判断一句人为构造的关于体育的句子是否可信。
Q: 下面的句子可信吗? "运动员张培萌一传稳定如山,使得进攻端占据着主动"
选项:
(A)可信
(B)不可信
A: 让我们一步一步来思考。
张培萌是一位中国的田径运动员,一传是排球运动术语。由于他并不是排球运动员,因此他不会参与到排球比赛中的传球环节。所以答案是(B)。
Q: 下面的句子可信吗? "运动员郭艾伦打入任意球"
选项:
(A)可信
(B)不可信
A: 让我们一步一步来思考。
郭艾伦是一位中国的篮球运动员,任意球是足球运动术语。由于他并不是足球运动员,因此他不会在参与到足球比赛的任意球环节,所以答案是(B)。
Q: 下面的句子可信吗? "运动员范智毅射门得分"
选项:
(A)可信
(B)不可信
A: 让我们一步一步来思考。
范智毅是一位中国的足球运动员,射门得分是足球运动术语。由于他是一名专业的足球运动员,因此他很可能射门得分,所以答案是(A)。

View File

@ -1,25 +0,0 @@
根据上下文选择正确答案。
Q小华在丙申年出生他的哥哥比他大6岁所以他的哥哥出生在哪一年
选项:
(A) 己卯年
(B) 庚寅年
(C) 丙申年
(D) 乙丑年
A(B)
Q如果今年是甲辰年李华的爷爷说“今年是我的知天命之年”请问赵婷爷爷的属相是什么
选项:
(A) 狗
(B) 虎
(C) 鸡
(D) 鼠
A(B)
Q小李在亥时三刻开始制作画画他知道他需要10个小时才能完成。那么他最早在什么时辰可以完成
选项:
(A) 辰时
(B) 卯时
(C) 午时
(D) 未时
A(A)

View File

@ -1,28 +0,0 @@
根据上下文选择正确答案。
Q小华在丙申年出生他的哥哥比他大6岁所以他的哥哥出生在哪一年
选项:
(A) 己卯年
(B) 庚寅年
(C) 丙申年
(D) 乙丑年
ALet's think step by step.
Xiaohua was born in the year of Bingshen, and his older brother was 6 years older than him. The sixth year before Bingshen was in the year of Gengyin, so his older brother was born in the year of Gengyin. So the answer is (B).
Q如果今年是甲辰年李华的爷爷说“今年是我的知天命之年”请问赵婷爷爷的属相是什么
选项:
(A) 狗
(B) 虎
(C) 鸡
(D) 鼠
ALet's think step by step.
The title mentions that Grandpa was born in the year of Jiayin, which is the year of the Tiger. In ancient China, the term "year of knowing the destiny of heaven" referred to the age of 50. Therefore, Grandpa is 50 years old this year, which is the year of Jiachen. According to the Chinese Tiangan Dizhi chronology, the year of Grandpa's birth is the year of Jiayin, which is the year of the Tiger. Therefore, Grandpa belongs to the Year of the Tiger. So the answer is (B).
Q小李在亥时三刻开始制作画画他知道他需要10个小时才能完成。那么他最早在什么时辰可以完成
选项:
(A) 辰时
(B) 卯时
(C) 午时
(D) 未时
ALet's think step by step.
According to the ancient Chinese timing method, the third quarter of the pig hour refers to approximately 21:45 minutes, and 10 hours later it is 7:45 minutes, which is the Dragon Hour . So the answer is (A).

View File

@ -1,68 +0,0 @@
根据上下文选择正确答案。
I want you to act as a commonsense reasoning expert for Chinese.
Request小华在丙申年出生他的哥哥比他大6岁所以他的哥哥出生在哪一年
选项:
(A) 己卯年
(B) 庚寅年
(C) 丙申年
(D) 乙丑年
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: Xiaohua was born in the year of Bingshen, and his brother is 6 years older than him, so in which year was his brother born?
Option:
(A) Ji Mao Year
(B) Gengyin Year
(C) Bingshen Year
(D) Yi Chou Year
Step-by-step answer:
1.Xiaohua was born in the year of Bingshen, and his older brother is 6 years older than him. According to the Chinese Tian Gan Di Zhi chronology, the sixth year before Bingshen is the year of Gengyin.
2.So his brother was born in the year of Gengyin.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request如果今年是甲辰年李华的爷爷说“今年是我的知天命之年”请问赵婷爷爷的属相是什么
选项:
(A) 狗
(B) 虎
(C) 鸡
(D) 鼠
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
Request: If this year is the year of Jiachen and Li Hua's grandfather says, "This year is my year of knowing fate." What is the zodiac sign of Grandpa Zhao Ting?
Option:
(A) Dogs
(B) Tiger
(C) Chicken
(D) Mouse
Step-by-step answer:
1.The title mentions that Grandpa is the "year of knowing destiny", and in ancient China, "year of knowing destiny" referred to the age of 50, so Grandpa is 50 years old this year.
2.This year is the year of Jiachen. According to the Chinese Tiangan and Dizhi chronology, the year my grandfather was born is the year of Jiayin.
3.The year of Jiayin is the year of the tiger, so my grandfather belongs to the tiger.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request小李在亥时三刻开始制作画画他知道他需要10个小时才能完成。那么他最早在什么时辰可以完成
选项:
(A) 辰时
(B) 卯时
(C) 午时
(D) 未时
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestXiao Li started making paintings at the third quarter of the pig hour, and he knew it would take him 10 hours to complete. So when can he finish it at the earliest?
Option:
(A) Dragon Hour
(B) Rabbit Hour
(C) Horse Hour
(D) Sheep Hour
Step-by-step answer:
1.According to the ancient Chinese timing method, the third quarter of the hour of Hai refers to approximately 21:45,
2.Ten hours later, it will be 7:45, which is Dragon Hour .
So the answer is (A).

View File

@ -1,28 +0,0 @@
根据上下文选择正确答案。
Q小华在丙申年出生他的哥哥比他大6岁所以他的哥哥出生在哪一年
选项:
(A) 己卯年
(B) 庚寅年
(C) 丙申年
(D) 乙丑年
A让我们一步一步来思考。
小华在丙申年出生他的哥哥比他大6岁按照中国天干地支纪年法丙申年之前的第六年是庚寅年所以他的哥哥在庚寅年出生。所以答案是(B)。
Q如果今年是甲辰年李华的爷爷说“今年是我的知天命之年”请问赵婷爷爷的属相是什么
选项:
(A) 狗
(B) 虎
(C) 鸡
(D) 鼠
A让我们一步一步来思考。
题目中提到爷爷是“知天命之年”中国古代“知天命之年”代指50岁所以爷爷今年50岁今年是甲辰年根据中国天干地支纪年法推算爷爷出生那年是甲寅年甲寅年是虎年,所以爷爷属虎。所以答案是(B)。
Q小李在亥时三刻开始制作画画他知道他需要10个小时才能完成。那么他最早在什么时辰可以完成
选项:
(A) 辰时
(B) 卯时
(C) 午时
(D) 未时
A让我们一步一步来思考。
按照中国古代计时方法亥时三刻大约指的是21点45分10个小时后是7点45分是辰时。所以答案是(A)。

View File

@ -1,22 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着收音机说话。
选项:
(A) 是
(B) 否
A(B)
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着他的 iPhone 说话。
选项:
(A) 是
(B) 否
A(A)
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
没有什么比萨莉·海明斯边看 HBO 的《真探》边织毛衣更令人满足。
选项:
(A) 是
(B) 否
A(A)

View File

@ -1,25 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着收音机说话。
选项:
(A) 是
(B) 否
A: Let's think step by step.
The statement mentions “the Allied bombardment of the beaches of Iwo Jima,” which refers to a historical event during World War II. The use of radios for communication among military personnel during that time is accurate and appropriate. So the answer is (B).
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着他的 iPhone 说话。
选项:
(A) 是
(B) 否
A: Let's think step by step.
The statement mentions “the Allied bombardment of the beaches of Iwo Jima,” which refers to a historical event during World War II. However, the mention of Ralph speaking loudly into his iPhone introduces an anachronism.The iPhone is a modern-day smartphone that was not available during the time of the Allied bombardment of Iwo Jima in 1945. So the answer is (A).
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
没有什么比萨莉·海明斯边看 HBO 的《真探》边织毛衣更令人满足。
选项:
(A) 是
(B) 否
A: Let's think step by step.
The statement mentions Sally Hemings, who was an enslaved woman in the United States during the late 18th and early 19th centuries. However, the mention of watching HBOs True Detective, which is a modern television show, introduces an anachronism. During Sally Hemings time, television did not exist, and the specific mention of watching a specific show like True Detective is clearly out of place for that historical period. So the answer is (A).

View File

@ -1,61 +0,0 @@
识别给定陈述是否包含时代错误。
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着收音机说话。
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestHow would a person with compulsory education and basic common sense answer whether the following statement contains an anachronism?
During the landing operations on Iwo Jima, Ralph spoke loudly into the radio.
Options:
(A) Yes
(B) No
Step-by-step answer:
1.The statement mentions “the Allied bombardment of the beaches of Iwo Jima,” which refers to a historical event during World War II.
2.The use of radios for communication among military personnel during that time is accurate and appropriate.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着他的 iPhone 说话。
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestHow would a person with compulsory education and basic common sense answer whether the following statement contains an anachronism?
During the landing operations on Iwo Jima, Ralph spoke loudly into the radio.
Options:
(A) Yes
(B) No
Step-by-step answer:
1.The statement mentions “the Allied bombardment of the beaches of Iwo Jima,” which refers to a historical event during World War II.
2.However, the mention of Ralph speaking loudly into his iPhone introduces an anachronism.The iPhone is a modern-day smartphone that was not available during the time of the Allied bombardment of Iwo Jima in 1945.
So the answer is (A).
I want you to act as a commonsense reasoning expert for Chinese.
Request以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
没有什么比萨莉·海明斯边看 HBO 的《真探》边织毛衣更令人满足。
选项:
(A) 是
(B) 否
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestHow would a person with compulsory education and basic common sense answer whether the following statement contains an anachronism?
During the landing operations on Iwo Jima, Ralph spoke loudly into the radio.
Options:
(A) Yes
(B) No
Step-by-step answer:
1.The statement mentions Sally Hemings, who was an enslaved woman in the United States during the late 18th and early 19th centuries.
2.However, the mention of watching HBOs True Detective, which is a modern television show, introduces an anachronism. During Sally Hemings time, television did not exist, and the specific mention of watching a specific show like True Detective is clearly out of place for that historical period.
So the answer is (A).

View File

@ -1,25 +0,0 @@
识别给定陈述是否包含时代错误。
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着收音机说话。
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了“硫磺岛登陆作战期间”,这是指二战期间的一个历史事件。在当时,军事人员之间使用收音机进行通信是准确和恰当的。所以答案是(B)。
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
在硫磺岛登陆作战期间,拉尔夫大声对着他的 iPhone 说话。
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了“硫磺岛登陆作战期间”,这是指二战期间的一个历史事件。然而,提到拉尔夫大声对着他的 iPhone 说话引入了一个时代错误。iPhone 是现代的智能手机,在 1945 年硫磺岛登陆作战期间是不存在的。所以答案是(A)。
Q以下陈述是否包含时代错误一个接受了义务教育、具备基本常识的人会如何回答
没有什么比萨莉·海明斯边看 HBO 的《真探》边织毛衣更令人满足。
选项:
(A) 是
(B) 否
A让我们一步一步来思考。
这个陈述提到了萨莉·海明斯,她是 18 世纪末到 19 世纪初美国的一个被奴役的女性。然而,提到她边看 HBO 的《真探》边织毛衣引入了一个时代错误。在萨莉·海明斯所处的时代,电视是不存在的,而且具体提到观看像《真探》这样的特定节目在那个历史时期显然是不合适的。所以答案是(A)。

View File

@ -1,25 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《辛德勒的名单》、《阿波罗13号》类似的电影
选项:
(A)《星际迷航》
(B)《我盛大的希腊婚礼2》
(C)《圣诞老人2》
(D)《与狼共舞》
A: (D)
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《阿波罗13号》、《与狼共舞》类似的电影
选项:
(A)《蝙蝠侠:突袭阿卡姆》
(B)《肖申克的救赎》
(C)《玩具总动员》
(D)《狮子王》
A: (B)
Q: 寻找一部与《惊世骇案》、《勇敢的心》、《低俗小说》、《辛德勒的名单》类似的电影:
选项:
(A)《卡里加里博士的小屋》
(B)《肖申克的救赎》
(C)《蜘蛛侠2》
(D)《出租车》
A: (B)

View File

@ -1,40 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《辛德勒的名单》、《阿波罗13号》类似的电影
选项:
(A)《星际迷航》
(B)《我盛大的希腊婚礼2》
(C)《圣诞老人2》
(D)《与狼共舞》
A: Let's think step by step.
Star Trek is a science fiction film that, despite its depth and complexity, has significant differences in theme and style from the four aforementioned films.
My Grand Greek Wedding 2 is a light hearted comedy film that differs significantly from the themes and styles of the four aforementioned films.
Santa Claus 2 is a family movie with a Christmas theme, which differs significantly from the themes and styles of the four aforementioned movies.
"Dancing with Wolves" is a film that depicts the relationship between Native Americans and the American West during its pioneering period
The theme and style of "Dancing with Wolves" are consistent with the four films mentioned above. This movie, like Brave Heart, Pretty Woman, Schindler's List, and Apollo 13, is a historical film with depth and seriousness. So the answer is (D).
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《阿波罗13号》、《与狼共舞》类似的电影
选项:
(A)《蝙蝠侠:突袭阿卡姆》
(B)《肖申克的救赎》
(C)《玩具总动员》
(D)《狮子王》
A: Let's think step by step.
Batman: Assault on Arkham is a superhero film with significant differences in theme and style from the four aforementioned films.
Shawshank Redemption is a 1994 American drama film directed by Frank Delabond and starring Tim Robbins and Morgan Freeman. It is a film about hope and perseverance.
Toy Story is an animated film, although it may have some themes of adventure and friendship, its themes and style differ significantly from the four aforementioned films.
Although Lion King is a classic animated film that covers themes of courage and growth, its themes and style differ significantly from the four aforementioned films.
The Shawshank Redemption, like Brave Heart, Pretty Woman, Apollo 13, and Dancing with Wolves, is a film with depth and seriousness, and its theme and style are similar to the other three films. So the answer is (B).
Q: 寻找一部与《惊世骇案》、《勇敢的心》、《低俗小说》、《辛德勒的名单》类似的电影:
选项:
(A)《卡里加里博士的小屋》
(B)《肖申克的救赎》
(C)《蜘蛛侠2》
(D)《出租车》
A: Let's think step by step.
"Dr. Caligary's Cabin" is a 1920 German expressionist silent film directed by Robert Wiener. This film is often considered a milestone in German expressionist cinema and one of the earliest horror films.
Shawshank Redemption is a 1994 American drama film directed by Frank Delabond and starring Tim Robbins and Morgan Freeman. It is a film about hope and perseverance.
Spider Man 2 is a 2004 American superhero film directed by Sam Remy and starring Toby Maguire. It is the second installment of the Spider Man trilogy.
"Taxi" is a 2004 American comedy action film directed by Tim Storey, starring Jimmy Flanders and Quentin Latafa. This movie is an American remake of a 1998 French film.
And the titles of "The Amazing Case", "Brave Heart", "Pulp Fiction", and "Schindler's List" are all very profound, plot rich, and have strong human themes in movies. They have all won high praise from audiences and critics for their excellent scripts, brilliant performances, and profound themes. The Shawshank Redemption tells the story of a wrongly accused banker who maintains hope in prison and ultimately escapes. The plot of this movie is deeply ingrained in people's hearts, with a profound portrayal of human nature, and there are many similarities with the movie in the title. So the answer is (B).

View File

@ -1,76 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
I want you to act as a commonsense reasoning expert for Chinese.
Request寻找一部与《勇敢的心》、《风月俏佳人》、《辛德勒的名单》、《阿波罗13号》类似的电影
选项:
(A)《星际迷航》
(B)《我盛大的希腊婚礼2》
(C)《圣诞老人2》
(D)《与狼共舞》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestFind a movie similar to Braveheart, Pretty Woman, Schindler's List, Apollo 13:
Options:
(A) Star Trek
(B) My Big Fat Greek Wedding 2
(C) The Santa Clause 2
(D) Dances with Wolves
Step-by-step answer:
1.Star Trek is a science fiction film that, despite its depth and complexity, has significant differences in theme and style from the four aforementioned films.
2.My Big Fat Greek Wedding 2 is a light hearted comedy film that differs significantly from the themes and styles of the four aforementioned films.
3.Santa Claus 2 is a family movie with a Christmas theme, which differs significantly from the themes and styles of the four aforementioned movies.
4.Dancing with Wolves is a film that depicts the relationship between Native Americans and the American West during its pioneering period
5.The theme and style of "Dancing with Wolves" are consistent with the four films mentioned above. This movie, like Brave Heart, Pretty Woman, Schindler's List, and Apollo 13, is a historical film with depth and seriousness.
So the answer is (D).
I want you to act as a commonsense reasoning expert for Chinese.
Request寻找一部与《勇敢的心》、《风月俏佳人》、《阿波罗13号》、《与狼共舞》类似的电影
选项:
(A)《蝙蝠侠:突袭阿卡姆》
(B)《肖申克的救赎》
(C)《玩具总动员》
(D)《狮子王》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestFind a movie similar to Braveheart, Pretty Woman, Apollo 13, Dances with Wolves:
Options:
(A) Batman Assault on Arkham
(B) The Shawshank Redemption
(C) Toy Story
(D) The Lion King
Step-by-step answer:
1.Batman: Assault on Arkham is a superhero film with significant differences in theme and style from the four aforementioned films.
2.Shawshank Redemption is a 1994 American drama film directed by Frank Delabond and starring Tim Robbins and Morgan Freeman. It is a film about hope and perseverance.
3.Toy Story is an animated film, although it may have some themes of adventure and friendship, its themes and style differ significantly from the four aforementioned films.
4.Although Lion King is a classic animated film that covers themes of courage and growth, its themes and style differ significantly from the four aforementioned films.
5.The Shawshank Redemption, like Brave Heart, Pretty Woman, Apollo 13, and Dancing with Wolves, is a film with depth and seriousness, and its theme and style are similar to the other three films.
So the answer is (B).
I want you to act as a commonsense reasoning expert for Chinese.
Request寻找一部与《惊世骇案》、《勇敢的心》、《低俗小说》、《辛德勒的名单》类似的电影
选项:
(A)《卡里加里博士的小屋》
(B)《肖申克的救赎》
(C)《蜘蛛侠2》
(D)《出租车》
You should retell the request in English.
You should do the answer step by step to choose the right answer.
You should step-by-step answer the request.
You should tell me the answer in this format 'So the answer is'.
RequestFind a movie similar to The Usual Suspects, Braveheart, Pulp Fiction, Schindler's List:
Options:
(A) The Cabinet of Dr Caligari
(B) The Shawshank Redemption
(C) Spider-Man 2
(D) Taxi
Step-by-step answer:
1."Dr. Caligary's Cabin" is a 1920 German expressionist silent film directed by Robert Wiener. This film is often considered a milestone in German expressionist cinema and one of the earliest horror films.
2.Shawshank Redemption is a 1994 American drama film directed by Frank Delabond and starring Tim Robbins and Morgan Freeman. It is a film about hope and perseverance.
3.Spider Man 2 is a 2004 American superhero film directed by Sam Remy and starring Toby Maguire. It is the second installment of the Spider Man trilogy.
4."Taxi" is a 2004 American comedy action film directed by Tim Storey, starring Jimmy Flanders and Quentin Latafa. This movie is an American remake of a 1998 French film.
5.And the titles of "The Amazing Case", "Brave Heart", "Pulp Fiction", and "Schindler's List" are all very profound, plot rich, and have strong human themes in movies. They have all won high praise from audiences and critics for their excellent scripts, brilliant performances, and profound themes. The Shawshank Redemption tells the story of a wrongly accused banker who maintains hope in prison and ultimately escapes. The plot of this movie is deeply ingrained in people's hearts, with a profound portrayal of human nature, and there are many similarities with the movie in the title.
So the answer is (B).

View File

@ -1,40 +0,0 @@
给根据给定艺术作品清单,找出最类似的。
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《辛德勒的名单》、《阿波罗13号》类似的电影
选项:
(A)《星际迷航》
(B)《我盛大的希腊婚礼2》
(C)《圣诞老人2》
(D)《与狼共舞》
A: 让我们一步一步来思考。
《星际迷航》是一部科幻电影,虽然它具有深度和复杂性,但它的主题和风格与上述四部电影有较大的差异。
《我盛大的希腊婚礼2》是一部轻松的喜剧电影与上述四部电影的主题和风格有较大的差异。
《圣诞老人2》是一部圣诞主题的家庭电影与上述四部电影的主题和风格也有较大的差异。
《与狼共舞》是一部描绘美国西部开拓时期与原住民关系的电影.
其中《与狼共舞》的主题和风格与上述四部电影相符。这部电影与《勇敢的心》、《风月俏佳人》、《辛德勒的名单》、《阿波罗13号》一样都是历史题材的电影具有深度和严肃性。所以答案是(D)。
Q: 寻找一部与《勇敢的心》、《风月俏佳人》、《阿波罗13号》、《与狼共舞》类似的电影
选项:
(A)《蝙蝠侠:突袭阿卡姆》
(B)《肖申克的救赎》
(C)《玩具总动员》
(D)《狮子王》
A: 让我们一步一步来思考。
《蝙蝠侠:突袭阿卡姆》是一部超级英雄电影,它的主题和风格与上述四部电影有较大的差异。
《肖申克的救赎》是一部1994年的美国剧情电影由弗兰克·德拉邦特执导蒂姆·罗宾斯和摩根·弗里曼主演这是一部讲述希望和毅力的电影。
《玩具总动员》是一部动画电影,虽然它可能具有一些冒险和友谊的主题,但其主题和风格与上述四部电影有较大的差异。
《狮子王》虽然是一部经典的动画电影,涵盖了勇气和成长的主题,但其主题和风格与上述四部电影有较大的差异。
其中《肖申克的救赎》这部电影与《勇敢的心》、《风月俏佳人》、《阿波罗13号》、《与狼共舞》一样都是具有深度和严肃性的电影并且主题和风格与其他三部电影相似。所以答案是(B)。
Q: 寻找一部与《惊世骇案》、《勇敢的心》、《低俗小说》、《辛德勒的名单》类似的电影:
选项:
(A)《卡里加里博士的小屋》
(B)《肖申克的救赎》
(C)《蜘蛛侠2》
(D)《出租车》
A: 让我们一步一步来思考。
《卡里加里博士的小屋》是一部1920年的德国表现主义默片由罗伯特·维内执导。这部电影通常被认为是德国表现主义电影的一部里程碑式作品也是最早的恐怖电影之一。
《肖申克的救赎》是一部1994年的美国剧情电影由弗兰克·德拉邦特执导蒂姆·罗宾斯和摩根·弗里曼主演这是一部讲述希望和毅力的电影。
《蜘蛛侠2》是一部2004年的美国超级英雄电影由萨姆·雷米执导托比·马奎尔主演是《蜘蛛侠》三部曲的第二部。
《出租车》这是一部2004年的美国喜剧动作片由蒂姆·斯托瑞执导吉米·福兰和昆汀·拉塔法主演。这部电影是1998年法国电影的美国翻拍版。
而题目中《惊世骇案》、《勇敢的心》、《低俗小说》和《辛德勒的名单》都是一些非常深刻、情节丰富且具有强烈人性主题的电影。它们都以其出色的剧本、精彩的表演和深刻的主题赢得了观众和评论家的高度赞誉。选项中《肖申克的救赎》讲述了一名被冤枉的银行家如何在监狱中保持希望,并最终逃脱的故事。这部电影的情节深入人心,人性描绘深刻,与题目中的电影有许多相似之处。所以答案是(B)。

View File

@ -1,25 +0,0 @@
请根据题目中两句话的关系选择正确答案。
Q:语句一:可是老人小心翼翼将蛇挑开,让它爬向草丛,嘴里念念有词:罪过,罪过,这本来是你的家呀
语句二:老人心里十分难过。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(A)
Q:语句一:她是一个有着丰满的脸、丰满的嘴唇和大牙齿的黑色爆炸头女人。
语句二:她喜欢抹红色的口红,穿红色的衣服。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(C)
Q:语句一:你不确定你已经清楚你站着谁的一面。
语句二:你支持谁,这一点显而易见。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A:(B)

View File

@ -1,28 +0,0 @@
请根据题目中两句话的关系选择正确答案。
Q:语句一:可是老人小心翼翼将蛇挑开,让它爬向草丛,嘴里念念有词:罪过,罪过,这本来是你的家呀
语句二:老人心里十分难过。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence describes the process of an old man carefully handling a snake's movements. The old man says "sin, sin, sin," indicating that he feels guilty and sad for violating the snake's territory. The second sentence can be inferred, the old man is very sad in his heart. Therefore, the two sentences contain a relationship. So the answer is (A).
Q:语句一:她是一个有着丰满的脸、丰满的嘴唇和大牙齿的黑色爆炸头女人。
语句二:她喜欢抹红色的口红,穿红色的衣服。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
These two sentences both describe the same woman, but they focus on different characteristics. The first sentence describes her physical characteristics, including face, lips, teeth, and hairstyle. The second sentence describes her aesthetic preferences, including lipstick color and clothing color. These two sentences do not have any obvious implication or contradictory relationship, so we can say that they are unrelated. So the answer is (C).
Q:语句一:你不确定你已经清楚你站着谁的一面。
语句二:你支持谁,这一点显而易见。
请问这两句话什么关系?
(A) 蕴含
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence indicates that you are not sure who you support, while the second sentence clearly indicates that your position is obvious, which means you are clear about who you support. Therefore, the content of these two sentences is contradictory to each other. So the answer is (B).

Some files were not shown because too many files have changed in this diff Show More