Songyang Zhang
|
fb43dd1906
|
[Update] Update Skywork/Qwen-QwQ (#1728)
* Update JudgerBench
* Support O1-style Prompts
* Update Code
* Update OpenAI
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update
|
2024-12-05 19:30:43 +08:00 |
|
Junnan Liu
|
6181ac1122
|
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation (#1730)
* upload dataset definitions & configs
* add single dataset split specific metrics
* add k-pass@threshold & MATH500
|
2024-12-05 16:54:16 +08:00 |
|
Linchen Xiao
|
ac23f0ce1f
|
[Update] Update init file for Korbench (#1737)
|
2024-12-05 11:26:00 +08:00 |
|
Linchen Xiao
|
9de27b4d85
|
[Update] Update max_out_len for datasets (#1726)
* [Update] Update max_out_len for datasets
* Update eval_regression_chat_objective_fullbench.py
* Update eval_regression_chat.py
* Update eval_regression_chat.py
* Update oc_score_baseline_fullbench.yaml
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
|
2024-12-02 11:42:07 +08:00 |
|
Junnan Liu
|
fe6d76fb13
|
[Feature] Support LiveMathBench (#1727)
|
2024-11-30 00:07:19 +08:00 |
|
liushz
|
c437135fad
|
[Feature] Add OpenAI SimpleQA dataset (#1720)
* Add OpenAI SimpleQA dataset
* Add OpenAI SimpleQA dataset
* Add OpenAI SimpleQA dataset
* Update eval_simpleqa.py
---------
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
|
2024-11-28 19:16:07 +08:00 |
|
wanyu2018umac
|
90efcf2216
|
[Feature] Add P-MMEval (#1714)
* Update with PMMEval
* Update
* Update __init__.py
* Fix Bugs
* Delete .pre-commit-config.yaml
* Pull merge
---------
Co-authored-by: liushz <qq1791167085@163.com>
|
2024-11-27 21:26:18 +08:00 |
|
Junnan Liu
|
f7dbe6bb7d
|
[Feature] Add Arc Prize Public Evaluation (#1690)
* support arc prize
* update arc-prize dataset info & update arc-prize evaluation performance
|
2024-11-27 15:44:41 +08:00 |
|
Linchen Xiao
|
ef695e28e5
|
[Bug] Fix Korbench dataset module (#1717)
|
2024-11-26 17:13:28 +08:00 |
|
Songyang Zhang
|
f97c4eae42
|
[Update] Update Fullbench (#1712)
* Update JudgerBench
* Support O1-style Prompts
* Update Code
|
2024-11-26 14:26:55 +08:00 |
|
Yufeng Zhao
|
300adc31e8
|
[Feature] Add Korbench dataset (#1713)
* first version for korbench
* first stage for korbench
* korbench_1
* korbench_1
* korbench_1
* korbench_1
* korbench_1_revised
* korbench_combined_1
* korbench_combined_1
* kor_combined
* kor_combined
* update
---------
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
|
2024-11-25 20:11:27 +08:00 |
|
liushz
|
e49fcfd3a3
|
[Update] Update MATH dataset with model judge (#1711)
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
|
2024-11-25 15:14:55 +08:00 |
|
Linchen Xiao
|
ab8fdbbaab
|
[Update] Update Math auto-download data (#1700)
|
2024-11-18 20:24:35 +08:00 |
|
abrohamLee
|
e9e4b69ddb
|
[Feature] MuSR Dataset Evaluation (#1689)
* MuSR Dataset Evaluation
* MuSR Dataset Evaluation
Add an assertion and a Readme.md
|
2024-11-14 20:42:12 +08:00 |
|
Linchen Xiao
|
e92a5d4230
|
[Feature] BABILong Dataset added (#1684)
* update
* update
* update
* update
|
2024-11-14 15:32:43 +08:00 |
|
Linchen Xiao
|
a0ef2fd3b4
|
[Update] Dingo Dataset update (#1670)
* [Update] Dingo Dataset update
* update
|
2024-11-08 14:38:43 +08:00 |
|
Linchen Xiao
|
835bf75a36
|
[Feature] Add long context evaluation for base models (#1666)
* [Update] Add base long context evaluation
* update
|
2024-11-08 10:53:29 +08:00 |
|
liushz
|
f7d899823c
|
[Update] Update mmmlu_lite dataload (#1658)
* update mmmlu_lite dataload from oss
* update mmmlu_lite dataload from oss
|
2024-11-01 17:32:29 +08:00 |
|
Songyang Zhang
|
c789ce5698
|
[Fix] Fix the automatic download for several datasets (#1652)
* [Fix] Fix the automatic download for several datasets
* Update
* Update
* Update CI
|
2024-11-01 15:57:18 +08:00 |
|
bittersweet1999
|
a0853c939d
|
[Add] Add CompassArenaSubjectiveBench (#1645)
* fix pip version
* fix pip version
* add compassarenasubjectivebench
* add compassarenasubjectivebench
* add compassarenabench
|
2024-11-01 13:52:22 +08:00 |
|
Linchen Xiao
|
df57c08ccf
|
[Feature] Update Models, Summarizers (#1600)
|
2024-10-29 18:37:15 +08:00 |
|
Junnan Liu
|
645c5f3b2c
|
[Datasets] Add datasets CMO&AIME (#1610)
* add datasets cmo&aime
* delete unused modules
* modify prompt
* update __init__
* update data load and add README
* update data load
* update performance
* update md5
* remove indents
* add indent
* fix log for debug mode
|
2024-10-28 18:08:02 +08:00 |
|
Linchen Xiao
|
a61e8a0803
|
[Update] Add internal HumanEval (#1641)
* [Update] internal_humaneval_add
* update
|
2024-10-25 19:08:42 +08:00 |
|
Linchen Xiao
|
662dddf41a
|
[Update] Add internal humaneval postprocess (#1636)
|
2024-10-24 17:45:21 +08:00 |
|
Songyang Zhang
|
a4d5a6c81b
|
[Feature] Support LiveCodeBench (#1617)
* Update
* Update LCB
* Update
* Update
* Update
* Update
* Update
|
2024-10-21 20:50:39 +08:00 |
|
Chenguang Li
|
5868d5afa4
|
[Bug] Fix NPU support (#1618)
* Fix NPU support
* formatting
---------
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
|
2024-10-21 17:42:53 +08:00 |
|
Bob Tsang
|
dd0b655bd0
|
[Feature] Support MMMLU & MMMLU-lite Benchmark (#1565)
* rm folder
* modify format according to reviewer
* modify format according to reviewer
* modify format according to reviewer
* add some files requirement
* fix some bug
* fix bug
* change load type
* Update MMMLU Dataset
* Update MMMLU Dataset
* Add MMMLU-Lite Dataset
* update MMMLU dataset
* update MMMLU dataset
* update MMMLU dataset
---------
Co-authored-by: BobTsang <BobTsang1995@gmail.com>
Co-authored-by: liushz <qq1791167085@163.com>
|
2024-10-17 19:09:34 +08:00 |
|
bittersweet1999
|
f0d436496e
|
[Update] update docs and add compassarena (#1614)
* fix pip version
* fix pip version
* update docs and add compassarena
* update docs
|
2024-10-17 14:39:06 +08:00 |
|
Haoran Que
|
4fe251729b
|
Upload HelloBench (#1607)
* upload hellobench
* update hellobench
* update readme.md
* update eval_hellobench.py
* update latest
---------
Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
|
2024-10-15 17:11:37 +08:00 |
|
bittersweet1999
|
fa54aa62f6
|
[Feature] Add Judgerbench and reorg subeval (#1593)
* fix pip version
* fix pip version
* update (#1522)
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* [Feature] Update Models (#1518)
* Update Models
* Update
* Update humanevalx
* Update
* Update
* [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527)
add judgerbench and reorg sub
add judgerbench and reorg subeval
add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
|
2024-10-15 16:36:05 +08:00 |
|
liushz
|
5faee929db
|
[Feature] Add GaoKaoMath Dataset for Evaluation & MATH Model Eval Config (#1589)
* Add GaoKaoMath Dataset
* Add MATH LLM Eval
* Update GAOKAO Math Eval Dataset
* Update GAOKAO Math Eval Dataset
|
2024-10-12 19:13:06 +08:00 |
|
bittersweet1999
|
3f7a3730d7
|
[Fix] fix Flames (#1599)
* fix pip version
* fix pip version
* fix flames
* fix flames
|
2024-10-12 14:34:59 +08:00 |
|
Linchen Xiao
|
763d7755b6
|
[Bug] GaokaoBench dataset fix (#1583)
|
2024-09-30 15:13:26 +08:00 |
|
shijinpjlab
|
7528b8ab8a
|
[Feature] Add dingo test (#1529)
* add qa dingo
* update
* change name qa to dingo
* eval model: llm_base
* update path
* change name and move path
* add eval_dingo
* update import
* add for pip
* add dingo package
* change import place
* update import place
* fix lint fail
* isort
* double quoted
---------
Co-authored-by: sj <shijin@pjlab.org.cn>
|
2024-09-29 19:24:58 +08:00 |
|
liushz
|
c9a7026f59
|
[Feature] Update MathBench & WikiBench for FullBench (#1521)
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
* Update GPQA & MMLU_Pro
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
---------
Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
|
2024-09-18 14:35:30 +08:00 |
|
Songyang Zhang
|
6997990c93
|
[Feature] Update Models (#1518)
* Update Models
* Update
* Update humanevalx
* Update
* Update
|
2024-09-12 23:35:30 +08:00 |
|
bittersweet1999
|
7c7fa36235
|
[Feature] Add support for internal FollowBench (#1511)
* fix pip version
* fix pip version
* add internal followbench
* add internal followbench
* fix lint
* fix lint
|
2024-09-11 13:32:34 +08:00 |
|
Linchen Xiao
|
87ffa71d68
|
[Feature] Longbench dataset update
|
2024-09-06 15:50:12 +08:00 |
|
Hari Seldon
|
faf5260155
|
[Feature] Optimize Evaluation Speed of SciCode (#1489)
* update scicode
* update comments
* remove redundant variable
* Update
---------
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
|
2024-09-06 00:59:41 +08:00 |
|
Linchen Xiao
|
6c9cd9a260
|
[Feature] Needlebench auto-download update (#1480)
* update
* update
* update
|
2024-09-05 17:22:42 +08:00 |
|
Linchen Xiao
|
9693be46b7
|
[Feature] MMLU-Pro auto-download (#1464)
* update
* update
* update
* update
* update
|
2024-08-30 10:03:40 +08:00 |
|
Linchen Xiao
|
245664f4c0
|
[Feature] Fullbench v0.1 language update (#1463)
* update
* update
* update
* update
|
2024-08-28 14:01:05 +08:00 |
|
Songyang Zhang
|
7c2d25b557
|
[Fix] Update SciCode and Gemma model (#1449)
* [Fix] Update SciCode and Gemma model
* Update
* Update
|
2024-08-23 10:42:27 +08:00 |
|
Hari Seldon
|
14b4b735cb
|
[Feature] Add support for SciCode (#1417)
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode w/ bg
* add scicode
* Update README.md
* Update README.md
* Delete configs/eval_SciCode.py
* rename
* 1
* rename
* Update README.md
* Update scicode.py
* Update scicode.py
* fix some bugs
* Update
* Update
---------
Co-authored-by: root <HariSeldon0>
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
|
2024-08-22 13:42:25 +08:00 |
|
Linchen Xiao
|
a4b54048ae
|
[Feature] Add Ruler datasets (#1310)
* [Feature] Add Ruler datasets
* pre-commit fixed
* Add model specific tokenizer to dataset
* pre-commit modified
* remove unused import
* fix linting
* add trust_remote to tokenizer load
* lint fix
* comments resolved
* fix lint
* Add readme
* Fix lint
* ruler refactorize
* fix lint
* lint fix
* updated
* lint fix
* fix wonderwords import issue
* prompt modified
* update
* readme updated
* update
* ruler dataset added
* Update
---------
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
|
2024-08-20 11:40:11 +08:00 |
|
Linchen Xiao
|
ecf9bb3e4c
|
[Bug] Commonsenseqa dataset fix (#1425)
* longbench dataset load fix
* update
* Update
* Update
* Update
* update
* update
---------
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
|
2024-08-16 15:54:07 +08:00 |
|
Songyang Zhang
|
9b3613f10b
|
[Update] Support auto-download of FOFO/MT-Bench-101 (#1423)
* [Update] Support auto-download of FOFO/MT-Bench-101
* Update wildbench
|
2024-08-16 11:57:41 +08:00 |
|
Linchen Xiao
|
2596f226f4
|
[Fix] longbench dataset load fix (#1422)
|
2024-08-15 11:30:30 +08:00 |
|
Linchen Xiao
|
8e55c9c6ee
|
[Update] Compassbench v1.3 (#1396)
* stash files
* compassbench subjective evaluation added
* evaluation update
* fix lint
* update docs
* Update lint
* changes saved
* changes saved
* CompassBench subjective summarizer added (#1349)
* subjective summarizer added
* fix lint
[Fix] Fix MathBench (#1351)
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
[Update] Update model support list (#1353)
* fix pip version
* fix pip version
* update model support
subjective summarizer updated
knowledge, math objective done (data need update)
remove secrets
objective changes saved
knowledge data added
* secrets removed
* changed added
* summarizer modified
* summarizer modified
* compassbench coding added
* fix lint
* objective summarizer updated
* compass_bench_v1.3 updated
* update files in config folder
* remove unused model
* lcbench modified
* removed model evaluation configs
* remove duplicated sdk implementation
---------
Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
|
2024-08-12 19:09:19 +08:00 |
|
yaoyingyy
|
decb621ff6
|
[Fix] Fix the issue where scores are negative in the LawBench dataset evaluation (#1402) (#1403)
|
2024-08-08 16:08:26 +08:00 |
|