Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model ( #1751 )
...
* fix lint issues
* updated gitignore
* changed infer_order from random to double for the pairwise_judge.py (not changing for pairwise_bt_judge.py)
* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summarizer ( #1758 )
...
add new dataset summarizer
2024-12-13 09:50:43 +08:00
OpenStellarTeam
1a5b3fc11e
Add Chinese SimpleQA config ( #1697 )
...
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
---------
Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-12-11 18:03:39 +08:00
bittersweet1999
54c0fb7a93
[Change] Change Compassarena metric ( #1749 )
...
* fix pip version
* fix pip version
* fix summarizer bug
* fix compassarena
* fix compassarena
* fix compassarena
2024-12-10 14:45:32 +08:00
Songyang Zhang
0d8df541bc
[Update] Update O1-style Benchmark and Prompts ( #1742 )
...
* Update JudgerBench
* Support O1-style Prompts
* Update Code
* Update OpenAI
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update
* Update
* Update
* Update
2024-12-09 13:48:56 +08:00
Junnan Liu
f333be177c
[Update] Add MATH500 & AIME2024 to LiveMathBench ( #1741 )
...
* upload dataset definitions & configs
* add single dataset split specific metrics
* add k-pass@threshold & MATH500
* update std computation & k-pass computation
* add AIME2024
* update README
2024-12-06 14:36:49 +08:00
Songyang Zhang
fb43dd1906
[Update] Update Skywork/Qwen-QwQ ( #1728 )
...
* Update JudgerBench
* Support O1-style Prompts
* Update Code
* Update OpenAI
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update
2024-12-05 19:30:43 +08:00
Junnan Liu
6181ac1122
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation ( #1730 )
...
* upload dataset definitions & configs
* add single dataset split specific metrics
* add k-pass@threshold & MATH500
2024-12-05 16:54:16 +08:00
Linchen Xiao
ac23f0ce1f
[Update] Update init file for Korbench ( #1737 )
2024-12-05 11:26:00 +08:00
Linchen Xiao
9de27b4d85
[Update] Update max_out_len for datasets ( #1726 )
...
* [Update] Update max_out_len for datasets
* Update eval_regression_chat_objective_fullbench.py
* Update eval_regression_chat.py
* Update eval_regression_chat.py
* Update oc_score_baseline_fullbench.yaml
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-02 11:42:07 +08:00
Junnan Liu
fe6d76fb13
[Feature] Support LiveMathBench ( #1727 )
2024-11-30 00:07:19 +08:00
liushz
c437135fad
[Feature] Add Openai Simpleqa dataset ( #1720 )
...
* Add Openai SimpleQA dataset
* Add Openai SimpleQA dataset
* Add Openai SimpleQA dataset
* Update eval_simpleqa.py
---------
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-11-28 19:16:07 +08:00
wanyu2018umac
90efcf2216
[Feature] Add P-MMEval ( #1714 )
...
* Update with PMMEval
* Update
* Update __init__.py
* Fix Bugs
* Delete .pre-commit-config.yaml
* Pull merge
---------
Co-authored-by: liushz <qq1791167085@163.com>
2024-11-27 21:26:18 +08:00
Junnan Liu
f7dbe6bb7d
[Feature] Add Arc Prize Public Evaluation ( #1690 )
...
* support arc prize
* update arc-prize dataset info & update arc-prize evaluation performance
2024-11-27 15:44:41 +08:00
Linchen Xiao
ef695e28e5
[Bug] Fix Korbench dataset module ( #1717 )
2024-11-26 17:13:28 +08:00
Songyang Zhang
f97c4eae42
[Update] Update Fullbench ( #1712 )
...
* Update JudgerBench
* Support O1-style Prompts
* Update Code
2024-11-26 14:26:55 +08:00
Yufeng Zhao
300adc31e8
[Feature] Add Korbench dataset ( #1713 )
...
* first version for korbench
* first stage for korbench
* korbench_1
* korbench_1
* korbench_1
* korbench_1
* korbench_1_revised
* korbench_combined_1
* korbench_combined_1
* kor_combined
* kor_combined
* update
---------
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2024-11-25 20:11:27 +08:00
liushz
e49fcfd3a3
[Update] Update MATH dataset with model judge ( #1711 )
...
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
* Update math with llm judge
2024-11-25 15:14:55 +08:00
Linchen Xiao
ab8fdbbaab
[Update] Update Math auto-download data ( #1700 )
2024-11-18 20:24:35 +08:00
abrohamLee
e9e4b69ddb
[Feature] MuSR Dataset Evaluation ( #1689 )
...
* MuSR Dataset Evaluation
* MuSR Dataset Evaluation
Add an assertion and a Readme.md
2024-11-14 20:42:12 +08:00
Linchen Xiao
e92a5d4230
[Feature] BABILong Dataset added ( #1684 )
...
* update
* update
* update
* update
2024-11-14 15:32:43 +08:00
Linchen Xiao
a0ef2fd3b4
[Update] Dingo Dataset update ( #1670 )
...
* [Update] Dingo Dataset update
* update
2024-11-08 14:38:43 +08:00
Linchen Xiao
835bf75a36
[Feature] Add long context evaluation for base models ( #1666 )
...
* [Update] Add base long context evaluation
* update
2024-11-08 10:53:29 +08:00
liushz
f7d899823c
[Update] Update mmmlu_lite dataload ( #1658 )
...
* update mmmlu_lite dataload from oss
* update mmmlu_lite dataload from oss
2024-11-01 17:32:29 +08:00
Songyang Zhang
c789ce5698
[Fix] the automatic download for several datasets ( #1652 )
...
* [Fix] the automatic download for several datasets
* Update
* Update
* Update CI
2024-11-01 15:57:18 +08:00
bittersweet1999
a0853c939d
[Add] Add CompassArenaSubjectiveBench ( #1645 )
...
* fix pip version
* fix pip version
* add compassarenasubjectivebench
* add compassarenasubjectivebench
* add compassarenabench
2024-11-01 13:52:22 +08:00
Linchen Xiao
df57c08ccf
[Feature] Update Models, Summarizers ( #1600 )
2024-10-29 18:37:15 +08:00
Junnan Liu
645c5f3b2c
[Datasets] Add datasets CMO&AIME ( #1610 )
...
* add datasets cmo&aime
* delete unused modules
* modify prompt
* update __init__
* update data load and add README
* update data load
* update performance
* update md5
* remove indents
* add indent
* fix log for debug mode
2024-10-28 18:08:02 +08:00
Linchen Xiao
a61e8a0803
[Update] Internal humaneval add ( #1641 )
...
* [Update] internal_humaneval_add
* update
2024-10-25 19:08:42 +08:00
Linchen Xiao
662dddf41a
[Update] Add internal humaneval postprocess ( #1636 )
2024-10-24 17:45:21 +08:00
Songyang Zhang
a4d5a6c81b
[Feature] Support LiveCodeBench ( #1617 )
...
* Update
* Update LCB
* Update
* Update
* Update
* Update
* Update
2024-10-21 20:50:39 +08:00
Chenguang Li
5868d5afa4
[Bug] Fix-NPU-Support ( #1618 )
...
* bugfix NPU support
* formatting
---------
Co-authored-by: noemotiovon <noemotiovon@gmail.com>
2024-10-21 17:42:53 +08:00
Bob Tsang
dd0b655bd0
[Feature] Support MMMLU & MMMLU-lite Benchmark ( #1565 )
...
* rm folder
* modify format according to reviewer
* modify format according to reviewer
* modify format according to reviewer
* add some files requirement
* fix some bug
* fix bug
* change load type
* Update MMMLU Dataset
* Update MMMLU Dataset
* Add MMMLU-Lite Dataset
* update MMMLU dataset
* update MMMLU dataset
* update MMMLU dataset
---------
Co-authored-by: BobTsang <BobTsang1995@gmail.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-10-17 19:09:34 +08:00
bittersweet1999
f0d436496e
[Update] update docs and add compassarena ( #1614 )
...
* fix pip version
* fix pip version
* update docs and add compassarena
* update docs
2024-10-17 14:39:06 +08:00
Haoran Que
4fe251729b
Upload HelloBench ( #1607 )
...
* upload hellobench
* update hellobench
* update readme.md
* update eval_hellobench.py
* update latest
---------
Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-10-15 17:11:37 +08:00
bittersweet1999
fa54aa62f6
[Feature] Add Judgerbench and reorg subeval ( #1593 )
...
* fix pip version
* fix pip version
* update (#1522 )
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* [Feature] Update Models (#1518 )
* Update Models
* Update
* Update humanevalx
* Update
* Update
* [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527 )
add judgerbench and reorg sub
add judgerbench and reorg subeval
add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-10-15 16:36:05 +08:00
liushz
5faee929db
[Feature] Add GaoKaoMath Dataset for Evaluation & MATH Model Eval Config ( #1589 )
...
* Add GaoKaoMath Dataset
* Add MATH LLM Eval
* Update GAOKAO Math Eval Dataset
* Update GAOKAO Math Eval Dataset
2024-10-12 19:13:06 +08:00
bittersweet1999
3f7a3730d7
[Fix] fix Flames ( #1599 )
...
* fix pip version
* fix pip version
* fix flames
* fix flames
2024-10-12 14:34:59 +08:00
Linchen Xiao
763d7755b6
[BUG] GaokaoBench dataset fix ( #1583 )
2024-09-30 15:13:26 +08:00
shijinpjlab
7528b8ab8a
[Feature] Add dingo test ( #1529 )
...
* add qa dingo
* update
* change name qa to dingo
* eval model: llm_base
* update path
* change name and move path
* add eval_dingo
* update import
* add for pip
* add dingo package
* change import place
* update import place
* fix lint fail
* isort
* double quoted
---------
Co-authored-by: sj <shijin@pjlab.org.cn>
2024-09-29 19:24:58 +08:00
liushz
c9a7026f59
[Feature] Update MathBench & WikiBench for FullBench ( #1521 )
...
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
* Update GPQA & MMLU_Pro
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
* Update MathBench & WikiBench for FullBench
---------
Co-authored-by: liushz <liuhongwei@pjlab.rog.cn>
2024-09-18 14:35:30 +08:00
Songyang Zhang
6997990c93
[Feature] Update Models ( #1518 )
...
* Update Models
* Update
* Update humanevalx
* Update
* Update
2024-09-12 23:35:30 +08:00
bittersweet1999
7c7fa36235
[Feature] add support for internal Followbench ( #1511 )
...
* fix pip version
* fix pip version
* add internal followbench
* add internal followbench
* fix lint
* fix lint
2024-09-11 13:32:34 +08:00
Linchen Xiao
87ffa71d68
[Feature] Longbench dataset update
2024-09-06 15:50:12 +08:00
Hari Seldon
faf5260155
[Feature] Optimize Evaluation Speed of SciCode ( #1489 )
...
* update scicode
* update comments
* remove redundant variable
* Update
---------
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-09-06 00:59:41 +08:00
Linchen Xiao
6c9cd9a260
[Feature] Needlebench auto-download update ( #1480 )
...
* update
* update
* update
2024-09-05 17:22:42 +08:00
Linchen Xiao
9693be46b7
[Feature] Mmlu-pro auto-download ( #1464 )
...
* update
* update
* update
* update
* update
2024-08-30 10:03:40 +08:00
Linchen Xiao
245664f4c0
[Feature] Fullbench v0.1 language update ( #1463 )
...
* update
* update
* update
* update
2024-08-28 14:01:05 +08:00
Songyang Zhang
7c2d25b557
[Fix] Update SciCode and Gemma model ( #1449 )
...
* [Fix] Update SciCode and Gemma model
* Update
* Update
2024-08-23 10:42:27 +08:00
Hari Seldon
14b4b735cb
[Feature] Add support for SciCode ( #1417 )
...
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode
* add SciCode w/ bg
* add scicode
* Update README.md
* Update README.md
* Delete configs/eval_SciCode.py
* rename
* 1
* rename
* Update README.md
* Update scicode.py
* Update scicode.py
* fix some bugs
* Update
* Update
---------
Co-authored-by: root <HariSeldon0>
Co-authored-by: tonysy <sy.zhangbuaa@gmail.com>
2024-08-22 13:42:25 +08:00