Linchen Xiao
408f5caff4
[Dataset] Add SuperGPQA subfield configs ( #2124 )
...
* update
* fix lint
* fix lint
* update precommit
* update precommit
* fix lint
2025-05-28 14:12:58 +08:00
zhulinJulia24
c3779ebfc1
[ci] update dlc setting ( #2112 )
2025-05-22 16:47:57 +08:00
zhulinJulia24
f982d6278e
[CI] fix baseline score ( #2000 )
...
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
2025-04-03 19:32:36 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic ( #1953 )
2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] Bbeh harmony summarizer added ( #1951 )
...
* bbeh
* bbeh
* fix_smallbugs_bbeh
* removeprint
* harmonic
* update_summarizer
* harmonic-tested
* harmonic-tested
* clean
* clean
* cleaned_rebased
---------
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
d7daee6e25
[Update] OpenAI model update, bigcodebench update ( #1879 )
...
* [Update] Openai model update, bigcodebench update
* update
2025-02-20 19:33:25 +08:00
Alexander Lam
7f2aeeff26
added predicted win rates reporting to bradley terry subj eval methods with an option to switch between win rates and elo ratings ( #1815 )
2025-01-10 18:20:25 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation
2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM ( #1795 )
...
* Update Generic LLM Evaluator
* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
bittersweet1999
357ce8c7a4
[Fix] Fix model summarizer abbr ( #1789 )
...
* fix pip version
* fix pip version
* fix model summarizer abbr
---------
Co-authored-by: root <bittersweet1999>
2024-12-27 14:45:08 +08:00
bittersweet1999
38dba9919b
[Fix] Fix Subjective summarizer order error ( #1767 )
...
* fix pip version
* fix pip version
* fix order error
2024-12-18 13:21:31 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model ( #1751 )
...
* fix lint issues
* updated gitignore
* changed infer_order from random to double for pairwise_judge.py (not changing for pairwise_bt_judge.py)
* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summarizer ( #1758 )
...
add new dataset summarizer
2024-12-13 09:50:43 +08:00
zhulinJulia24
a1c00cc8b7
[ci] add common_summarizer return ( #1724 )
...
* Update common_summarizer.py
* Update common_summarizer.py
2024-12-11 20:38:32 +08:00
bittersweet1999
08d63b5bf3
[Fix] Fix error in subjective default summarizer ( #1740 )
...
* fix pip version
* fix pip version
* fix summarizer bug
2024-12-06 11:03:53 +08:00
Chang Cheng
fd7aa83c01
[Update] Update DLC Runner ( #1662 )
...
* push interntrain hard code
* push interntrain hard code
* remove redundant post process
---------
Co-authored-by: changcheng <changcheng@pjlab.org.cb>
Co-authored-by: changcheng <changcheng@pjlab.org.cn>
2024-11-07 15:45:35 +08:00
Linchen Xiao
df57c08ccf
[Feature] Update Models, Summarizers ( #1600 )
2024-10-29 18:37:15 +08:00
BigDong
2542bc6907
[Feature] Support results saving as md format table ( #1638 )
2024-10-25 15:50:33 +08:00
Linchen Xiao
be3c06a158
[Fix] Update common summarizer regex extraction ( #1631 )
2024-10-22 14:35:45 +08:00
Haoran Que
4fe251729b
Upload HelloBench ( #1607 )
...
* upload hellobench
* update hellobench
* update readme.md
* update eval_hellobench.py
* update latest
---------
Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-10-15 17:11:37 +08:00
bittersweet1999
fa54aa62f6
[Feature] Add Judgerbench and reorg subeval ( #1593 )
...
* fix pip version
* fix pip version
* update (#1522 )
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
* [Feature] Update Models (#1518 )
* Update Models
* Update
* Update humanevalx
* Update
* Update
* [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527 )
add judgerbench and reorg sub
add judgerbench and reorg subeval
add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
* add judgerbench and reorg subeval
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-10-15 16:36:05 +08:00
bittersweet1999
3f7a3730d7
[Fix] fix Flames ( #1599 )
...
* fix pip version
* fix pip version
* fix flames
* fix flames
2024-10-12 14:34:59 +08:00
zhulinJulia24
87df8a73a3
[CI] add a common summarizer for qabench summarizer ( #1545 )
...
* update
* update
* update
---------
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-25 13:40:47 +08:00
bittersweet1999
7c7fa36235
[Feature] add support for internal Followbench ( #1511 )
...
* fix pip version
* fix pip version
* add internal followbench
* add internal followbench
* fix lint
* fix lint
2024-09-11 13:32:34 +08:00
bittersweet1999
c2bcd8725e
[Fix] Fix wildbench ( #1508 )
...
* fix pip version
* fix pip version
* fix_wildbench
2024-09-10 17:35:07 +08:00
bittersweet1999
ce7f4853ce
[Fix] Sub summarizer order fix ( #1426 )
...
* fix pip version
* fix pip version
* fix sub summarizer order
* fix order
2024-08-15 21:08:18 +08:00
Linchen Xiao
8e55c9c6ee
[Update] Compassbench v1.3 ( #1396 )
...
* stash files
* compassbench subjective evaluation added
* evaluation update
* fix lint
* update docs
* Update lint
* changes saved
* changes saved
* CompassBench subjective summarizer added (#1349 )
* subjective summarizer added
* fix lint
[Fix] Fix MathBench (#1351 )
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
[Update] Update model support list (#1353 )
* fix pip version
* fix pip version
* update model support
subjective summarizer updated
knowledge, math objective done (data need update)
remove secrets
objective changes saved
knowledge data added
* secrets removed
* changed added
* summarizer modified
* summarizer modified
* compassbench coding added
* fix lint
* objective summarizer updated
* compass_bench_v1.3 updated
* update files in config folder
* remove unused model
* lcbench modified
* removed model evaluation configs
* remove duplicated sdk implementation
---------
Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-08-12 19:09:19 +08:00
Songyang Zhang
704853e5e7
[Feature] Update pip install ( #1324 )
...
* [Feature] Update pip install
* Update Configuration
* Update
* Update
* Update
* Update Internal Config
* Update collect env
2024-07-29 18:32:50 +08:00
jxd
12b84aeb3b
[Feature] Update CHARM Memorization ( #1230 )
...
* update gemini api and add gemini models
* add openai models
* update CHARM evaluation
* add CHARM memorization tasks
* add CharmMemSummarizer (output eval details for memorization-independent reasoning analysis)
* update CHARM readme
---------
Co-authored-by: wujiang <wujiang@pjlab.org.cn>
2024-07-26 18:42:30 +08:00
WANG WENJIN
0aad8199c7
Fix the summary error in subjective.py ( #1363 )
2024-07-25 18:36:13 +08:00
Linchen Xiao
8127fc3518
CompassBench subjective summarizer added ( #1349 )
...
* subjective summarizer added
* fix lint
2024-07-23 12:29:57 +08:00
Mo Li
104bddf647
[Doc] Update NeedleBench Docs ( #1330 )
...
* update needlebench docs
* update model_name_mapping dict
* update README
* Update README_zh-CN.md
---------
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-07-18 13:16:19 +08:00
bittersweet1999
8e7ad2e981
[Fix] add bc for alignbench summarizer ( #1306 )
...
* fix pip version
* fix pip version
* fix alignbench
* fix import error
2024-07-12 11:06:20 +08:00
bittersweet1999
68ca48496b
[Refactor] Reorganize subjective eval ( #1284 )
...
* fix pip version
* fix pip version
* reorganize subjective eval
* reorg sub
* reorg subeval
* reorg subeval
* update subjective doc
* reorg subeval
* reorg subeval
2024-07-05 22:11:37 +08:00
Fengzhe Zhou
a32f21a356
[Sync] Sync with internal codes 2024.06.28 ( #1279 )
2024-06-28 14:16:34 +08:00
klein
1fa62c4a42
Support wildbench ( #1266 )
...
Co-authored-by: Leymore <zfz-960727@163.com>
2024-06-24 13:16:27 +08:00
bittersweet1999
982e024540
[Feature] add dataset Fofo ( #1224 )
...
* add fofo dataset
* add dataset fofo
2024-06-06 11:40:48 +08:00
Xingyuan Bu
02a0a4e857
MT-Bench-101 ( #1215 )
...
* add mt-bench-101
* add readme and requirements
* add mt-bench-101 data
* Update readme_mtbench101.md
* update readme
* update leaderboard
* fix typo
* Update readme_mtbench101.md
* fit newest opencompass
* update readme.md
* mtbench101 to opencompass
* mtbench101 to opencompass
* for code review
* for code review
* for code review
* hook
* hook
---------
Co-authored-by: liujie <ljie@buaa.edu.cn>
2024-06-03 14:52:12 +08:00
bittersweet1999
7c381e5be8
[Fix] fix summarizer ( #1217 )
...
* fix summarizer
* fix summarizer
2024-05-31 11:40:47 +08:00
Fengzhe Zhou
a77b8a5cec
[Sync] format ( #1214 )
2024-05-30 00:21:58 +08:00
Fengzhe Zhou
2954913d9b
[Sync] bump version ( #1204 )
2024-05-28 23:09:59 +08:00
bittersweet1999
8a8987be0b
fix arenahard summarizer ( #1154 )
...
Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-15 13:31:29 +08:00
Fengzhe Zhou
7505b3cadf
[Feature] Add huggingface apply_chat_template ( #1098 )
...
* add TheoremQA with 5-shot
* add huggingface_above_v4_33 classes
* use num_worker partitioner in cli
* update theoremqa
* update TheoremQA
* add TheoremQA
* rename theoremqa -> TheoremQA
* update TheoremQA output path
* rewrite many model configs
* update huggingface
* further update
* refine configs
* update configs
* update configs
* add configs/eval_llama3_instruct.py
* add summarizer multi faceted
* update bbh datasets
* update configs/models/hf_llama/lmdeploy_llama3_8b_instruct.py
* rename class
* update readme
* update hf above v4.33
2024-05-14 14:50:16 +08:00
Mo Li
6c711cb262
[Fix] Fix Needlebench Summarizer ( #1143 )
...
* update few-shot example
* add 128k
2024-05-13 15:59:34 +08:00
Alexander Lam
35c94d0cde
[Feature] Adding support for LLM Compression Evaluation ( #1108 )
...
* fixed formatting based on pre-commit tests
* fixed typo in comments; reduced the number of models in the eval config
* fixed a bug in LLMCompressionDataset, where setting samples=None would result in passing test[:None] to load_dataset
* removed unnecessary variable in _format_table_pivot; changed lark_reporter message to English
2024-04-30 10:51:01 +08:00
liushz
a6f67e1a65
[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README ( #1103 )
...
* Add Math Evaluation with Judge Model Evaluator
* Add Math Evaluation with Judge Model Evaluator
* Add Math Evaluation with Judge Model Evaluator
* Add Math Evaluation with Judge Model Evaluator
* Fix Llama-3 meta template
* Fix MATH with JudgeLM Evaluation
* Fix MATH with JudgeLM Evaluation
* Fix MATH with JudgeLM Evaluation
* Fix MATH with JudgeLM Evaluation
---------
Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-04-28 21:58:58 +08:00
Yggdrasill7D6
58a57a4c45
[Feature] add support for Flames datasets ( #1093 )
...
* add flames datasets
* fix lint
* rm quota
* add judgemodel info and fix os path
* support flames dataset
* support flames dataset
---------
Co-authored-by: bittersweet1999 <1487910649@qq.com>
2024-04-28 18:56:24 +08:00
klein
e4830a6926
Update CIBench ( #1089 )
...
* modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4
* update cibench: dataset and evaluation
* cibench summarizer bug
* update cibench
* move extract_code import
---------
Co-authored-by: zhangchuyu@pjlab.org.cn <zhangchuyu@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 18:46:02 +08:00
bittersweet1999
e404b72c52
[Feature] support arenahard evaluation ( #1096 )
...
* support arenahard
* support arenahard
* support arenahard
2024-04-26 15:42:00 +08:00
bittersweet1999
6ba1c4937d
[Feature] Support Math evaluation via judgemodel ( #1094 )
...
* support openai math evaluation
* support openai math evaluation
* support openai math evaluation
* support math llm judge
* support math llm judge
2024-04-26 14:56:23 +08:00