Commit Graph

89 Commits

Author SHA1 Message Date
Linchen Xiao
408f5caff4
[Dataset] Add SuperGPQA subfield configs (#2124)
* update

* fix lint

* fix lint

* update precommit

* update precommit

* fix lint
2025-05-28 14:12:58 +08:00
zhulinJulia24
c3779ebfc1
[ci] update dlc setting (#2112) 2025-05-22 16:47:57 +08:00
zhulinJulia24
f982d6278e
[CI] fix baseline score (#2000)
* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update
2025-04-03 19:32:36 +08:00
Linchen Xiao
0b7f76e193
[Bug] Fix Summarizer logic (#1953) 2025-03-17 18:25:08 +08:00
Yufeng Zhao
15c825a51a
[Update] Bbeh harmony summarizer added (#1951)
* bbeh

* bbeh

* fix_smallbugs_bbeh

* removeprint

* harmonic

* update_summarizer

* harmonic-tested

* harmonic-tested

* clean

* clean

* cleaned_rebased

---------

Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2025-03-17 17:19:56 +08:00
Linchen Xiao
d7daee6e25
[Update] OpenAI model update, bigcodebench update (#1879)
* [Update] Openai model update, bigcodebench update

* update
2025-02-20 19:33:25 +08:00
Alexander Lam
7f2aeeff26
Added predicted win rate reporting to Bradley-Terry subjective evaluation methods, with an option to switch between win rates and Elo ratings (#1815) 2025-01-10 18:20:25 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation 2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM (#1795)
* Update Generic LLM Evaluator

* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
bittersweet1999
357ce8c7a4
[Fix] Fix model summarizer abbr (#1789)
* fix pip version

* fix pip version

* fix model summarizer abbr

---------

Co-authored-by: root <bittersweet1999>
2024-12-27 14:45:08 +08:00
bittersweet1999
38dba9919b
[Fix] Fix Subjective summarizer order error (#1767)
* fix pip version

* fix pip version

* fix order error
2024-12-18 13:21:31 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model (#1751)
* fix lint issues

* updated gitignore

* changed infer_order from random to double for pairwise_judge.py (not changing for pairwise_bt_judge.py)

* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summarizer (#1758)
add new dataset summarizer
2024-12-13 09:50:43 +08:00
zhulinJulia24
a1c00cc8b7
[ci] add common_summarizer return (#1724)
* Update common_summarizer.py

* Update common_summarizer.py
2024-12-11 20:38:32 +08:00
bittersweet1999
08d63b5bf3
[Fix] Fix error in subjective default summarizer (#1740)
* fix pip version

* fix pip version

* fix summarizer bug
2024-12-06 11:03:53 +08:00
Chang Cheng
fd7aa83c01
[Update] Update DLC Runner (#1662)
* push interntrain hard code

* push interntrain hard code

* remove redundant post process

---------

Co-authored-by: changcheng <changcheng@pjlab.org.cb>
Co-authored-by: changcheng <changcheng@pjlab.org.cn>
2024-11-07 15:45:35 +08:00
Linchen Xiao
df57c08ccf
[Feature] Update Models, Summarizers (#1600) 2024-10-29 18:37:15 +08:00
BigDong
2542bc6907
[Feature] Support results saving as md format table (#1638) 2024-10-25 15:50:33 +08:00
Linchen Xiao
be3c06a158
[Fix] Update common summarizer regex extraction (#1631) 2024-10-22 14:35:45 +08:00
Haoran Que
4fe251729b
Upload HelloBench (#1607)
* upload hellobench

* update hellobench

* update readme.md

* update eval_hellobench.py

* update latest

---------

Co-authored-by: bittersweet1999 <148421775+bittersweet1999@users.noreply.github.com>
2024-10-15 17:11:37 +08:00
bittersweet1999
fa54aa62f6
[Feature] Add Judgerbench and reorg subeval (#1593)
* fix pip version

* fix pip version

* update (#1522)

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>

* [Feature] Update Models (#1518)

* Update Models

* Update

* Update humanevalx

* Update

* Update

* [Feature] Dataset prompts update for ARC, BoolQ, Race (#1527)

add judgerbench and reorg sub

add judgerbench and reorg subeval

add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

* add judgerbench and reorg subeval

---------

Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
Co-authored-by: Linchen Xiao <xxllcc1993@gmail.com>
2024-10-15 16:36:05 +08:00
bittersweet1999
3f7a3730d7
[Fix] fix Flames (#1599)
* fix pip version

* fix pip version

* fix flames

* fix flames
2024-10-12 14:34:59 +08:00
zhulinJulia24
87df8a73a3
[CI] add a common summarizer for qabench summarizer (#1545)
* update

* update

* update

---------

Co-authored-by: zhulin1 <zhulin1@pjlab.org.cn>
2024-09-25 13:40:47 +08:00
bittersweet1999
7c7fa36235
[Feature] add support for internal Followbench (#1511)
* fix pip version

* fix pip version

* add internal followbench

* add internal followbench

* fix lint

* fix lint
2024-09-11 13:32:34 +08:00
bittersweet1999
c2bcd8725e
[Fix] Fix wildbench (#1508)
* fix pip version

* fix pip version

* fix_wildbench
2024-09-10 17:35:07 +08:00
bittersweet1999
ce7f4853ce
[Fix] Sub summarizer order fix (#1426)
* fix pip version

* fix pip version

* fix sub summarizer order

* fix order
2024-08-15 21:08:18 +08:00
Linchen Xiao
8e55c9c6ee
[Update] Compassbench v1.3 (#1396)
* stash files

* compassbench subjective evaluation added

* evaluation update

* fix lint

* update docs

* Update lint

* changes saved

* changes saved

* CompassBench subjective summarizer added (#1349)

* subjective summarizer added

* fix lint

[Fix] Fix MathBench (#1351)

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>

[Update] Update model support list (#1353)

* fix pip version

* fix pip version

* update model support

subjective summarizer updated

knowledge, math objective done (data need update)

remove secrets

objective changes saved

knowledge data added

* secrets removed

* changed added

* summarizer modified

* summarizer modified

* compassbench coding added

* fix lint

* objective summarizer updated

* compass_bench_v1.3 updated

* update files in config folder

* remove unused model

* lcbench modified

* removed model evaluation configs

* remove duplicated sdk implementation

---------

Co-authored-by: zhangsongyang <zhangsongyang@pjlab.org.cn>
2024-08-12 19:09:19 +08:00
Songyang Zhang
704853e5e7
[Feature] Update pip install (#1324)
* [Feature] Update pip install

* Update Configuration

* Update

* Update

* Update

* Update Internal Config

* Update collect env
2024-07-29 18:32:50 +08:00
jxd
12b84aeb3b
[Feature] Update CHARM Memorization (#1230)
* update gemini api and add gemini models

* add openai models

* update CHARM evaluation

* add CHARM memorization tasks

* add CharmMemSummarizer (output eval details for memorization-independent reasoning analysis)

* update CHARM readme

---------

Co-authored-by: wujiang <wujiang@pjlab.org.cn>
2024-07-26 18:42:30 +08:00
WANG WENJIN
0aad8199c7
Fix the summary error in subjective.py (#1363) 2024-07-25 18:36:13 +08:00
Linchen Xiao
8127fc3518
CompassBench subjective summarizer added (#1349)
* subjective summarizer added

* fix lint
2024-07-23 12:29:57 +08:00
Mo Li
104bddf647
[Doc] Update NeedleBench Docs (#1330)
* update needlebench docs

* update model_name_mapping dict

* update README

* Update README_zh-CN.md

---------

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
2024-07-18 13:16:19 +08:00
bittersweet1999
8e7ad2e981
[Fix] add bc for alignbench summarizer (#1306)
* fix pip version

* fix pip version

* fix alignbench

* fix import error
2024-07-12 11:06:20 +08:00
bittersweet1999
68ca48496b
[Refactor] Reorganize subjective eval (#1284)
* fix pip version

* fix pip version

* reorganize subjective eval

* reorg sub

* reorg subeval

* reorg subeval

* update subjective doc

* reorg subeval

* reorg subeval
2024-07-05 22:11:37 +08:00
Fengzhe Zhou
a32f21a356
[Sync] Sync with internal codes 2024.06.28 (#1279) 2024-06-28 14:16:34 +08:00
klein
1fa62c4a42
Support wildbench (#1266)
Co-authored-by: Leymore <zfz-960727@163.com>
2024-06-24 13:16:27 +08:00
bittersweet1999
982e024540
[Feature] add dataset Fofo (#1224)
* add fofo dataset

* add dataset fofo
2024-06-06 11:40:48 +08:00
Xingyuan Bu
02a0a4e857
MT-Bench-101 (#1215)
* add mt-bench-101

* add readme and requirements

* add mt-bench-101 data

* Update readme_mtbench101.md

* update readme

* update leaderboard

* fix typo

* Update readme_mtbench101.md

* fit newest opencompass

* update readme.md

* mtbench101 to opencompass

* mtbench101 to opencompass

* for code review

* for code review

* for code review

* hook

* hook

---------

Co-authored-by: liujie <ljie@buaa.edu.cn>
2024-06-03 14:52:12 +08:00
bittersweet1999
7c381e5be8
[Fix] fix summarizer (#1217)
* fix summarizer

* fix summarizer
2024-05-31 11:40:47 +08:00
Fengzhe Zhou
a77b8a5cec
[Sync] format (#1214) 2024-05-30 00:21:58 +08:00
Fengzhe Zhou
2954913d9b
[Sync] bump version (#1204) 2024-05-28 23:09:59 +08:00
bittersweet1999
8a8987be0b
fix arenahard summarizer (#1154)
Co-authored-by: Leymore <zfz-960727@163.com>
2024-05-15 13:31:29 +08:00
Fengzhe Zhou
7505b3cadf
[Feature] Add huggingface apply_chat_template (#1098)
* add TheoremQA with 5-shot

* add huggingface_above_v4_33 classes

* use num_worker partitioner in cli

* update theoremqa

* update TheoremQA

* add TheoremQA

* rename theoremqa -> TheoremQA

* update TheoremQA output path

* rewrite many model configs

* update huggingface

* further update

* refine configs

* update configs

* update configs

* add configs/eval_llama3_instruct.py

* add summarizer multi faceted

* update bbh datasets

* update configs/models/hf_llama/lmdeploy_llama3_8b_instruct.py

* rename class

* update readme

* update hf above v4.33
2024-05-14 14:50:16 +08:00
Mo Li
6c711cb262
[Fix] Fix Needlebench Summarizer (#1143)
* update few-shot example

* add 128k
2024-05-13 15:59:34 +08:00
Alexander Lam
35c94d0cde
[Feature] Adding support for LLM Compression Evaluation (#1108)
* fixed formatting based on pre-commit tests

* fixed typo in comments; reduced the number of models in the eval config

* fixed a bug in LLMCompressionDataset, where setting samples=None would result in passing test[:None] to load_dataset

* removed unnecessary variable in _format_table_pivot; changed lark_reporter message to English
2024-04-30 10:51:01 +08:00
liushz
a6f67e1a65
[Fix] Fix Math Evaluation with Judge Model Evaluator & Add README (#1103)
* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Add Math Evaluation with Judge Model Evaluator

* Fix Llama-3 meta template

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

* Fix MATH with JudgeLM Evaluation

---------

Co-authored-by: liuhongwei <liuhongwei@pjlab.org.cn>
2024-04-28 21:58:58 +08:00
Yggdrasill7D6
58a57a4c45
[Feature] add support for Flames datasets (#1093)
* add flames datasets

* fix lint

* rm quota

* add judgemodel info and fix os path

* support flames dataset

* support flames dataset

---------

Co-authored-by: bittersweet1999 <1487910649@qq.com>
2024-04-28 18:56:24 +08:00
klein
e4830a6926
Update CIBench (#1089)
* modify the requirements/runtime.txt: numpy==1.23.4 --> numpy>=1.23.4

* update cibench: dataset and evaluation

* cibench summarizer bug

* update cibench

* move extract_code import

---------

Co-authored-by: zhangchuyu@pjlab.org.cn <zhangchuyu@pjlab.org.cn>
Co-authored-by: Leymore <zfz-960727@163.com>
2024-04-26 18:46:02 +08:00
bittersweet1999
e404b72c52
[Feature] support arenahard evaluation (#1096)
* support arenahard

* support arenahard

* support arenahard
2024-04-26 15:42:00 +08:00
bittersweet1999
6ba1c4937d
[Feature] Support Math evaluation via judgemodel (#1094)
* support openai math evaluation

* support openai math evaluation

* support openai math evaluation

* support math llm judge

* support math llm judge
2024-04-26 14:56:23 +08:00