Zhangzefeng
f046d49e92
Update run.py
2025-01-14 17:44:22 +08:00
Alexander Lam
7f2aeeff26
Added predicted win rate reporting to Bradley-Terry subjective evaluation methods, with an option to switch between win rates and Elo ratings ( #1815 )
2025-01-10 18:20:25 +08:00
zhulinJulia24
121d482378
[CI] Fix path conflict ( #1814 )
...
* update
* Update pr-run-test.yml
* update
2025-01-09 20:16:08 +08:00
zhulinJulia24
abdcee68f6
[CI] Update daily test metrics threshold ( #1812 )
...
* Update daily-run-test.yml
* Update pr-run-test.yml
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
---------
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
2025-01-09 18:16:24 +08:00
Zhao Qihao
e039f3efa0
[Feature] Support MMLU-CF Benchmark ( #1775 )
...
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* Update mmlu-cf
* Update mmlu-cf
* Update mmlu-cf
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* [Feature] Support MMLU-CF Benchmark
* Remove outside configs
---------
Co-authored-by: liushz <qq1791167085@163.com>
2025-01-09 14:11:20 +08:00
Songyang Zhang
f1e50d4bf0
[Update] Update LiveMathBench ( #1809 )
...
* Update LiveMathBench
* Update New O1 Evaluation
* Update O1 evaluation
2025-01-07 19:16:12 +08:00
Songyang Zhang
8fdb72f567
[Update] Update o1 eval prompt ( #1806 )
...
* Update XML prediction post-process
* Update LiveMathBench
* Update LiveMathBench
* Update New O1 Evaluation
2025-01-07 00:14:32 +08:00
Alexander Lam
f871e80887
[Feature] Add Bradley-Terry Subjective Evaluation method to Arena Hard dataset ( #1802 )
...
* added base_models_abbrs to references (passed from LMEvaluator); added Bradley-Terry subjective evaluation method for wildbench, alpacaeval, and compassarena datasets; added all_scores output files for reference in CompassArenaBradleyTerrySummarizer
* added bradleyterry subjective evaluation method to arena_hard dataset
2025-01-03 16:33:43 +08:00
Linchen Xiao
117dc500ad
[Feature] Add Longbenchv2 support ( #1801 )
...
* Create eval_longbenchv2.py
* Create longbenchv2_gen.py
* Update __init__.py
* Create longbenchv2.py
* Update datasets_info.py
* update
* update
* update
* update
* update
* update
---------
Co-authored-by: abrohamLee <146956824+abrohamLee@users.noreply.github.com>
2025-01-03 12:04:29 +08:00
Linchen Xiao
f3220438bc
[BUMP] Bump version to 0.3.9 ( #1790 )
2024-12-31 16:52:47 +08:00
liushz
9c980cbc62
[Feature] Add LiveStemBench Dataset ( #1794 )
...
* [Fix] Fix vllm max_seq_len parameter transfer
* [Fix] Fix vllm max_seq_len parameter transfer
* Add livestembench dataset
* Add livestembench dataset
* Add livestembench dataset
* Update livestembench_gen_3e3c50.py
* Update eval_livestembench.py
* Update eval_livestembench.py
2024-12-31 15:17:39 +08:00
Songyang Zhang
fc0556ec8e
[Fix] Fix generic_llm_evaluator output_path ( #1798 )
...
* Fix output_path
* Add Logger
2024-12-31 13:05:05 +08:00
Alexander Lam
dc6035cfcb
[Feature] Added Bradley-Terry subjective evaluation
2024-12-31 11:01:23 +08:00
Songyang Zhang
98435dd98e
[Feature] Update o1 evaluation with JudgeLLM ( #1795 )
...
* Update Generic LLM Evaluator
* Update o1 style evaluator
2024-12-30 17:31:00 +08:00
Junnan Liu
8e8d4f1c64
[Feature] Support G-Pass@k and LiveMathBench ( #1772 )
...
* support G-Pass@k and livemathbench
* fix bugs
* fix comments of GPassKEvaluator
* update saved details of GPassKEvaluator
* update saved details of GPassKEvaluator
* fix eval api configs & update openai_api for ease of debugging
* update huggingface path
* fix method name of G-Pass@k
* fix default value of eval_model_name
* refactor G-Pass@k evaluator
* log generation params for each backend
* fix evaluation resume
* add NotImplementedError
2024-12-30 16:59:39 +08:00
Linchen Xiao
42b54d6bb8
[Update] Add 0shot CoT config for TheoremQA ( #1783 )
2024-12-27 16:17:27 +08:00
bittersweet1999
357ce8c7a4
[Fix] Fix model summarizer abbr ( #1789 )
...
* fix pip version
* fix pip version
* fix model summarizer abbr
---------
Co-authored-by: root <bittersweet1999>
2024-12-27 14:45:08 +08:00
Linchen Xiao
ae9efb73ad
[CI] Pypi deploy workflow update ( #1786 )
2024-12-27 14:08:37 +08:00
Linchen Xiao
f103e90764
[CI] Update deploy python version ( #1784 )
2024-12-27 13:35:36 +08:00
zhulinJulia24
ebeb578fbf
[ci] remove daily step retry and update pr score ( #1782 )
...
[ci] remove daily step retry
2024-12-26 16:51:26 +08:00
Linchen Xiao
56eaac6d8f
[Update] Volc status exception handle ( #1780 )
...
* update
* update
2024-12-26 15:43:24 +08:00
zhulinJulia24
c48bbde26f
[ci] remove testcase into volc engine ( #1777 )
...
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* Update daily-run-test.yml
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
* update
2024-12-25 17:26:50 +08:00
Linchen Xiao
ebefffed61
[Update] Update OC academic 202412 ( #1771 )
...
* [Update] Update academic settings
* Update
* update
2024-12-19 18:07:34 +08:00
Chang Lan
d70100cdf2
[Update] Customizable tokenizer for RULER ( #1731 )
...
* Customizable tokenizer for RULER
* Relax requirements
2024-12-19 18:02:11 +08:00
Junnan Liu
499302857f
[Fix] Fix Local Runner Params Save Path ( #1768 )
...
* update local runner params save dir
* fix remove
* fix directory remove
* Fix *_params.py by uuid4
2024-12-19 16:07:34 +08:00
Mashiro
9a5adbde6a
[Fix] Fix lark reporter issue ( #1769 )
2024-12-18 19:33:06 +08:00
zhulinJulia24
111f817e04
[ci] add fullbench testcase ( #1766 )
...
add volc testcase
2024-12-18 13:24:28 +08:00
bittersweet1999
38dba9919b
[Fix] Fix Subjective summarizer order error ( #1767 )
...
* fix pip version
* fix pip version
* fix order error
2024-12-18 13:21:31 +08:00
Linchen Xiao
d593bfeac8
[Bump] Bump version to 0.3.8 ( #1765 )
...
* [Bump] Bump version to 0.3.8
* Update README.md
2024-12-17 19:17:18 +08:00
Linchen Xiao
eadbdcb4cb
[Update] Update requirement and deepseek configurations ( #1764 )
2024-12-17 10:16:47 +08:00
liushz
5c8e91f329
[Fix] Fix vllm max_seq_len parameter transfer ( #1745 )
...
* [Fix] Fix vllm max_seq_len parameter transfer
* [Fix] Fix vllm max_seq_len parameter transfer
* Update pr-run-test.yml
* Update pr-run-test.yml
---------
Co-authored-by: zhulinJulia24 <145004780+zhulinJulia24@users.noreply.github.com>
2024-12-16 21:44:36 +08:00
Alexander Lam
1bd594fc62
[Feature] Added CompassArena-SubjectiveBench with Bradley-Terry Model ( #1751 )
...
* fix lint issues
* updated gitignore
* changed infer_order from random to double for pairwise_judge.py (not changing for pairwise_bt_judge.py)
* added return statement to CompassArenaBradleyTerrySummarizer to return overall score for each judger model
2024-12-16 13:41:28 +08:00
zhulinJulia24
aeded4c4db
add new dataset summarizer ( #1758 )
...
add new dataset summarizer
2024-12-13 09:50:43 +08:00
zhulinJulia24
a1c00cc8b7
[ci] add common_summarizer return ( #1724 )
...
* Update common_summarizer.py
* Update common_summarizer.py
2024-12-11 20:38:32 +08:00
liushz
c4ce0174fe
[Fix] Fix ChineseSimpleQA max_out_len ( #1757 )
...
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update Csimpleqa
* Update Csimpleqa
* Update Csimpleqa
---------
Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
2024-12-11 19:51:27 +08:00
Linchen Xiao
bd7b705be4
[Update] Update dataset configuration with no max_out_len ( #1754 )
2024-12-11 18:20:29 +08:00
OpenStellarTeam
1a5b3fc11e
Add Chinese SimpleQA config ( #1697 )
...
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* add chinese simpleqa config
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update CsimpleQA
* Update Csimpleqa
---------
Co-authored-by: 明念 <heyancheng.hyc@taobao.com>
Co-authored-by: liushz <qq1791167085@163.com>
2024-12-11 18:03:39 +08:00
Linchen Xiao
0d26b348e4
[Feature] Add OC academic 2412 ( #1750 )
2024-12-10 21:53:06 +08:00
bittersweet1999
54c0fb7a93
[Change] Change Compassarena metric ( #1749 )
...
* fix pip version
* fix pip version
* fix summarizer bug
* fix compassarena
* fix compassarena
* fix compassarena
2024-12-10 14:45:32 +08:00
Songyang Zhang
0d8df541bc
[Update] Update O1-style Benchmark and Prompts ( #1742 )
...
* Update JudgerBench
* Support O1-style Prompts
* Update Code
* Update OpenAI
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update
* Update
* Update
* Update
2024-12-09 13:48:56 +08:00
Junnan Liu
f333be177c
[Update] Add MATH500 & AIME2024 to LiveMathBench ( #1741 )
...
* upload dataset definitions & configs
* add single dataset split specific metrics
* add k-pass@threshold & MATH500
* update std computation & k-pass computation
* add AIME2024
* update README
2024-12-06 14:36:49 +08:00
bittersweet1999
08d63b5bf3
[Fix] Fix error in subjective default summarizer ( #1740 )
...
* fix pip version
* fix pip version
* fix summarizer bug
2024-12-06 11:03:53 +08:00
Songyang Zhang
fb43dd1906
[Update] Update Skywork/Qwen-QwQ ( #1728 )
...
* Update JudgerBench
* Support O1-style Prompts
* Update Code
* Update OpenAI
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update BigCodeBench
* Update
2024-12-05 19:30:43 +08:00
Junnan Liu
6181ac1122
[Update] Update LiveMathBench Evaluation to Support Single Dataset Split Metric Computation ( #1730 )
...
* upload dataset definitions & configs
* add single dataset split specific metrics
* add k-pass@threshold & MATH500
2024-12-05 16:54:16 +08:00
Linchen Xiao
4f317d1bd5
[Update] Update Manifest ( #1738 )
2024-12-05 13:59:56 +08:00
Linchen Xiao
ac23f0ce1f
[Update] Update init file for Korbench ( #1737 )
2024-12-05 11:26:00 +08:00
Yufeng Zhao
4d773904d4
[Update] Korbench readme supplementation ( #1734 )
...
* renewed
* readme
---------
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-05 11:24:35 +08:00
Linchen Xiao
a011be6798
[Feature] DLC runner Lark report ( #1735 )
...
* [Bump] Bump version to 0.3.7
* DLC lark report update
2024-12-04 18:03:12 +08:00
Linchen Xiao
e2a290fd46
[Bump] Bump version to 0.3.7 ( #1733 )
2024-12-03 19:34:57 +08:00
Yufeng Zhao
98c4666d65
[Update] Update Korbench dataset abbr ( #1729 )
...
Co-authored-by: yufeng zhao <zhaoyufeng@pjlab.org.cn>
2024-12-02 16:20:58 +08:00