
# LiveMathBench

## Details of Datasets

| dataset  | language | #single-choice | #multiple-choice | #fill-in-the-blank | #problem-solving |
| -------- | -------- | -------------- | ---------------- | ------------------ | ---------------- |
| AIMC     | cn       | 0              | 0                | 0                  | 46               |
| AIMC     | en       | 0              | 0                | 0                  | 46               |
| CEE      | cn       | 0              | 0                | 13                 | 40               |
| CEE      | en       | 0              | 0                | 13                 | 40               |
| CMO      | cn       | 0              | 0                | 0                  | 18               |
| CMO      | en       | 0              | 0                | 0                  | 18               |
| MATH500  | en       | 0              | 0                | 0                  | 500              |
| AIME2024 | en       | 0              | 0                | 0                  | 44               |

## How to use

```python
from mmengine.config import read_base

with read_base():
    from opencompass.datasets.livemathbench import livemathbench_datasets

livemathbench_datasets[0].update(
    {
        'abbr': 'livemathbench_${k}x${n}',
        'path': '/path/to/data/dir',
        'k': 'k@pass',  # the max value of k in k@pass
        'n': 'number of runs',  # number of runs
    }
)
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct',
        'url': [
            'http://0.0.0.0:23333/v1',
            '...'
        ]  # set the URLs of the evaluation models
    }
)
```
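
To actually run the benchmark, the dataset list above is typically paired with a model definition in a top-level config and launched through OpenCompass's runner. Below is a minimal sketch assuming an OpenAI-compatible endpoint; the file name, model entry, and parameter values are illustrative and not taken from this repository:

```python
# eval_livemathbench.py -- hypothetical top-level config (illustrative names).
from mmengine.config import read_base
from opencompass.models import OpenAISDK

with read_base():
    from opencompass.datasets.livemathbench import livemathbench_datasets

datasets = livemathbench_datasets  # configured as shown above
models = [
    dict(
        type=OpenAISDK,
        abbr='qwen2.5-72b-instruct',
        path='Qwen/Qwen2.5-72B-Instruct',           # served model name
        key='EMPTY',                                # placeholder key for a local server
        openai_api_base='http://0.0.0.0:23333/v1',  # OpenAI-compatible endpoint
        max_out_len=2048,
        batch_size=8,
    ),
]
```

Such a config is then launched through the usual OpenCompass entry point, e.g. `python run.py eval_livemathbench.py` from the repository root.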

At present, `extract_from_boxed` is used to extract answers from model responses. You can also leverage an LLM for extraction through the following parameters, but this part of the code has not been tested:

```python
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct',
        'url': [
            'http://0.0.0.0:23333/v1',
            '...'
        ],  # set the URLs of the evaluation models

        # for LLM-based extraction
        'use_extract_model': True,
        'post_model_name': 'oc-extractor',
        'post_url': [
            'http://0.0.0.0:21006/v1',
            '...'
        ]
    }
)
```
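
For reference, boxed-answer extraction conventionally means pulling the content of the last `\boxed{...}` in the response. The following is a minimal sketch of that idea, not the repository's actual `extract_from_boxed` implementation:

```python
from typing import Optional

def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in `response`, if any.

    A simplified sketch: braces are balanced so that nested expressions
    such as \\boxed{\\frac{1}{2}} are captured whole.
    """
    start = response.rfind('\\boxed{')
    if start == -1:
        return None
    i = start + len('\\boxed{')
    depth = 1
    chars = []
    while i < len(response):
        ch = response[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return ''.join(chars)  # matching close brace found
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: no complete boxed answer
```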

## Output Samples

| dataset       | version | metric     | mode | Qwen2.5-72B-Instruct |
| ------------- | ------- | ---------- | ---- | -------------------- |
| LiveMathBench | caed8f  | 1@pass     | gen  | 26.07                |
| LiveMathBench | caed8f  | 1@pass/std | gen  | xx.xx                |
| LiveMathBench | caed8f  | 2@pass     | gen  | xx.xx                |
| LiveMathBench | caed8f  | 2@pass/std | gen  | xx.xx                |
| LiveMathBench | caed8f  | pass-rate  | gen  | xx.xx                |
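
For intuition, G-Pass@k-style metrics estimate, for each problem, the probability that at least a required fraction of k samples drawn from the n generations are correct. Below is a sketch of the hypergeometric estimator described in the LiveMathBench report; variable names are mine, and this is not the repository's evaluator code:

```python
from math import ceil, comb

def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Hypergeometric estimate of a G-Pass@k-style metric for one problem.

    Of n generations, c are judged correct; returns the probability that
    at least ceil(tau * k) of k samples drawn without replacement are
    correct. A sketch of the published definition, not the evaluator code.
    """
    return sum(
        comb(c, j) * comb(n - c, k - j) / comb(n, k)
        for j in range(ceil(tau * k), min(c, k) + 1)
    )

# With tau = 1.0 every drawn sample must be correct; averaging over all
# problems gives the reported score, and pass-rate is simply mean(c / n).
print(g_pass_at_k(n=16, c=8, k=4, tau=1.0))  # ~0.0385
```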