
# LiveMathBench

## Details of Datasets

| dataset | language | #single-choice | #multiple-choice | #fill-in-the-blank | #problem-solving |
| -- | -- | -- | -- | -- | -- |
| AIMC | cn | 0 | 0 | 0 | 46 |
| AIMC | en | 0 | 0 | 0 | 46 |
| CEE | cn | 0 | 0 | 13 | 40 |
| CEE | en | 0 | 0 | 13 | 40 |
| CMO | cn | 0 | 0 | 0 | 18 |
| CMO | en | 0 | 0 | 0 | 18 |
| MATH500 | en | 0 | 0 | 0 | 500 |
| AIME2024 | en | 0 | 0 | 0 | 44 |
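
As a quick sanity check, the per-split counts in the table above can be totalled programmatically. The sketch below simply re-enters the table's numbers into a dictionary; the names `COUNTS` and `total_by_language` are illustrative and not part of OpenCompass.

```python
# Question counts copied from the table above:
# (dataset, language) -> (single-choice, multiple-choice, fill-in-the-blank, problem-solving)
COUNTS = {
    ('AIMC', 'cn'): (0, 0, 0, 46),
    ('AIMC', 'en'): (0, 0, 0, 46),
    ('CEE', 'cn'): (0, 0, 13, 40),
    ('CEE', 'en'): (0, 0, 13, 40),
    ('CMO', 'cn'): (0, 0, 0, 18),
    ('CMO', 'en'): (0, 0, 0, 18),
    ('MATH500', 'en'): (0, 0, 0, 500),
    ('AIME2024', 'en'): (0, 0, 0, 44),
}


def total_by_language(counts):
    """Sum all question types per language (hypothetical helper)."""
    totals = {}
    for (_, lang), per_type in counts.items():
        totals[lang] = totals.get(lang, 0) + sum(per_type)
    return totals


print(total_by_language(COUNTS))  # {'cn': 117, 'en': 661}
```

Together the splits contain 778 questions, most of them from MATH500.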


## How to use


```python
from mmengine.config import read_base

with read_base():
    from opencompass.datasets.livemathbench import livemathbench_datasets

livemathbench_datasets[0].update(
    {
        'abbr': 'livemathbench_${k}x${n}',
        'path': '/path/to/data/dir',
        'k': 'k@pass',  # the max value of k in k@pass
        'n': 'number of runs',  # number of runs
    }
)
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct', 
        'url': [
            'http://0.0.0.0:23333/v1', 
            '...'
        ]  # set url of evaluation models
    }
)

```
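
The `${k}x${n}` in `abbr` stands for the values chosen for `k` and `n`. A minimal sketch of filling it in and building the dictionary passed to `livemathbench_datasets[0].update(...)`, assuming `k` and `n` are positive integers and that `k` should not exceed the number of runs `n` (both assumptions). The helper `make_livemathbench_overrides` and the values `k=2`, `n=8` are hypothetical and for illustration only.

```python
def make_livemathbench_overrides(path: str, k: int, n: int) -> dict:
    """Build the override dict for livemathbench_datasets[0].update(...).

    Hypothetical helper. Assumes k and n are positive integers and that
    k (max k in k@pass) does not exceed n (number of runs).
    """
    if k < 1 or n < 1:
        raise ValueError('k and n must be positive integers')
    if k > n:
        raise ValueError('k should not exceed the number of runs n')
    return {
        'abbr': f'livemathbench_{k}x{n}',  # fills the ${k}x${n} placeholder
        'path': path,
        'k': k,
        'n': n,
    }


overrides = make_livemathbench_overrides('/path/to/data/dir', k=2, n=8)
print(overrides['abbr'])  # livemathbench_2x8
```

Encoding `k` and `n` in `abbr` keeps results from different settings apart in the summary table.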

> ❗️ By default, answers are extracted from model responses with `extract_from_boxed`. Alternatively, an LLM can perform the extraction when the parameters below are set, but this code path has not been tested yet.
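
To illustrate what boxed-answer extraction involves, here is a minimal sketch that returns the contents of the last `\boxed{...}` in a response, matching braces by depth so nested groups such as `\frac{1}{2}` survive. This is an assumption about the general technique, not OpenCompass's actual `extract_from_boxed`; the function name `extract_last_boxed` is hypothetical, and variants such as `\fbox` are not handled.

```python
def extract_last_boxed(response: str):
    """Return the contents of the last \\boxed{...} in `response`, or None.

    Hypothetical sketch: braces are matched by depth so nested groups
    such as \\boxed{\\frac{1}{2}} are returned whole.
    """
    marker = '\\boxed{'
    start = response.rfind(marker)  # last occurrence wins
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1
    for i in range(content_start, len(response)):
        ch = response[i]
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth == 0:
                return response[content_start:i]
    return None  # unbalanced braces


print(extract_last_boxed(r'So the answer is $\boxed{\frac{1}{2}}$.'))  # \frac{1}{2}
```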

```python
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct', 
        'url': [
            'http://0.0.0.0:23333/v1', 
            '...'
        ],  # set url of evaluation models

        # for LLM-based extraction
        'use_extract_model': True,
        'post_model_name': 'oc-extractor',
        'post_url': [
            'http://0.0.0.0:21006/v1',
            '...'
        ]
    }
)
```

## Output Samples

| dataset | version | metric | mode | Qwen2.5-72B-Instruct |
|----- | ----- | ----- | ----- | -----|
| LiveMathBench | caed8f | 1@pass | gen | 26.07 |
| LiveMathBench | caed8f | 1@pass/std | gen | xx.xx |
| LiveMathBench | caed8f | 2@pass | gen | xx.xx |
| LiveMathBench | caed8f | 2@pass/std | gen | xx.xx |
| LiveMathBench | caed8f | pass-rate | gen | xx.xx |
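
This README does not define k@pass precisely. If it behaves like the widely used unbiased pass@k estimator, 1 - C(n - c, k) / C(n, k) (an assumption; check the evaluator source before relying on it), per-problem scores could be computed and averaged into a percentage as sketched below. Here `n` is the number of sampled responses per problem and `c` the number judged correct; the correct counts are made-up illustrative data, not LiveMathBench results.

```python
import math
import statistics


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: samples per problem, c: correct samples, k: sample budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError('require 0 <= c <= n and 1 <= k <= n')
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Made-up correct counts for three problems, n = 4 samples each.
correct_counts = [4, 1, 0]
scores = [pass_at_k(4, c, k=2) for c in correct_counts]
print([round(s, 3) for s in scores])  # [1.0, 0.5, 0.0]
print(round(100 * statistics.mean(scores), 2))  # 50.0
```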