# LiveMathBench
## v202412
### Details of Datasets
| dataset | language | #single-choice | #multiple-choice | #fill-in-the-blank | #problem-solving |
| -- | -- | -- | -- | -- | -- |
| AMC | cn | 0 | 0 | 0 | 46 |
| AMC | en | 0 | 0 | 0 | 46 |
| CCEE | cn | 0 | 0 | 13 | 31 |
| CCEE | en | 0 | 0 | 13 | 31 |
| CNMO | cn | 0 | 0 | 0 | 18 |
| CNMO | en | 0 | 0 | 0 | 18 |
| WLPMC | cn | 0 | 0 | 0 | 11 |
| WLPMC | en | 0 | 0 | 0 | 11 |
### How to use
#### G-Pass@k
```python
from mmengine.config import read_base

with read_base():
    from opencompass.datasets.livemathbench_gen import livemathbench_datasets

# Point the evaluator at the judge model and its serving endpoint(s).
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct',
        'url': [
            'http://0.0.0.0:23333/v1',
            '...'
        ]  # set url of evaluation models
    }
)
# Raise the generation budget for the model under test.
livemathbench_datasets[0]['infer_cfg']['inferencer'].update(dict(
    max_out_len=32768  # for o1-like models you need to update max_out_len
))
```
#### Greedy
```python
from mmengine.config import read_base

with read_base():
    from opencompass.datasets.livemathbench_greedy_gen import livemathbench_datasets

# Point the evaluator at the judge model and its serving endpoint(s).
livemathbench_datasets[0]['eval_cfg']['evaluator'].update(
    {
        'model_name': 'Qwen/Qwen2.5-72B-Instruct',
        'url': [
            'http://0.0.0.0:23333/v1',
            '...'
        ]  # set url of evaluation models
    }
)
# Raise the generation budget for the model under test.
livemathbench_datasets[0]['infer_cfg']['inferencer'].update(dict(
    max_out_len=32768  # for o1-like models you need to update max_out_len
))
```
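
Both configs above patch a single dataset entry. If the imported list holds more than one entry, the same overrides could be applied in a loop. The sketch below uses plain dictionaries as stand-ins for the dataset configs so that it runs without OpenCompass installed; `apply_overrides` is a hypothetical helper, not part of OpenCompass, and the key layout is assumed to match the snippets above.

```python
def apply_overrides(datasets, evaluator_cfg, max_out_len):
    """Update every dataset config in place (hypothetical helper).

    Assumes each entry has the nested keys used in the configs above:
    ['eval_cfg']['evaluator'] and ['infer_cfg']['inferencer'].
    """
    for ds in datasets:
        ds['eval_cfg']['evaluator'].update(evaluator_cfg)
        ds['infer_cfg']['inferencer'].update(dict(max_out_len=max_out_len))
    return datasets


# Stand-in for `livemathbench_datasets`; real entries carry more fields.
demo_datasets = [
    {'eval_cfg': {'evaluator': {}}, 'infer_cfg': {'inferencer': {}}}
    for _ in range(2)
]

apply_overrides(
    demo_datasets,
    evaluator_cfg={
        'model_name': 'Qwen/Qwen2.5-72B-Instruct',
        'url': ['http://0.0.0.0:23333/v1'],  # set url of evaluation models
    },
    max_out_len=32768,
)
```

In a real config, the list imported under `read_base()` would be passed in place of `demo_datasets`.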
### Output Samples
| dataset | version | metric | mode | Qwen2.5-72B-Instruct |
|----- | ----- | ----- | ----- | -----|
| LiveMathBench | 9befbf | G-Pass@16_0.0 | gen | xx.xx |
| LiveMathBench | caed8f | G-Pass@16_0.25 | gen | xx.xx |
| LiveMathBench | caed8f | G-Pass@16_0.5 | gen | xx.xx |
| LiveMathBench | caed8f | G-Pass@16_0.75 | gen | xx.xx |
| LiveMathBench | caed8f | G-Pass@16_1.0 | gen | xx.xx |
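
Each `G-Pass@16_τ` row reports G-Pass@k at `k = 16` for a threshold `τ`. As a rough illustration of the per-question quantity, the sketch below assumes the hypergeometric form of the metric: given `n` generations of which `c` are correct, it returns the probability that at least `⌈τ·k⌉` of `k` generations drawn without replacement are correct. Treating `τ = 0.0` as "at least one correct" is an assumption made here, and `g_pass_at_k` is an illustrative name rather than the OpenCompass evaluator's API.

```python
from math import ceil, comb


def g_pass_at_k(n: int, c: int, k: int, tau: float) -> float:
    """Per-question G-Pass@k at threshold tau (illustrative sketch).

    n: generations sampled for the question
    c: how many of them are correct
    k: subset size drawn without replacement
    tau: required fraction of correct answers within the subset
    """
    if not (0 <= c <= n and 1 <= k <= n and 0.0 <= tau <= 1.0):
        raise ValueError("need 0 <= c <= n, 1 <= k <= n, 0 <= tau <= 1")
    # Minimum number of correct draws; rounding guards against float noise.
    # Assumption: tau = 0.0 still requires one correct draw.
    m = max(ceil(round(tau * k, 9)), 1)
    # Hypergeometric tail: P(at least m correct among k draws).
    hits = sum(comb(c, j) * comb(n - c, k - j) for j in range(m, min(c, k) + 1))
    return hits / comb(n, k)


# Example: 48 generations, 24 correct, thresholds matching the table rows.
for tau in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"G-Pass@16_{tau}: {g_pass_at_k(48, 24, 16, tau):.4f}")
```

The value is non-increasing in `τ`, so a large drop from the `0.0` row to the `1.0` row indicates a model that can solve a problem on some attempts but not consistently.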