# General Math Evaluation Guidance

## Introduction

Mathematical reasoning is a crucial capability for large language models (LLMs). Evaluating it means testing whether a model can solve a problem step by step and arrive at the correct final answer. OpenCompass supports this kind of evaluation through the `CustomDataset` and `MATHVerifyEvaluator` components.

## Dataset Format

The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:

- A problem statement
- A solution/answer (typically in LaTeX format with the final answer in \\boxed{})

Example JSONL format:

```json
{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
```

Example CSV format:

```csv
problem,solution
"Find the value of x if 2x + 3 = 7","Let's solve step by step:
2x + 3 = 7
2x = 7 - 3
2x = 4
x = 2
Therefore, \boxed{2}"
```
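The two formats escape text differently: JSON uses `\n` and `\\` escape sequences, while a CSV field stores newlines and backslashes literally inside quotes. To avoid escaping by hand, the files can be generated with the standard library. The following is a minimal sketch; the output directory and file names are placeholders chosen for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path

# One record using the column names from the examples above.
records = [
    {
        "problem": "Find the value of x if 2x + 3 = 7",
        "solution": "2x + 3 = 7\n2x = 4\nx = 2\nTherefore, \\boxed{2}",
    },
]

out_dir = Path(tempfile.mkdtemp())  # placeholder output location
jsonl_path = out_dir / "my_math.jsonl"
csv_path = out_dir / "my_math.csv"

# JSON Lines: one JSON object per line; json.dumps escapes newlines and backslashes.
with jsonl_path.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# CSV: the csv module quotes fields that contain commas or newlines.
with csv_path.open("w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["problem", "solution"])
    writer.writeheader()
    writer.writerows(records)

# Read both files back to confirm they hold the same records.
with jsonl_path.open(encoding="utf-8") as f:
    jsonl_rows = [json.loads(line) for line in f if line.strip()]
with csv_path.open(encoding="utf-8", newline="") as f:
    csv_rows = list(csv.DictReader(f))
```

Both `jsonl_rows` and `csv_rows` should round-trip to the original records, which is a quick sanity check before pointing `path` at the file.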

## Configuration

To evaluate mathematical reasoning, you'll need to set up three main components:

1. Dataset Reader Configuration

```python
math_reader_cfg = dict(
    input_columns=['problem'],  # Column name for the question
    output_column='solution'    # Column name for the answer
)
```

2. Inference Configuration

```python
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```

3. Evaluation Configuration

```python
math_eval_cfg = dict(
    evaluator=dict(type=MATHVerifyEvaluator),
)
```
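As a rough illustration of how the reader and inference configurations fit together, the sketch below fills the `{problem}` placeholder from one dataset record. It assumes the placeholder is filled by plain text substitution; `render_prompt` is a hypothetical helper written for this example, not the actual `PromptTemplate` logic.

```python
math_reader_cfg = dict(input_columns=["problem"], output_column="solution")

# Same prompt text as in the inference configuration above.
prompt = "{problem}\nPlease reason step by step, and put your final answer within \\boxed{}."


def render_prompt(template, record, input_columns):
    """Fill each {column} placeholder with the record's value (hypothetical helper)."""
    text = template
    for column in input_columns:
        # Plain replacement leaves the literal "{}" after \boxed untouched.
        text = text.replace("{" + column + "}", str(record[column]))
    return text


record = {
    "problem": "Find the value of x if 2x + 3 = 7",
    "solution": "Therefore, \\boxed{2}",
}
rendered = render_prompt(prompt, record, math_reader_cfg["input_columns"])
print(rendered)
```

Only the columns listed in `input_columns` reach the prompt; the `output_column` value is held back as the reference for evaluation.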

## Using CustomDataset

Once the three configurations above are defined, reference them from a `CustomDataset` entry:

```python
from opencompass.datasets import CustomDataset

# math_reader_cfg, math_infer_cfg and math_eval_cfg are defined in the Configuration section above.

math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',              # Dataset abbreviation
        path='path/to/your/dataset',         # Path to your dataset file
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]
```

## MATHVerifyEvaluator

The MATHVerifyEvaluator is designed specifically for evaluating mathematical answers. It is built on the `math_verify` library, which parses mathematical expressions and checks them for equivalence, and supports both LaTeX and plain expressions.

The MATHVerifyEvaluator does the following:

1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
3. Verifies mathematical equivalence between predicted and reference answers
4. Provides detailed evaluation results including:
   - Accuracy score
   - Detailed comparison between predictions and references
   - Parse results of both predicted and reference answers
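To make the extraction step concrete, here is a simplified sketch of pulling the final answer out of a `\boxed{...}` expression with brace matching. `extract_last_boxed` is a hypothetical helper written for illustration; it is not the evaluator's implementation, which relies on `math_verify` and handles many more formats.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the brace opened by the marker
    i = start + len(marker)
    content_start = i
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
        i += 1
    return None  # braces never closed


prediction = "2x = 4, so x = 2. Therefore, \\boxed{2}"
reference = "The answer is \\boxed{\\frac{4}{2}}"
print(extract_last_boxed(prediction))  # 2
print(extract_last_boxed(reference))   # \frac{4}{2}
```

The two extracted strings differ even though they denote the same value, which is why the evaluator verifies mathematical equivalence instead of comparing text.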

The evaluator supports:

- Basic arithmetic operations
- Fractions and decimals
- Algebraic expressions
- Trigonometric functions
- Roots and exponents
- Mathematical symbols and operators

Example evaluation output:

```python
{
    'accuracy': 85.0,  # Percentage of correct answers
    'details': [
        {
            'predictions': 'x = 2',  # Parsed prediction
            'references': 'x = 2',   # Parsed reference
            'correct': True          # Whether they match
        },
        # ... more results
    ]
}
```
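The `accuracy` field is a percentage over the per-sample results. The sketch below shows how such a summary could be derived from a list shaped like `details`; `summarize` is a hypothetical helper, and the sample entries are made up for illustration.

```python
def summarize(details):
    """Compute percentage accuracy from per-sample 'correct' flags."""
    if not details:
        return {"accuracy": 0.0, "details": []}
    n_correct = sum(1 for d in details if d["correct"])
    return {"accuracy": 100.0 * n_correct / len(details), "details": details}


# Illustrative per-sample results, using the same keys as the output above.
details = [
    {"predictions": "2", "references": "2", "correct": True},
    {"predictions": "3", "references": "5", "correct": False},
    {"predictions": "1/2", "references": "0.5", "correct": True},
    {"predictions": "7", "references": "7", "correct": True},
]
result = summarize(details)
print(result["accuracy"])  # 75.0
```

Filtering `details` for entries with `correct` set to `False` is a convenient way to inspect which problems a model missed.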

## Complete Example

Here's a complete example of how to set up math evaluation:

```python
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

# Inference configuration
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

# Evaluation configuration
math_eval_cfg = dict(
    evaluator=dict(type=MATHVerifyEvaluator),
)

# Dataset configuration
math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',
        path='path/to/your/dataset.jsonl',  # or .csv
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]

# Model configuration
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='your-model-name',
        path='your/model/path',
        # ... other model configurations
    )
]

# Output directory
work_dir = './outputs/math_eval'
```