update_doc

commit 5b9f4a4e7b
parent 046b6f75c6

190 docs/en/advanced_guides/general_math.md Normal file
@@ -0,0 +1,190 @@

# General Math Evaluation Guidance

## Introduction

Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.

## Dataset Format

The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:

- A problem statement
- A solution/answer (typically in LaTeX format, with the final answer wrapped in \\boxed{})

Example JSONL format:

```json
{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
```

Example CSV format:

```csv
problem,solution
"Find the value of x if 2x + 3 = 7","Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"
```
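
If you assemble such a file programmatically, the sketch below writes a couple of problems in the required JSONL layout. It is a minimal illustration only; the file name `my_math_dataset.jsonl` and the toy problems are placeholders.

```python
import json

# Toy problems in the required format; each solution ends with the
# final answer wrapped in \boxed{}.
problems = [
    {
        'problem': 'Find the value of x if 2x + 3 = 7',
        'solution': '2x + 3 = 7\n2x = 4\nx = 2\nTherefore, \\boxed{2}',
    },
    {
        'problem': 'Compute 3 + 4 * 2',
        'solution': '4 * 2 = 8\n3 + 8 = 11\nTherefore, \\boxed{11}',
    },
]

# JSON Lines: one JSON object per line.
with open('my_math_dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in problems:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
```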

## Configuration

To evaluate mathematical reasoning, you'll need to set up three main components:

1. Dataset Reader Configuration

```python
math_reader_cfg = dict(
    input_columns=['problem'],  # Column name for the question
    output_column='solution',  # Column name for the answer
)
```

2. Inference Configuration

```python
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```
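
For reference, the single HUMAN turn above expands to a prompt like the following. This is a plain-Python illustration of the rendered text, not OpenCompass's actual templating code; the literal braces of `\\boxed{}` are doubled here only because `str.format` is used for the demonstration.

```python
# Illustration only: OpenCompass substitutes {problem} from the dataset row.
template = ('{problem}\nPlease reason step by step, '
            'and put your final answer within \\boxed{{}}.')
print(template.format(problem='Find the value of x if 2x + 3 = 7'))
# Output:
# Find the value of x if 2x + 3 = 7
# Please reason step by step, and put your final answer within \boxed{}.
```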

3. Evaluation Configuration

```python
math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator),
)
```

## Using CustomDataset

Here's how to set up a complete configuration for math evaluation:

```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset

math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',  # Dataset abbreviation
        path='path/to/your/dataset',  # Path to your dataset file
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]
```

## MATHEvaluator

The MATHEvaluator is specifically designed to evaluate mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, and supports extraction and equivalence checking for both LaTeX and plain expressions.
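
To give a sense of the underlying check, here is a minimal sketch using math_verify directly, assuming its `parse`/`verify` interface; MATHEvaluator wires this into OpenCompass, so you normally never call it yourself.

```python
from math_verify import parse, verify

# Parse the reference answer and a model prediction into comparable form.
gold = parse('\\boxed{\\frac{1}{2}}')
pred = parse('The final answer is \\boxed{0.5}')

# verify() checks mathematical equivalence rather than string equality,
# so 1/2 and 0.5 should compare as equal.
print(verify(gold, pred))  # expected: True
```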

The MATHEvaluator:

1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
3. Verifies mathematical equivalence between predicted and reference answers
4. Provides detailed evaluation results, including:
   - Accuracy score
   - Detailed comparison between predictions and references
   - Parse results of both predicted and reference answers

The evaluator supports:

- Basic arithmetic operations
- Fractions and decimals
- Algebraic expressions
- Trigonometric functions
- Roots and exponents
- Mathematical symbols and operators

Example evaluation output:

```python
{
    'accuracy': 85.0,  # Percentage of correct answers
    'details': [
        {
            'predictions': 'x = 2',  # Parsed prediction
            'references': 'x = 2',  # Parsed reference
            'correct': True,  # Whether they match
        },
        # ... more results
    ]
}
```

## Complete Example

Here's a complete example of how to set up math evaluation:

```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

# Inference configuration
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

# Evaluation configuration
math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator),
)

# Dataset configuration
math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',
        path='path/to/your/dataset.jsonl',  # or .csv
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]

# Model configuration
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='your-model-name',
        path='your/model/path',
        # ... other model configurations
    )
]

# Output directory
work_dir = './outputs/math_eval'
```

@@ -40,7 +40,6 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md

 .. _Prompt:
 .. toctree::

@@ -61,16 +60,12 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
    advanced_guides/needleinahaystack_eval.md

 .. _Tools:

190 docs/zh_cn/advanced_guides/general_math.md Normal file
(Chinese translation of "General Math Evaluation Guidance"; identical in content to the English version above.)

@@ -41,7 +41,6 @@ OpenCompass 上手路线
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md

 .. _提示词:
 .. toctree::

@@ -61,17 +60,12 @@ OpenCompass 上手路线
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
-   advanced_guides/compassbench_intro.md
    advanced_guides/needleinahaystack_eval.md

 .. _工具: