diff --git a/docs/en/advanced_guides/general_math.md b/docs/en/advanced_guides/general_math.md
new file mode 100644
index 00000000..da9cfd2f
--- /dev/null
+++ b/docs/en/advanced_guides/general_math.md
@@ -0,0 +1,190 @@
+# General Math Evaluation Guidance
+
+## Introduction
+
+Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its ability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
+
+## Dataset Format
+
+The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:
+
+- A problem statement
+- A solution/answer (typically in LaTeX format with the final answer in \\boxed{})
+
+Example JSONL format:
+
+```json
+{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
+```
+
+Example CSV format:
+
+```csv
+problem,solution
+"Find the value of x if 2x + 3 = 7","Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"
+```
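+
+Before launching an evaluation, it can help to sanity-check that your file actually matches this schema. The snippet below is a minimal, standalone sketch in plain Python (it is not part of OpenCompass, and `path/to/your/dataset.jsonl` is a placeholder); it flags records that are missing the expected fields or a `\boxed{}` final answer:
+
+```python
+import json
+
+def check_math_jsonl(path: str) -> None:
+    """Warn about records that do not match the expected schema."""
+    with open(path, encoding='utf-8') as f:
+        for i, line in enumerate(f, start=1):
+            record = json.loads(line)
+            if 'problem' not in record or 'solution' not in record:
+                print(f'line {i}: missing "problem" or "solution" key')
+            elif '\\boxed{' not in record['solution']:
+                print(f'line {i}: solution has no \\boxed{{}} final answer')
+
+check_math_jsonl('path/to/your/dataset.jsonl')  # placeholder path
+```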
+
+## Configuration
+
+To evaluate mathematical reasoning, you'll need to set up three main components:
+
+1. Dataset Reader Configuration
+
+```python
+math_reader_cfg = dict(
+    input_columns=['problem'],  # Column name for the question
+    output_column='solution'  # Column name for the answer
+)
+```
+
+2. Inference Configuration
+
+```python
+math_infer_cfg = dict(
+    prompt_template=dict(
+        type=PromptTemplate,
+        template=dict(
+            round=[
+                dict(
+                    role='HUMAN',
+                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
+                ),
+            ]
+        ),
+    ),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+```
+
+3. Evaluation Configuration
+
+```python
+math_eval_cfg = dict(
+    evaluator=dict(type=MATHEvaluator),
+)
+```
+
+## Using CustomDataset
+
+Here's how to configure the dataset for math evaluation, reusing the reader, inference, and evaluation configs defined above:
+
+```python
+from opencompass.datasets import CustomDataset
+
+math_datasets = [
+    dict(
+        type=CustomDataset,
+        abbr='my-math-dataset',  # Dataset abbreviation
+        path='path/to/your/dataset',  # Path to your dataset file
+        reader_cfg=math_reader_cfg,
+        infer_cfg=math_infer_cfg,
+        eval_cfg=math_eval_cfg,
+    )
+]
+```
+
+## MATHEvaluator
+
+The MATHEvaluator is specifically designed to evaluate mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
+
+The MATHEvaluator:
+
+1. Extracts answers from both predictions and references using LaTeX extraction
+2. Handles various LaTeX formats and environments
+3. Verifies mathematical equivalence between predicted and reference answers
+4. Provides detailed evaluation results, including:
+   - Accuracy score
+   - Detailed comparison between predictions and references
+   - Parse results of both predicted and reference answers
+
+The evaluator supports:
+
+- Basic arithmetic operations
+- Fractions and decimals
+- Algebraic expressions
+- Trigonometric functions
+- Roots and exponents
+- Mathematical symbols and operators
+
+Example evaluation output:
+
+```python
+{
+    'accuracy': 85.0,  # Percentage of correct answers
+    'details': [
+        {
+            'predictions': 'x = 2',  # Parsed prediction
+            'references': 'x = 2',  # Parsed reference
+            'correct': True  # Whether they match
+        },
+        # ... more results
+    ]
+}
+```
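+
+Under the hood, answer extraction and equivalence checking are delegated to math_verify. The following standalone sketch illustrates that flow with math_verify's `parse`/`verify` helpers, assuming the package is installed and exposes them as in its documented examples (check the library's documentation for the exact API and supported formats):
+
+```python
+from math_verify import parse, verify  # assumed top-level API of math_verify
+
+# A reference answer and a model prediction, each ending in a \boxed{} answer.
+gold = parse('The answer is $\\boxed{\\frac{1}{2}}$')
+pred = parse('... so the final result is $\\boxed{0.5}$')
+
+# verify() should return True when the two expressions are mathematically
+# equivalent (here, 1/2 vs. 0.5); exact behavior depends on the library version.
+print(verify(gold, pred))
+```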
+
+## Complete Example
+
+Here's a complete example of how to set up math evaluation:
+
+```python
+from mmengine.config import read_base
+from opencompass.models import TurboMindModelwithChatTemplate
+from opencompass.datasets import CustomDataset
+from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+from opencompass.openicl.icl_inferencer import GenInferencer
+
+# Dataset reader configuration
+math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
+
+# Inference configuration
+math_infer_cfg = dict(
+    prompt_template=dict(
+        type=PromptTemplate,
+        template=dict(
+            round=[
+                dict(
+                    role='HUMAN',
+                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
+                ),
+            ]
+        ),
+    ),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+
+# Evaluation configuration
+math_eval_cfg = dict(
+    evaluator=dict(type=MATHEvaluator),
+)
+
+# Dataset configuration
+math_datasets = [
+    dict(
+        type=CustomDataset,
+        abbr='my-math-dataset',
+        path='path/to/your/dataset.jsonl',  # or .csv
+        reader_cfg=math_reader_cfg,
+        infer_cfg=math_infer_cfg,
+        eval_cfg=math_eval_cfg,
+    )
+]
+
+# Model configuration
+models = [
+    dict(
+        type=TurboMindModelwithChatTemplate,
+        abbr='your-model-name',
+        path='your/model/path',
+        # ... other model configurations
+    )
+]
+
+# Output directory
+work_dir = './outputs/math_eval'
+```
diff --git a/docs/en/index.rst b/docs/en/index.rst
index 7181c459..0b15a2b8 100644
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -40,7 +40,6 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md
 
 .. _Prompt:
 .. toctree::
@@ -61,16 +60,12 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
    advanced_guides/needleinahaystack_eval.md
 
 .. _Tools:
diff --git a/docs/zh_cn/advanced_guides/general_math.md b/docs/zh_cn/advanced_guides/general_math.md
new file mode 100644
index 00000000..8e8d2fa6
--- /dev/null
+++ b/docs/zh_cn/advanced_guides/general_math.md
@@ -0,0 +1,190 @@
+# 数学能力评测
+
+## 简介
+
+数学推理能力是大语言模型(LLMs)的一项关键能力。为了评估模型的数学能力,我们需要测试其逐步解决数学问题并提供准确最终答案的能力。OpenCompass 通过 CustomDataset 和 MATHEvaluator 组件提供了一种便捷的数学推理评测方式。
+
+## 数据集格式
+
+数学评测数据集应该是 JSON Lines (.jsonl) 或 CSV 格式。每个问题至少应包含:
+
+- 问题陈述
+- 解答/答案(通常使用 LaTeX 格式,最终答案需要用 \\boxed{} 括起来)
+
+JSONL 格式示例:
+
+```json
+{"problem": "求解方程 2x + 3 = 7", "solution": "让我们逐步解决:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\n因此,\\boxed{2}"}
+```
+
+CSV 格式示例:
+
+```csv
+problem,solution
+"求解方程 2x + 3 = 7","让我们逐步解决:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\n因此,\\boxed{2}"
+```
+
+## 配置说明
+
+要进行数学推理评测,你需要设置三个主要组件:
+
+1. 数据集读取配置
+
+```python
+math_reader_cfg = dict(
+    input_columns=['problem'],  # 问题列的名称
+    output_column='solution'  # 答案列的名称
+)
+```
+
+2. 推理配置
+
+```python
+math_infer_cfg = dict(
+    prompt_template=dict(
+        type=PromptTemplate,
+        template=dict(
+            round=[
+                dict(
+                    role='HUMAN',
+                    prompt='{problem}\n请逐步推理,并将最终答案放在 \\boxed{} 中。',
+                ),
+            ]
+        ),
+    ),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+```
+
+3. 评测配置
+
+```python
+math_eval_cfg = dict(
+    evaluator=dict(type=MATHEvaluator),
+)
+```
+
+## 使用 CustomDataset
+
+以下是如何为数学评测配置数据集(复用上文定义的读取、推理和评测配置):
+
+```python
+from opencompass.datasets import CustomDataset
+
+math_datasets = [
+    dict(
+        type=CustomDataset,
+        abbr='my-math-dataset',  # 数据集简称
+        path='path/to/your/dataset',  # 数据集文件路径
+        reader_cfg=math_reader_cfg,
+        infer_cfg=math_infer_cfg,
+        eval_cfg=math_eval_cfg,
+    )
+]
+```
+
+## MATHEvaluator
+
+MATHEvaluator 是专门设计用于评估数学答案的评测器。它基于 math_verify 库进行开发,该库提供了数学表达式解析和验证功能,支持 LaTeX 和一般表达式的提取与等价性验证。
+
+MATHEvaluator 具有以下功能:
+
+1. 使用 LaTeX 提取器从预测和参考答案中提取答案
+2. 处理各种 LaTeX 格式和环境
+3. 验证预测答案和参考答案之间的数学等价性
+4. 提供详细的评测结果,包括:
+   - 准确率分数
+   - 预测和参考答案的详细比较
+   - 预测和参考答案的解析结果
+
+评测器支持:
+
+- 基本算术运算
+- 分数和小数
+- 代数表达式
+- 三角函数
+- 根式和指数
+- 数学符号和运算符
+
+评测输出示例:
+
+```python
+{
+    'accuracy': 85.0,  # 正确答案的百分比
+    'details': [
+        {
+            'predictions': 'x = 2',  # 解析后的预测答案
+            'references': 'x = 2',  # 解析后的参考答案
+            'correct': True  # 是否匹配
+        },
+        # ... 更多结果
+    ]
+}
+```
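+
+在底层,答案提取与等价性校验由 math_verify 完成。下面是一个独立的简化示意,演示 math_verify 的 `parse`/`verify` 用法(假设已安装该库,且其顶层 API 与官方示例一致;具体接口和支持的格式请以该库文档为准):
+
+```python
+from math_verify import parse, verify  # 假设的 math_verify 顶层 API
+
+# 参考答案和模型预测,均以 \boxed{} 给出最终答案
+gold = parse('The answer is $\\boxed{\\frac{1}{2}}$')
+pred = parse('... so the final result is $\\boxed{0.5}$')
+
+# 当两个表达式在数学上等价时(此处 1/2 与 0.5),verify() 应返回 True;
+# 具体行为取决于库的版本
+print(verify(gold, pred))
+```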
+
+## 完整示例
+
+以下是设置数学评测的完整示例:
+
+```python
+from mmengine.config import read_base
+from opencompass.models import TurboMindModelwithChatTemplate
+from opencompass.datasets import CustomDataset
+from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.openicl.icl_prompt_template import PromptTemplate
+from opencompass.openicl.icl_retriever import ZeroRetriever
+from opencompass.openicl.icl_inferencer import GenInferencer
+
+# 数据集读取配置
+math_reader_cfg = dict(input_columns=['problem'], output_column='solution')
+
+# 推理配置
+math_infer_cfg = dict(
+    prompt_template=dict(
+        type=PromptTemplate,
+        template=dict(
+            round=[
+                dict(
+                    role='HUMAN',
+                    prompt='{problem}\n请逐步推理,并将最终答案放在 \\boxed{} 中。',
+                ),
+            ]
+        ),
+    ),
+    retriever=dict(type=ZeroRetriever),
+    inferencer=dict(type=GenInferencer),
+)
+
+# 评测配置
+math_eval_cfg = dict(
+    evaluator=dict(type=MATHEvaluator),
+)
+
+# 数据集配置
+math_datasets = [
+    dict(
+        type=CustomDataset,
+        abbr='my-math-dataset',
+        path='path/to/your/dataset.jsonl',  # 或 .csv
+        reader_cfg=math_reader_cfg,
+        infer_cfg=math_infer_cfg,
+        eval_cfg=math_eval_cfg,
+    )
+]
+
+# 模型配置
+models = [
+    dict(
+        type=TurboMindModelwithChatTemplate,
+        abbr='your-model-name',
+        path='your/model/path',
+        # ... 其他模型配置
+    )
+]
+
+# 输出目录
+work_dir = './outputs/math_eval'
+```
diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst
index 827c7d91..8c6620ca 100644
--- a/docs/zh_cn/index.rst
+++ b/docs/zh_cn/index.rst
@@ -41,7 +41,6 @@ OpenCompass 上手路线
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md
 
 .. _提示词:
 .. toctree::
@@ -61,17 +60,12 @@ OpenCompass 上手路线
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
-   advanced_guides/compassbench_intro.md
    advanced_guides/needleinahaystack_eval.md
 
 .. _工具: