update_doc

commit 5b9f4a4e7b
parent 046b6f75c6

190 docs/en/advanced_guides/general_math.md Normal file
@@ -0,0 +1,190 @@

# General Math Evaluation Guidance

## Introduction

Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.

## Dataset Format

The math evaluation dataset should be in either JSON Lines (.jsonl) or CSV format. Each problem should contain at least:

- A problem statement
- A solution/answer (typically in LaTeX format, with the final answer wrapped in \\boxed{})

Example JSONL format:

```json
{"problem": "Find the value of x if 2x + 3 = 7", "solution": "Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"}
```

Example CSV format:

```csv
problem,solution
"Find the value of x if 2x + 3 = 7","Let's solve step by step:\n2x + 3 = 7\n2x = 7 - 3\n2x = 4\nx = 2\nTherefore, \\boxed{2}"
```
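
If you assemble such a file programmatically, the sketch below writes a couple of problems in the required JSONL layout. It is a minimal illustration only; the file name `my_math_dataset.jsonl` and the toy problems are placeholders.

```python
import json

# Toy problems in the required format; each solution ends with the
# final answer wrapped in \boxed{}.
problems = [
    {
        'problem': 'Find the value of x if 2x + 3 = 7',
        'solution': '2x + 3 = 7\n2x = 4\nx = 2\nTherefore, \\boxed{2}',
    },
    {
        'problem': 'Compute 3 + 4 * 2',
        'solution': '4 * 2 = 8\n3 + 8 = 11\nTherefore, \\boxed{11}',
    },
]

# JSON Lines: one JSON object per line.
with open('my_math_dataset.jsonl', 'w', encoding='utf-8') as f:
    for item in problems:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
```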

## Configuration

To evaluate mathematical reasoning, you'll need to set up three main components:

1. Dataset Reader Configuration

```python
math_reader_cfg = dict(
    input_columns=['problem'],  # Column name for the question
    output_column='solution',  # Column name for the answer
)
```

2. Inference Configuration

```python
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)
```
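
For reference, the single HUMAN turn above expands to a prompt like the following. This is a plain-Python illustration of the rendered text, not OpenCompass's actual templating code; the literal braces of `\\boxed{}` are doubled here only because `str.format` is used for the demonstration.

```python
# Illustration only: OpenCompass substitutes {problem} from the dataset row.
template = ('{problem}\nPlease reason step by step, '
            'and put your final answer within \\boxed{{}}.')
print(template.format(problem='Find the value of x if 2x + 3 = 7'))
# Output:
# Find the value of x if 2x + 3 = 7
# Please reason step by step, and put your final answer within \boxed{}.
```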

3. Evaluation Configuration

```python
math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator),
)
```

## Using CustomDataset

Here's how to set up a complete configuration for math evaluation:

```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset

math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',  # Dataset abbreviation
        path='path/to/your/dataset',  # Path to your dataset file
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]
```

## MATHEvaluator

The MATHEvaluator is specifically designed to evaluate mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, and supports extraction and equivalence checking for both LaTeX and plain expressions.
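
To give a sense of the underlying check, here is a minimal sketch using math_verify directly, assuming its `parse`/`verify` interface; MATHEvaluator wires this into OpenCompass, so you normally never call it yourself.

```python
from math_verify import parse, verify

# Parse the reference answer and a model prediction into comparable form.
gold = parse('\\boxed{\\frac{1}{2}}')
pred = parse('The final answer is \\boxed{0.5}')

# verify() checks mathematical equivalence rather than string equality,
# so 1/2 and 0.5 should compare as equal.
print(verify(gold, pred))  # expected: True
```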

The MATHEvaluator:

1. Extracts answers from both predictions and references using LaTeX extraction
2. Handles various LaTeX formats and environments
3. Verifies mathematical equivalence between predicted and reference answers
4. Provides detailed evaluation results, including:
   - Accuracy score
   - Detailed comparison between predictions and references
   - Parse results of both predicted and reference answers

The evaluator supports:

- Basic arithmetic operations
- Fractions and decimals
- Algebraic expressions
- Trigonometric functions
- Roots and exponents
- Mathematical symbols and operators

Example evaluation output:

```python
{
    'accuracy': 85.0,  # Percentage of correct answers
    'details': [
        {
            'predictions': 'x = 2',  # Parsed prediction
            'references': 'x = 2',  # Parsed reference
            'correct': True,  # Whether they match
        },
        # ... more results
    ]
}
```

## Complete Example

Here's a complete example of how to set up math evaluation:

```python
from mmengine.config import read_base
from opencompass.models import TurboMindModelwithChatTemplate
from opencompass.datasets import CustomDataset
from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer

# Dataset reader configuration
math_reader_cfg = dict(input_columns=['problem'], output_column='solution')

# Inference configuration
math_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(
            round=[
                dict(
                    role='HUMAN',
                    prompt='{problem}\nPlease reason step by step, and put your final answer within \\boxed{}.',
                ),
            ]
        ),
    ),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer),
)

# Evaluation configuration
math_eval_cfg = dict(
    evaluator=dict(type=MATHEvaluator),
)

# Dataset configuration
math_datasets = [
    dict(
        type=CustomDataset,
        abbr='my-math-dataset',
        path='path/to/your/dataset.jsonl',  # or .csv
        reader_cfg=math_reader_cfg,
        infer_cfg=math_infer_cfg,
        eval_cfg=math_eval_cfg,
    )
]

# Model configuration
models = [
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='your-model-name',
        path='your/model/path',
        # ... other model configurations
    )
]

# Output directory
work_dir = './outputs/math_eval'
```

@@ -40,7 +40,6 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md

 .. _Prompt:
 .. toctree::

@@ -61,16 +60,12 @@ We always welcome *PRs* and *Issues* for the betterment of OpenCompass.
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
    advanced_guides/needleinahaystack_eval.md

 .. _Tools:

190 docs/zh_cn/advanced_guides/general_math.md Normal file
(Chinese translation of "General Math Evaluation Guidance"; identical in content to the English version above.)

@@ -41,7 +41,6 @@ OpenCompass 上手路线
    user_guides/experimentation.md
    user_guides/metrics.md
    user_guides/summarizer.md
-   user_guides/corebench.md

 .. _提示词:
 .. toctree::

@@ -61,17 +60,12 @@ OpenCompass 上手路线
    advanced_guides/new_dataset.md
    advanced_guides/custom_dataset.md
    advanced_guides/new_model.md
-   advanced_guides/evaluation_lmdeploy.md
-   advanced_guides/evaluation_lightllm.md
    advanced_guides/accelerator_intro.md
+   advanced_guides/general_math.md
    advanced_guides/code_eval.md
    advanced_guides/code_eval_service.md
-   advanced_guides/prompt_attack.md
-   advanced_guides/longeval.md
    advanced_guides/subjective_evaluation.md
    advanced_guides/circular_eval.md
-   advanced_guides/contamination_eval.md
-   advanced_guides/compassbench_intro.md
    advanced_guides/needleinahaystack_eval.md

 .. _工具: