chore: fix line endings and formatting; add maritime_bench dataset

zhanghaoyu 2025-04-10 13:40:28 +08:00
parent 12213207b6
commit b0bbd4a96a
267 changed files with 11133 additions and 11021 deletions

@ -120,4 +120,4 @@ repos:
# hooks:
# - id: check-algo-readme
# - id: check-copyright
# args: ["mmocr", "tests", "tools"] # these directories will be checked

@ -120,4 +120,4 @@ repos:
# hooks:
# - id: check-algo-readme
# - id: check-copyright
# args: ["mmocr", "tests", "tools"] # these directories will be checked

@ -231,4 +231,4 @@ def builder_inited_handler(app):
def setup(app):
    app.connect('builder-inited', builder_inited_handler)

@ -1,194 +1,194 @@
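# Introduction to CompassBench
## CompassBench 2.0 v1.3
The official CompassBench leaderboard has gone through several iterations. Starting in July 2024, OpenCompass publishes the evaluation rules (evaluation config files) and sample dataset files for the leaderboard, to help the community better understand its evaluation logic and methods.
### Capability Dimensions
The August 2024 leaderboard covers the following capability dimensions: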
| Capability | Task description | Evaluation type | Sample data |
| -------- | -------- | -------- | -------- |
| Language | Evaluates the model on information extraction, content summarization, dialogue, creative writing, and other tasks | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/language |
| Reasoning | Evaluates the model on logical reasoning, commonsense reasoning, table reasoning, and other everyday reasoning tasks | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/reasoning |
| Knowledge | Evaluates the model's knowledge across science, engineering, the humanities, and the social sciences | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/knowledge |
| Math | Evaluates the model on numerical computation and on high-school and university level math problems | Objective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/math |
| Code | Evaluates the model on code generation, code completion, code commenting, code refactoring, code rewriting, and general computer-science Q&A | Objective + subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/code |
| Instruction following | Evaluates whether the model accurately follows complex instructions across language, reasoning, knowledge, and other tasks | Subjective | https://github.com/open-compass/CompassBench/tree/main/v1_3_data/instruct |
| Agent | Evaluates the model on complex tool calling and on using a code interpreter in data-science and math settings | Objective | https://github.com/open-compass/T-Eval https://github.com/open-compass/CIBench |
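### Evaluation Method
- Objective evaluation uses 0-shot + CoT.
  - OpenCompass has heavily optimized post-processing for objective questions and constrains the answer format in the prompt at evaluation time. When an answer cannot be extracted because the model failed to follow the instructions, the response is counted as wrong.
  - Math and agent questions are of the same type as the provided sample data, but the real evaluation data differs from the open-source data.
- Subjective evaluation uses an LLM as the judge.
  - We provide scoring guidelines for every question.
  - The win rate of the model under test is measured against a reference response, on a five-tier scale:
    - `A++`: Response A is much better than Response B.
    - `A+`: Response A is slightly better than Response B.
    - `A=B`: Response A and Response B are of the same quality.
    - `B+`: Response B is slightly better than Response A.
    - `B++`: Response B is much better than Response A.
- Subjective evaluation config file
  - [Example evaluation config](https://github.com/open-compass/opencompass/blob/main/configs/eval_compassbench_v1_3_subjective.py)
- Subjective evaluation prompt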
```
# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the \
responses generated by two AI models.
We will provide you with the user query and a pair of AI-generated \
responses (Response A and Response B).
You should first read the user query and the conversation history \
carefully for analyzing the task, and then evaluate the quality of the \
responses based on the rules provided below.
# Conversation between User and AI
## User Query
<|begin_of_query|>
{question}
<|end_of_query|>
## Response A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## Response B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# Evaluation
## Checklist
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
Please use this checklist to guide your evaluation, but do not limit your \
assessment to the checklist.
## Rules
You should compare the above two responses based on your analysis of the \
user queries and the conversation history.
You should first write down your analysis and the checklist that you used \
for the evaluation, and then provide your assessment according to the \
checklist.
There are five choices to give your final assessment: ["A++", "A+", \
"A=B", "B+", "B++"], which correspond to the following meanings:
- `A++`: Response A is much better than Response B.
- `A+`: Response A is only slightly better than Response B.
- `A=B`: Response A and B are of the same quality. Please use this \
choice sparingly.
- `B+`: Response B is only slightly better than Response A.
- `B++`: Response B is much better than Response A.
## Output Format
First, please output your analysis for each model response, and \
then summarize your assessment to three aspects: "reason A=B", \
"reason A>B", and "reason B>A", and finally make your choice for \
the final assessment.
Please provide your evaluation results in the following json \
format by filling in the placeholders in []:
{
"analysis of A": "[analysis of Response A]",
"analysis of B": "[analysis of Response B]",
"reason of A=B": "[where Response A and B perform equally well]",
"reason of A>B": "[where Response A is better than Response B]",
"reason of B>A": "[where Response B is better than Response A]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
# 指令
您是一位专业评估专家。您的任务是评估两个AI模型生成回答的质量。
我们将为您提供用户问题及一对AI生成的回答回答A和回答B
您应当首先仔细阅读用户问题,然后根据以下提供的规则评估回答的质量。
# 用户与AI之间的对话
## 用户问题
<|begin_of_query|>
{question}
<|end_of_query|>
## 回答A
<|begin_of_response_A|>
{prediction}
<|end_of_response_A|>
## 回答B
<|begin_of_response_B|>
{prediction2}
<|end_of_response_B|>
# 评估
## 检查清单
<|begin_of_checklist|>
{checklist}
<|end_of_checklist|>
请参考此检查清单来评估回答的质量,但不要局限于此检查清单。
## 规则
您应当基于用户查询,分析比较上述两种回答。
您应当基于检查清单写下您的分析,然后提供您的评价。
有五个选项供您做出最终评估:["A++", "A+", "A=B", "B+", "B++"],它们对应如下含义:
- `A++`回答A远胜于回答B。
- `A+`回答A略优于回答B。
- `A=B`回答A和回答B质量相同。请谨慎使用此选项。
- `B+`回答B略优于回答A。
- `B++`回答B远胜于回答A。
## 输出格式
首先,请输出您对每个模型回答的分析,
然后总结您的评估到三个方面:"A=B的理由""A优于B的理由",和 "B优于A的理由"
最后做出您对最终评估的选择。
请按照以下json格式提供您的评估结果通过填充[]中的占位符:
{
"回答A的分析": "[回答A的分析]",
"回答B的分析": "[回答B的分析]",
"A=B的理由": "[A和B回答差不多的理由]",
"A优于B的理由": "[回答A优于B的理由]",
"B优于A的理由": "[回答B优于A的理由]",
"choice": "[A++ or A+ or A=B or B+ or B++]",
}
```
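The judge prompt above asks for a JSON-like verdict whose `choice` field is one of the five tiers. As a rough illustration of how such verdicts could be turned into a win rate, here is a minimal Python sketch. It is not the OpenCompass implementation: the names `extract_choice` and `win_rate` are hypothetical, and it assumes that Response A is the model under test, that `A+` and `A++` count as wins, that ties stay in the denominator, and that unparseable verdicts are skipped. A regular expression is used instead of `json.loads` because the template ends its last field with a trailing comma, which strict JSON parsers reject.

```python
import re
from collections import Counter

# The five verdict tiers named in the judge prompt.
CHOICES = ("A++", "A+", "A=B", "B+", "B++")

# Pull the value of the "choice" field; tolerate optional [] around it.
# Longer alternatives come first so "A++" is not cut short to "A+".
_CHOICE_RE = re.compile(
    r'"choice"\s*:\s*"\[?\s*(A\+\+|A\+|A=B|B\+\+|B\+)\s*\]?"'
)


def extract_choice(judge_output):
    """Return the verdict tier found in a judge reply, or None."""
    match = _CHOICE_RE.search(judge_output)
    return match.group(1) if match else None


def win_rate(judge_outputs):
    """Share of parseable verdicts in which Response A wins.

    Assumption (not from the source): A is the model under test,
    A+ / A++ are wins, and unparseable verdicts are ignored.
    """
    counts = Counter(extract_choice(out) for out in judge_outputs)
    valid = sum(counts[c] for c in CHOICES)
    if valid == 0:
        return 0.0
    return (counts["A++"] + counts["A+"]) / valid
```

For example, `win_rate(['{"choice": "A++",}', '{"choice": "B+",}'])` counts one win out of two parseable verdicts. A real pipeline might instead weight `A++` more heavily than `A+`, or swap the A/B positions and average to reduce position bias; those choices are outside what the text above specifies.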

@ -1,48 +1,48 @@
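# Introduction to CompassBench 2.0
## About v1.0
To support the OpenCompass annual leaderboard, this document gives an overview of CompassBench.
This round evaluates models on a range of tasks covering language, knowledge, creative writing, reasoning, math, code, long-context, and agent capabilities. Task descriptions and sample questions are provided below.
- Evaluation combines subjective and objective methods, with the details designed per task.
- Reasoning, math, code, and agent tasks are evaluated with few-shot + CoT.
- For fill-in-the-blank questions, few-shot examples and output-format constraints in the prompt help with answer extraction.
- For multiple-choice questions, the same question is asked in varied forms to reduce random effects.
- For open-ended questions, each question is sampled multiple times and scored along multiple dimensions.
> OpenCompass has heavily optimized post-processing for objective questions and constrains the answer format in the prompt at evaluation time. When an answer cannot be extracted because the model failed to follow the instructions, the response is counted as wrong. OpenCompass will add an instruction-following evaluation in the next round.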
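The rule that a response with no extractable answer is scored as wrong can be sketched as follows. This is only an illustration under stated assumptions, not OpenCompass's actual post-processor: the helper names `extract_option` and `accuracy` are hypothetical, and the sketch assumes multiple-choice answers labelled A to D that the prompt constrains to the form "the answer is (X)".

```python
import re

# Assumed answer format: "... the answer is (B)." with options A-D.
# Only the phrase is case-insensitive; the option letter must be uppercase.
_ANSWER_RE = re.compile(r"(?i:answer is)\s*\(?([A-D])\b\)?")


def extract_option(response):
    """Return the last option letter stated in a CoT response, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1] if matches else None


def accuracy(responses, gold):
    """Fraction of correct answers; a failed extraction counts as wrong."""
    if not gold:
        return 0.0
    correct = sum(
        1 for resp, ref in zip(responses, gold)
        if extract_option(resp) == ref  # None never equals a gold label
    )
    return correct / len(gold)
```

Taking the last match rather than the first is a design choice: a chain-of-thought response may mention several options before committing to a final one.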
| Capability | Task | Description | Sample question |
| ---- | ---- | ---- | ---- |
| 语言 | 信息抽取 | 信息抽取是指从文本中提取出特定类型的信息。这类任务通常用于处理结构化数据、知识图谱、问答系统等场景。 | ```"question": "野马队在分区轮以 2316 击败了匹兹堡钢人队,在比赛的最后三分钟拿下 11 分。然后他们在美式足球联合会 (AFC) 锦标赛上以 2018 击败了第 49 届超级碗卫冕冠军新英格兰爱国者队,在比赛还剩 17 秒 时拦截了新英格兰队的两分转换传球。尽管曼宁在本赛季的拦截上有问题,但他在两场季后赛中未投任何球。\n野马队在 AFC 锦标赛中打败了谁?"``` |
| 语言 | 意图识别 | 意图识别是对用户输入的文本或语音进行分析,判断其意图或需求。这类任务应用于智能客服、语音助手、聊天机器人等场景。 | ```"question": "中国文化的天人合一思想\n中西文化的基本差异之一就是在人与自然的关系问题上中国文化比较重视人与自然的和谐统一而西方文化则强调人要征服自然、改造自然才能求得自己的生存和发展。中国文化的这种特色有时通过“天人合一”的命题表述出来。中国古代思想家一般都反对把天与人割裂开来、对立起来而主张天人协调、天人合一。\n天人合一问题就其理论实质而言是关于人与自然的统一问题或者说是自然界和精神的统一问题。应当承认中国传统文化中的天人合一思想内容十分复杂其中既有正确的观点也有错误的观点我们必须实事求是地予以分析。但是从文化的民族性以及对民族文化的推进作用和深远影响看我们应当大胆肯定。中国古代思想家关于天人合一的思想其最基本的涵义就是充分肯定自然界和精神的统一关注人类行为与自然界的协调问题。从这个意思上说天人合一思想的是非常有价值的。\n恩格斯对自然和精神的统一问题有过一系列精辟的论述。他说“我们一天天地学会更加正确地理解自然规律学会认识我们对于自然界的惯常行程的干涉所引起的比较近或比较远的影响。”他还说“自然界和精神是统一的。自然界不能是无理性的……而理性是不能和自然界矛盾的。”“思维规律和自然规律只要它们被正确地认识必然是互相一致的。”恩格斯的这些论述深刻地揭示了自然和精神统一问题的丰富内涵。根据恩格斯的这些论述考察中国古代的天人合一思想不难看出这种思想有着深刻的合理性。\n中国古代的天人合一思想强调人与自然的统一人的行为与自然的协调道德理性与自然理性的一致充分显示了中国古代思想家对于主客体之间、主观能动性和客观规律之间关系的辩证思考。根据这种思想人不能违背自然规律不能超越自然界的承受力去改造自然、征服自然、破坏自然而只能在顺从自然规律的条件下去利用自然、调整自然使之更符合人类的需要也使自然界的万物都能生长发展。另一方面自然界也不是主宰人其社会的神秘力量而是可以认识、可以为我所用的客观对象。这种思想长期实践的结果是达到自然界与人的统一人的精神、行为与外在自然的统一自我身心平衡与自然环境平衡的统一以及由于这些统一而达到的天道与人道的统一从而实现完满和谐的精神追求。中国文化的天人合一思想对于解决当今世界由于工业化和无限制地征服自然而带来的自然环境被污染、生态平衡遭破坏等问题具有重要的启迪意义对于我们今天正在进行的社会主义现代化建设更有着防患于未然的重大现实意义。\n选自张岱年等主编的《中国文化概论》有删改\n根据原文提供的信息下列推断不正确的一项是","A": "对人与自然关系的认识,中国古代天人合一思想有优于西方文化的地方。","B": "现代人重视和研究天人合一思想,是基于对现实及发展问题的思考。", "C": "肯定天人合一思想的合理性,并不意味着对其思想内容的全盘接受。", "D": "以天人合一思想为指导,可解决当今世界因工业化带来的各种社会问题。",``` |
| 语言 | 情感分析 | 情感分析是对文本中的情感或情绪进行识别和分析的任务。这类任务可用于情感倾向分析场景。例如,分析社交媒体上的用户评论,了解新闻或事件的倾向。| ```"question": "请问以下评价是正面评价还是负面评价?\n大众点评网的霸王餐200份华辉拉肠双人试吃员村一店是已经有经营两年以上的年前装修过干净齐整下单的服务员亲切有礼可能我是第一个用代码验证的中间拖了点时间去验证幸好周日10点左右没有平时的多人。拉肠一如既往的滑皮蛋瘦肉粥很绵皮蛋瘦肉超多肉肠是一底带肉一底斋肠以前没吃过鸡蛋肠觉得6蚊不太划算现在发现是有三底肠粉的哦不太喜欢吃肉的可以试下很饱肚鼓油是吃过这么多家肠粉店味道调得最好的。","A": "正面评价", "B": "负面评价"```|
| 语言 | 内容总结 | 内容总结是将一篇较长的文本压缩成一篇简短的概括性摘要。这类任务适用于需要快速了解文档核心内容的情境,如新闻标题、电子邮件摘要 | ```联合国减灾办公室负责人格拉瑟。联合国减灾办公室2016年2月11日联合国减灾办公室今天表示,2015年是有记录以来最热的一个年份,在这一年当中,自然灾害影响了近1亿人口。减灾办公室呼吁各国采取行动,应对气候变化,在最大程度上做出努力,防止和减少灾害的发生。联合国减灾办公室所公布的最新数据显示,在过去一年当中,受到灾害影响最重的国家都在亚洲,它们是中国、印度、菲律宾和印度尼西亚。自然灾害共导致2万2000人死亡,带来的经济损失约合660亿美元。然而,尽管这一数字惊人,但却低于1400亿的10年平均数字。其中的部分原因是各国政府采取了更好的防范措施。数据显示,2015年有5000万人深受旱灾之苦,增幅达40%。联合国减灾办公室负责人格拉瑟表示,2015年是记载中最热的一个年份,成因是气候变化和厄尔尼诺天气现象。他指出,最令人感到不安的一个趋势是2015年有记录的主要干旱增加了一倍。他强调,数据表明,减少温室气体排放和适应气候变化对于减少灾害风险至关重要。```|
| 语言 | 内容评价 | 内容评价是对文本的质量、价值或观点进行判断和评价的任务。这类任务可用于评论筛选、观点挖掘等场景。 | ```"question": "以下是一个问题以及针对该问题的两个答案,哪个答案更好?\n问题创建一篇1000字的非剽窃新闻文章关于任天堂将于2月8日星期三播出新的任天堂直面会承诺将公布即将推出的Switch游戏的新细节。2月的任天堂直面会将在东部时间下午5点/太平洋时间下午2点在公司的YouTube频道上直播。\n\n任天堂表示星期三的任天堂直面会将持续“大约”40分钟并将重点放在即将在2023年上半年推出的Nintendo Switch游戏上。\n\n任天堂宣布的Nintendo Switch游戏阵容包括《星之卡比梦之泉豪华版》这是2011年Wii游戏的重制版《魔兵雅各樱花与失落的恶魔》这是一个以童话为灵感的冒险故事主角是年轻的魔兵雅各《塞尔达传说王国之泪》是《荒野之息》的续作以及《皮克敏4》这是一款长期运行的策略/动作系列的新作,仍然有些神秘。\n\n此外任天堂的计划中但日期尚未确定的还有《超级战争大战1+2重启营》这是两款Game Boy Advance时代的《超级战争》游戏的重制版。经过长时间的延迟Switch版的《超级战争》合集据传很快就要推出。当然任天堂在2023年还有其他一些大计划包括2月17日在好莱坞环球影城开设的超级任天堂乐园主题公园以及4月7日上映的《超级马里奥兄弟》电影。格斗游戏确实很酷但可能会让人望而生畏。你可能会看到铁拳或街霸这样的竞技世界然后想“我可能很菜而且没有正确的装备而且我不懂这些术语”但是停下来。你可以进入格斗游戏。对Pat Gill来说这段旅程真的很有意义。以下是他如何拥抱格斗游戏以及你也可以。\n答案 1任天堂粉丝们正急切地期待着最新的任天堂直面发布会该发布会定于2月8日星期三播出。该发布会将在公司的YouTube频道上以东部时间下午5点/太平洋时间下午2点的直播形式进行承诺提供有关任天堂Switch即将推出的游戏的令人兴奋的新细节。\n\n据任天堂称直面发布会将持续大约40分钟专门关注2023年上半年将为Switch发布的游戏。已经宣布的一些游戏标题包括《星之卡比梦之泉DX》、《魔兽传说樱花与失落的恶魔》、《塞尔达传说王国的泪滴》和《皮克敏4》。\n\n任天堂日程表上最受期待的游戏之一是《超级战争大战1+2重启训练营》这是两款经典Game Boy Advance游戏的重制版。尽管尚未正式宣布发行日期但有传言称游戏将在经过长时间延迟后很快推出。\n\n除了游戏阵容外任天堂还计划在2023年推出许多其他令人兴奋的项目包括2月17日在好莱坞环球影城开放的超级任天堂乐园主题公园以及4月7日上映的《超级马里奥兄弟》电影。\n\n对于格斗游戏爱好者来说任天堂直面发布会承诺将是一个令人兴奋的活动将提供任天堂Switch最新和最伟大游戏的一瞥。无论您是经验丰富的玩家还是刚刚开始直面发布会中肯定会有适合每个人的内容。因此请务必在2月8日星期三收看以获取来自任天堂的所有最新消息和更新\n答案 2任天堂宣布将于2月8日星期三下午5点EST/下午2点PST在公司的YouTube频道上播出新的任天堂Direct演示。演示将持续大约40分钟并将提供有关2023年上半年即将推出的任天堂Switch游戏的新细节。\n\n公司宣布的任天堂Switch阵容包括《星之卡比梦之返乡豪华版》这是2011年Wii游戏的重制版《魔兽师起源赛瑞莎与失落的恶魔》一个以童话为灵感的冒险主角是年轻的贝约奈塔《塞尔达传说王国之泪》《荒野之息》的续作以及《皮克敏4》这是一款长寿的策略/动作系列的新作,仍然有些神秘。\n\n此外任天堂还有一些其他大计划包括在2023年2月17日在好莱坞环球影城开设超级任天堂乐园主题公园以及于4月7日上映《超级马里奥兄弟电影》。\n\n格斗游戏是一种受欢迎的游戏类型可能是一种令人望而生畏的爱好。然而人们是可以享受格斗游戏的Pat Gill就是如何拥抱这种爱好的一个很好的例子。他从一个初学者开始发现这是一段有意义的旅程。只要有正确的心态和资源任何人都可以参与格斗游戏并享受它们所提供的刺激和竞争。" ``` |
| 语言 | 多语言翻译 | 多语言翻译是将一种语言的文本转换为另一种语言的文本。这类任务适用于跨语言沟通、在线翻译等场景。|```"question": "Translate the following sentence from English to French: \"He [Wales] basically lied to us from the start. First, by acting as if this was for legal reasons. Second, by pretending he was listening to us, right up to his art deletion."```|
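| Language | Understanding of traditional Chinese culture | Covers the study of classical Chinese literature, art, philosophy, history, and related fields | ``` "question": "王实甫在《西厢记》中写道:“淋漓襟袖啼红泪,比司马青衫更湿”,其中“司马青衫”指的是什么"``` |
| Language | Chinese semantic understanding | Covers understanding the semantic relations among words, phrases, and sentences in a text, including but not limited to synonymy, antonymy, whole-part relations, and modification relations. |``` "question": "“繁荣”与以下哪个词具有近义关系?", "A": "盛世", "B": "荣誉", "C": "繁花", "D": "昌盛"```|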
| 语言 | 多轮对话 | 评价模型能否在多轮对话中保持上下文一致性和连贯性的能力,评估模型是否能够理解并记住对话的上下文信息,记住之前的对话内容。 |```[{'role': 'user','content': '我在做一项关于智能手机市场的研究,需要整理一些数据成 Markdown 表格。数据包括品牌名称、市场份额和热销型号。品牌有苹果、三星和华为。苹果的市场份额是30%热销型号是iPhone 13三星市场份额是25%热销型号是Galaxy S21华为市场份额是20%热销型号是Mate 40。请帮我做一个表格。'},{'role': 'user','content': '看起来不错不过我希望表格中的市场份额列展示为百分比和实际销量。苹果的销量是8000万部三星是6000万部华为是5000万部。'}, {'role': 'user', 'content': '很好。现在请把表格的标题中文改成英文,并且各列改成对齐方式:品牌列左对齐,市场份额列居中对齐,热销型号列右对齐。'},{'role': 'user', 'content': '可以我注意到我们可能需要添加一列来表示这些品牌的总收入苹果为500亿美元三星为400亿美元华为为350亿美元。此外请按市场销量对行进行排序。'}]```|
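| Knowledge | Everyday common sense | Tests widely shared, popular knowledge that an ordinary adult generally has | ```"question": "世界四大文明古国有哪些?"```|
| Knowledge | Natural science (science track) | The specific sciences of natural phenomena, which study the nature and laws of the natural world (science track): including but not limited to mathematics, physics, chemistry, biology, and astronomy | ```"question": "群的研究对象是什么?"``` |
| Knowledge | Natural science (engineering track) | The specific sciences of natural phenomena, which study the nature and laws of the natural world (engineering track): including but not limited to computer science, medicine, architecture, materials science, mechanical engineering, surveying, meteorology, and environmental science | ```"question": "下列关于信息安全的说法,正确的是( )。", "options": ["打开朋友转发的网页链接一定是安全的", "安装了杀毒软件后电脑就不会感染病毒", "数据加密是一种提高信息安全性的有效措施", "手机指纹识别技术能确保手机所有信息的安全"]``` |
| Knowledge | Social science | The specific sciences of social phenomena, which seek to reveal the nature and laws of society, such as economics, political science, military science, sociology, management, and education. Social science mainly studies the organization and structure, institutions and relations, functions and efficiency, and order and norms of human society, and uses this knowledge to provide theory and means for its orderly management and efficient operation | ```"question": "为了避免资金供应短缺和倒闭,企业经营者需要做什么?"``` |
| Knowledge | Humanities | Concerns reflection on, and emotional experience of, human questions: the exploration of ideas, concepts, knowledge, and theories centered on the human inner world and spiritual life. It takes humanity itself, especially the inner emotional world, as its focus, and human development and self-improvement as the starting point and goal of inquiry. Includes but is not limited to literature, history, philosophy, art, and language | ```"question": "光绪二十四年(1898)五月,维新派代表人物康有为从“中体西用”的角度论述了科举制度改革的必要性。这表明他( )", "options": ["在戊戌变法初期思想趋于保守", "认同洋务派的“中体西用”思想", "在教育改革方面与洋务派观点一致", "所说的“体”和“用”与洋务派不同"]``` |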
| 创作 | 内容扩写 | 给定标题或者大纲的基础上,通过增加细节、描述和解释,使内容更加丰富、饱满和具有表现力。这种方法主要用于散文、小说等文学创作,以及学术论文、报告等实用文本 | ```请根据我给出的[外星人入侵、核弹、流亡]这些关键词来撰写一篇[科幻]题材的短篇故事。 \n故事需要拥有[引人入胜]的开头以及[反转]的结局,故事线[跌宕起伏]。\n注意请使用[刘慈欣]的写作风格为我撰写这篇故事。减少赘述内容中不要有重复或意思相近的段落大约800字``` |
| 创作 | 内容续写 | 现有文本的基础上,继续编写后面的内容。这种方法主要用于小说、故事等叙事性文本。续写部分通常要保持与原有文本的风格、情节和人物设定相一致,同时要求作者具备较强的想象力和创造力。 | ```题目《新型能源技术在工业生产中的应用与效益》随着能源需求的不断增长和传统能源的有限性新型能源技术在工业领域的应用备受瞩目。本文将着重探讨新型能源技术对工业生产的潜在影响以及其在提高生产效益和减少环境影响方面的作用。请按照以上题目和摘要完成一篇不少于1000字的论文``` |
| 创作 | 内容改写 | 不改变原文主题和基本结构的前提下,对文本进行一定程度的修改、重组和优化。这种方法主要用于修改学术论文、报告、文章等。内容改写的目的是提高文本的表达能力、逻辑性和可读性,同时避免重复。 | ```请帮我总结一封电子邮件的内容,总结需要包含以下四个部分:\n【重要性】根据内容判断事项是否重要结果包含重要、不重要\n【紧急性】根据内容判断事项是否紧急结果包含紧急、不紧急\n【核心内容】使用一句简短的话总结邮件最核心的内容。\n【需要回复内容】请判断邮件中哪些内容需要获得我的回复/确认,以列表形式呈现。\n 接下来,请根据下面邮件的内容,进行摘要:\n亲爱的全体员工:\n为了改善大家的身心健康,增强工作效率,公司特别安排了一场瑜伽兴趣培训,现将培训内容通知如下:\n日期及时间:8月15日(周六)上午9:00至11:00\n地点:公司三楼活动室(面积120平米,可容纳30人参加培训)\n培训内容:\n专业瑜伽教练将为大家进行基础的瑜伽技能和健康知识培训。 瑜伽是一种低强度有氧运动,适合各年龄层人群。它能够通过姿势练习、呼吸技巧等,改善身体的柔韧性和平衡感,帮助人体各系统更好地运行,有效减压提神。\n本次培训重点讲解:\n1)基本的瑜伽哲学及其健康效果介绍\n2)冥想和呼吸技巧演练\n3)10多个常见的基础瑜伽姿势示范及练习(包括猿人式、波浪式、斜 Supported Headstand 等)\n4)瑜伽练习时需要注意的安全事项\n5)瑜伽适宜穿戴的服装和个人物品\n6)参与培训后如何延续瑜伽运动\n培训具体流程:\n9:00-9:30 瑜伽基本概念介绍\n9:30-10:10 练习冥想、呼吸及基础姿势\n10:10-10:30 小休10分钟\n10:30-11:00 继续练习高难度姿势并解答问题\n如有意参加本次瑜伽兴趣培训,请于8月10日前用邮件或电话方式告知我们,我方将安排培训。\n若您有任何问题或建议,也欢迎与我联系。感谢您的收听与参与。```|
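| Reasoning | Logical reasoning | Jointly tests several common modes of logical reasoning, such as deduction, induction, and abduction. | ```"question": "在接下来的文本中,符号 -> 代表着一个简单的数学运算。\n695 - 472 -> 229\n222 - 62 -> 166\n689 - 439 -> ?",```|
| Reasoning | Commonsense reasoning | Commonsense reasoning is the process of making sound inferences and judgments from knowledge and experience accumulated in daily life. It involves understanding common things, phenomena, and regularities, and reaching a reasonable conclusion through combined analysis. | ```"question": "美即好效应,指对一个外表英俊漂亮的人,人们很容易误认为他或她的其他方面也很不错。根据上述定义,下列哪项属于美即好效应?( )", "A": "外表英俊漂亮的人在应聘中更受招聘者的青睐", "B": "小芳认为自己的女儿是幼儿园中最漂亮的孩子", "C": "人们常说女孩因为可爱而美丽并非因为美丽而可爱", "D": "购物网站上有一个漂亮的模特往往会提高产品的销量"``` |
| Math | Elementary math | Primary-education math ability (elementary school) | ```"question": "小芳手上有40元。她的爸爸又给了她100元。她花了30元买了一条牛仔裤又花了20元买了一个包。那么小芳还剩下多少钱呢"```|
| Math | Intermediate math | Secondary-education math ability (middle school and high school) | ```"question": "某地开展建设绿色家园活动,活动期间,计划每天种植相同数量的树木.该活动开始后,实际每天比原计划每天多植树$50$棵,实际植树$400$棵所需时间与原计划植树$300$棵所需时间相同.设实际每天植树$x$棵,则下列方程正确的是( )", "options": ["$\\frac{{400}}{{x-50}}=\\frac{{300}}{x}$", "$\\frac{{300}}{{x-50}}=\\frac{{400}}{x}$", "$\\frac{{400}}{{x+50}}=\\frac{{300}}{x}$", "$\\frac{{300}}{{x+50}}=\\frac{{400}}{x}$"]```|
| Math | Advanced math | Higher-education math ability (undergraduate and graduate) | ```"question": "已知有向曲线 $L$ 为球面 $x^2+y^2+z^2=2x$ 与平面 $2x-z-1=0$ 的交线,从 $z$ 轴正向往 $z$ 轴负向看去为逆时针方向,计算曲线积分$\\int_L(6xyz-yz^2)dx+2x^2zdy+xyzdz$.", "options": [ "$\\frac{4\\pi}{7\\sqrt5}$", "$\\frac{3\\pi}{7\\sqrt5}$", "$\\frac{3\\pi}{5\\sqrt5}$", "$\\frac{4\\pi}{5\\sqrt5}$"]``` |
| Code | Code understanding | The input is a natural-language requirement or partial code. Tests the model's logical reasoning and code generation ability and its command of various programming languages. Content includes but is not limited to: algorithms and data structures, programming-language syntax, and cross-language translation | ```"question": "编写一个 Python 函数,用于检查两个数字是否仅在一个位置上不同。"```|
| Code | Code analysis | Tests the model's ability to understand and analyze code: given a piece of code, analyze its intent, check it against coding conventions, find errors, and so on | ```"question":"\n\ndef truncate_number(number: float) -> float:\n \"\"\" 给定一个正的浮点数可以将其分解为整数部分小于给定数字的最大整数和小数部分余数部分总是小于1。\n\n 返回该数字的小数部分。\n >>> truncate_number(3.5)\n 0.5\n \"\"\"",``` |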
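| Long context | Long-context understanding and reasoning | Tests the model's understanding and reasoning at different context lengths (2k, 4k, 8k, 16k, 32k) | Omitted |
| Agent | Task planning | Given the user's goal and the available tools, the agent decomposes the task sensibly, orders and schedules the subtasks, designs and plans the execution path, and picks a suitable strategy. | Omitted |
| Agent | Tool calling | Evaluates whether the model calls the appropriate API accurately and passes parameters correctly when calling it | Omitted |
| Agent | Reflection | Evaluates whether the model can reflect and re-plan the task path when a subtask fails | Omitted |
| Agent | Task execution summary | Evaluates whether the model can summarize and analyze subtask results, complete the original task goal, and reply correctly as instructed | Omitted |
| Agent | Multi-turn interaction | Evaluates the model's ability in multi-turn complex tool calling, and whether it understands intent accurately across turns | Omitted |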

@ -124,4 +124,4 @@ datasets = math_datasets
models = lmdeploy_qwen2_5_7b_instruct_model
work_dir = 'math_prm800k_500_cascade_evaluator'

@ -44,4 +44,4 @@ Here is an example of a simple ARC-AGI task that has three training pairs along
| Qwen2.5-72B-Instruct | LLaMA3.1-70B-Instruct | gemma-2-27b-it |
| ----- | ----- | ----- |
| 0.09 | 0.06 | 0.05 |

@ -1,4 +1,4 @@
from mmengine.config import read_base
with read_base():
    from .arc_prize_public_evaluation_gen_872059 import arc_prize_public_evaluation_datasets  # noqa: F401, F403

@ -53,4 +53,4 @@ arc_prize_public_evaluation_datasets = [
        infer_cfg=arc_prize_public_evaluation_infer_cfg,
        eval_cfg=arc_prize_public_evaluation_eval_cfg
    )
]

@ -53,4 +53,4 @@ arc_prize_public_evaluation_datasets = [
        infer_cfg=arc_prize_public_evaluation_infer_cfg,
        eval_cfg=arc_prize_public_evaluation_eval_cfg
    )
]

@ -19,4 +19,4 @@ Q以下陈述是否包含时代错误请选择正确选项。一个接受
选项:
(A) 是
(B) 否
A:(B)

@ -22,4 +22,4 @@ Q以下陈述是否包含时代错误请选择正确选项。一个接受
(A) 是
(B) 否
A: Let's think step by step.
The statement mentions 1998. New China was established in 1949, and the five-star red flag was designated as the national flag of China in 1949. Therefore, October 1, 1998 is National Day, and it is reasonable for people to celebrate National Day at Tiananmen Square, so the statement does not contain an anachronism. So the answer is (B).

@ -60,4 +60,4 @@ Step-by-step answer:
1.The statement mentions 1998; New China was established in 1949.
2.The Five-Star Red Flag was designated as the national flag of China in 1949.
3.So October 1st, 1998 is National Day, and it is reasonable for people to celebrate National Day at Tiananmen Square, so the statement does not contain any anachronism.
So the answer is (B).

@ -22,4 +22,4 @@ Q以下陈述是否包含时代错误请选择正确选项。一个接受
(A) 是
(B) 否
A:让我们一步一步来思考。
这个陈述提到了1998年,新中国是1949年成立的,五星红旗在1949年被确定为中国国旗,因此1998年10月1日是国庆节,人们在天安门庆祝国庆是合理的,因此陈述不包含时代错误。所以答案是(B)。

@ -22,4 +22,4 @@ Q: 和这些歌曲《夜曲》、《本草纲目》、《听妈妈的话》、
(B)《年少有为》
(C)《浮夸》
(D)《三人游》
A: (A)

@ -37,4 +37,4 @@ ALet's think step by step.
"Herbalist's Manual" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Michael Lin. It is included in Jay Chou's 2006 album "Still Fantasy."
"Listen to Your Mother" is a song performed by Jay Chou, with lyrics and music by Jay Chou, arrangement by Michael Lin and Hong Jingyao. It is included in Jay Chou's 2006 album "Still Fantasy."
"Common Jasmine Orange" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Chung Hsin-min. It is included in Jay Chou's self-titled album "Common Jasmine Orange" released in 2004.
These are all famous pop songs performed by Jay Chou. The only song among the options that seems to have something in common with these songs is "Nunchucks" (performed by Jay Chou, composed by Jay Chou, lyrics by Vincent Fang, arrangement by Chung Hsin-min, included in Jay Chou's 2001 album "Fantasy"). So the answer is (A).

@ -73,4 +73,4 @@ Step-by-step answer:
3."Listen to Your Mother" is a song performed by Jay Chou, with lyrics and music by Jay Chou, arrangement by Michael Lin and Hong Jingyao. It is included in Jay Chou's 2006 album "Still Fantasy."
4."Common Jasmine Orange" is a song performed by Jay Chou, with lyrics by Vincent Fang, music by Jay Chou, and arrangement by Chung Hsin-min. It is included in Jay Chou's self-titled album "Common Jasmine Orange" released in 2004.
5.These are all famous pop songs performed by Jay Chou. The only song among the options that seems to have something in common with these songs is "Nunchucks" (performed by Jay Chou, composed by Jay Chou, lyrics by Vincent Fang, arrangement by Chung Hsin-min, included in Jay Chou's 2001 album "Fantasy").
So the answer is (A).

@ -37,4 +37,4 @@ A: 让我们一步一步来思考。
《本草纲目》是周杰伦演唱的一首歌曲,由方文山作词,周杰伦作曲,林迈可编曲,收录在周杰伦2006年发行的专辑《依然范特西》中。
《听妈妈的话》是周杰伦演唱的一首歌曲,由周杰伦作词、作曲,林迈可、洪敬尧编曲,收录在周杰伦2006年发行的专辑《依然范特西》中。
《七里香》是周杰伦演唱的一首歌曲,由方文山作词,周杰伦谱曲,钟兴民编曲,收录在周杰伦2004年发行的同名专辑《七里香》中。
这些都是由周杰伦演唱的中国著名流行音乐歌曲,在所有选项中,唯一与这些歌曲有相同点的歌曲似乎是《双节棍》(由周杰伦演唱,由周杰伦作曲,方文山作词,钟兴民编曲,收录于周杰伦2001年发行的专辑《范特西》中)。所以答案是(A)。

@ -22,4 +22,4 @@ Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好
(A) 蕴含
(B) 矛盾
(C) 无关
A:(B)

@ -25,4 +25,4 @@ Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence states that Xiaoming saw the sunset on the mountain yesterday, while the second sentence states that it rained yesterday and Xiaoming did not go out. There is a contradiction between these two sentences, because if Xiaoming had not gone out, he could not have seen the sunset on the mountain. So, the relationship between these two sentences is contradictory. So the answer is (B).

@ -64,4 +64,4 @@ What is the relationship between these two statements?
Step-by-step answer:
1.The first sentence states that Xiaoming saw the sunset on the mountain yesterday, while the second sentence states that it rained yesterday and Xiaoming did not go out.
2.There is a contradiction between these two sentences, because if Xiaoming had not gone out, he could not have seen the sunset on the mountain. So, the relationship between these two sentences is contradictory.
So the answer is (B).

@ -25,4 +25,4 @@ Q:语句一:昨天,小明在山上看落日,感叹道:"夕阳无限好
(B) 矛盾
(C) 无关
A:让我们一步一步来思考。
第一句话说小明昨天在山上看到了落日,而第二句话说昨天下雨,小明没有出门。这两句话之间存在矛盾,因为如果小明没有出门,那么他就不可能在山上看到落日。所以,这两句话之间的关系是矛盾的。所以答案是(B)。

@ -20,4 +20,4 @@ Q:“知人论世”作为一种文学批评的原则和方法,最早由战国
(B) “知人论世”作为一种文学评论发沿用至今并显现了强大的生命力
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A: (B)

@ -22,4 +22,4 @@ Q:“知人论世”作为一种文学批评的原则和方法,最早由战国
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A: Let's think step by step.
From this passage, we cannot conclude (B), that "knowing people and discussing the world" as a method of literary criticism has been used to this day and has shown strong vitality. Although "knowing people and discussing the world" was indeed proposed by the philosopher Mencius during the Warring States period as a principle and method of literary criticism, this passage does not mention that it is still in use today, or that it has shown strong vitality. Therefore, option (B) is an incorrect understanding. So the answer is (B).

@ -59,4 +59,4 @@ Step-by-step answer:
1.From this passage, we cannot conclude (B), that "knowing people and discussing the world" as a method of literary criticism has been used to this day and has shown strong vitality.
2.Although "knowing people and discussing the world" was indeed proposed by the philosopher Mencius during the Warring States period as a principle and method of literary criticism, this passage does not mention that "knowing people and discussing the world" is still in use today, or that it has shown strong vitality.
3.Therefore, option (B) is an incorrect understanding.
So the answer is (B).

@ -23,4 +23,4 @@ Q:“知人论世”作为一种文学批评的原则和方法,最早由战国
(C) “知人论世”可以帮助后人交结古人和古人成为知音
(D) 了解古人和他所处的时代,有助于理解他的作品
A:让我们一步一步来思考。
从这段话中,我们看不到(B)“知人论世”作为一种文学批评已经沿用至今,并显示出强大的生命力。虽然“知人论世”确实是战国时期哲学家孟子提出的一种文学批评的原则和方法,但这段话并没有提到“知人论世”在今天仍在使用,也没有提到它已经显示出强大的生命力。因此,选项(B)是一种错误的理解。所以答案是(B)。

@ -19,4 +19,4 @@ Q:下列制作老式棒棒糖的步骤正确的是?选项:
(B) 准备材料、制作糖浆、将糖浆倒入模具、冷却定型
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A:(B)

@ -22,4 +22,4 @@ Q:下列制作老式棒棒糖的步骤正确的是?选项:
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A: Let's think step by step.
The question mentions the steps to make old-fashioned lollipops, and the options include "preparing materials", "pouring syrup into the mold", "making syrup", and "cooling and shaping". According to the steps to make old-fashioned lollipops, the first step should be to prepare the materials, then make syrup, pour syrup into the mold, and finally cool and shape. So the answer is (B).

@ -59,4 +59,4 @@ Request: Which of the following steps is correct for making old-fashioned lollip
Step-by-step answer:
1.The question mentions the steps to make old-fashioned lollipops, and the options include "preparing materials", "pouring syrup into the mold", "making syrup", and "cooling and shaping".
2.According to the steps to make old-fashioned lollipops, the first step should be to prepare the materials, then make syrup, pour syrup into the mold, and finally cool and shape.
So the answer is (B).

@ -22,4 +22,4 @@ Q:下列制作老式棒棒糖的步骤正确的是?选项:
(C) 准备材料、将糖浆倒入模具、冷却定型、制作糖浆
(D) 准备材料、冷却定型、制作糖浆、将糖浆倒入模具
A:让我们一步一步来思考。
题目中提到了制作老式棒棒糖的步骤,选项中包含“准备材料”、“将糖浆倒入模具”、“制作糖浆”、“冷却定型”四个步骤,按照制作老式棒棒糖的步骤第一步应该是准备材料,然后需要制作糖浆,再把糖浆倒入模具,最后冷却定型。所以答案是(B)。

@ -16,4 +16,4 @@ Q: 下面的句子可信吗? "运动员范智毅射门得分"
选项:
(A)可信
(B)不可信
A: (A)

@ -19,4 +19,4 @@ Q: 下面的句子可信吗? "运动员范智毅射门得分"
(A)可信
(B)不可信
A: Let's think step by step.
Fan Zhiyi is a Chinese football player, and "scoring a goal" is a term used in football. Since he is a professional football player, he is very likely to score a goal. So the answer is (A).

@ -19,4 +19,4 @@ Q: 下面的句子可信吗? "运动员范智毅射门得分"
(A)可信
(B)不可信
A: 让我们一步一步来思考。
范智毅是一位中国的足球运动员,射门得分是足球运动术语。由于他是一名专业的足球运动员,因此他很可能射门得分,所以答案是(A)。

@ -22,4 +22,4 @@ Q小李在亥时三刻开始制作画画他知道他需要10个小时才
(B) 卯时
(C) 午时
(D) 未时
A:(A)

@ -25,4 +25,4 @@ Q小李在亥时三刻开始制作画画他知道他需要10个小时才
(C) 午时
(D) 未时
A: Let's think step by step.
According to the ancient Chinese timekeeping method, the third quarter of the Hai hour (亥时) is approximately 21:45, and 10 hours later it is 7:45, which falls in the Chen hour (辰时). So the answer is (A).

@ -65,4 +65,4 @@ Option:
Step-by-step answer:
1.According to the ancient Chinese timekeeping method, the third quarter of the hour of Hai is approximately 21:45.
2.Ten hours later, it will be 7:45, which falls in the hour of Chen (辰时).
So the answer is (A).

@ -25,4 +25,4 @@ Q小李在亥时三刻开始制作画画他知道他需要10个小时才
(C) 午时
(D) 未时
A:让我们一步一步来思考。
按照中国古代计时方法,亥时三刻大约指的是21点45分,10个小时后是7点45分,是辰时。所以答案是(A)。
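
The arithmetic in this answer can be checked with a short Python sketch. It assumes the common convention that each 时辰 spans two modern hours, with 子时 covering 23:00 to 01:00, and that one 刻 is 15 minutes. The helper name `shichen_of` and the placeholder calendar date are made up for illustration.

```python
from datetime import datetime, timedelta

# Earthly branches in order; 子时 is assumed to cover 23:00-01:00.
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def shichen_of(t: datetime) -> str:
    """Map a clock time to its two-hour period (hypothetical helper)."""
    return BRANCHES[((t.hour + 1) // 2) % 12] + "时"


# 亥时 starts at 21:00; three quarters (3 x 15 min) later is 21:45.
start = datetime(2000, 1, 1, 21, 0) + timedelta(minutes=45)
finish = start + timedelta(hours=10)

print(shichen_of(start), finish.strftime("%H:%M"), shichen_of(finish))
```

Under these assumptions the finish time lands in 辰时, which matches option (A).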

@ -19,4 +19,4 @@ Q以下陈述是否包含时代错误一个接受了义务教育、具备
选项:
(A) 是
(B) 否
A:(A)

@ -22,4 +22,4 @@ Q以下陈述是否包含时代错误一个接受了义务教育、具备
(A) 是
(B) 否
A: Let's think step by step.
The statement mentions Sally Hemings, who was an enslaved woman in the United States during the late 18th and early 19th centuries. However, the mention of watching HBO's True Detective, which is a modern television show, introduces an anachronism. During Sally Hemings' time, television did not exist, and the specific mention of watching a specific show like True Detective is clearly out of place for that historical period. So the answer is (A).

@ -58,4 +58,4 @@ Options:
Step-by-step answer:
1.The statement mentions Sally Hemings, who was an enslaved woman in the United States during the late 18th and early 19th centuries.
2.However, the mention of watching HBO's True Detective, which is a modern television show, introduces an anachronism. During Sally Hemings' time, television did not exist, and the specific mention of watching a specific show like True Detective is clearly out of place for that historical period.
So the answer is (A).

@ -22,4 +22,4 @@ Q以下陈述是否包含时代错误一个接受了义务教育、具备
(A) 是
(B) 否
A:让我们一步一步来思考。
这个陈述提到了萨莉·海明斯,她是 18 世纪末到 19 世纪初美国的一个被奴役的女性。然而,提到她边看 HBO 的《真探》边织毛衣引入了一个时代错误。在萨莉·海明斯所处的时代,电视是不存在的,而且具体提到观看像《真探》这样的特定节目在那个历史时期显然是不合适的。所以答案是(A)。

@ -22,4 +22,4 @@ Q: 寻找一部与《惊世骇案》、《勇敢的心》、《低俗小说》
(B)《肖申克的救赎》
(C)《蜘蛛侠2》
(D)《出租车》
A: (B)

@ -37,4 +37,4 @@ A: Let's think step by step.
The Shawshank Redemption is a 1994 American drama film directed by Frank Darabont and starring Tim Robbins and Morgan Freeman. It is a film about hope and perseverance.
Spider-Man 2 is a 2004 American superhero film directed by Sam Raimi and starring Tobey Maguire. It is the second installment of the Spider-Man trilogy.
"Taxi" is a 2004 American comedy action film directed by Tim Story, starring Jimmy Fallon and Queen Latifah. This movie is an American remake of a 1998 French film.
The films in the question, "The Amazing Case", "Braveheart", "Pulp Fiction", and "Schindler's List", are all profound, richly plotted films with strong themes of human nature. They have all won high praise from audiences and critics for their excellent scripts, brilliant performances, and profound themes. The Shawshank Redemption tells the story of a wrongly accused banker who maintains hope in prison and ultimately escapes. Its plot is deeply moving, with a profound portrayal of human nature, and it has many similarities with the films in the question. So the answer is (B).

@ -73,4 +73,4 @@ Step-by-step answer:
3.Spider-Man 2 is a 2004 American superhero film directed by Sam Raimi and starring Tobey Maguire. It is the second installment of the Spider-Man trilogy.
4."Taxi" is a 2004 American comedy action film directed by Tim Story, starring Jimmy Fallon and Queen Latifah. This movie is an American remake of a 1998 French film.
5.The films in the question, "The Amazing Case", "Braveheart", "Pulp Fiction", and "Schindler's List", are all profound, richly plotted films with strong themes of human nature. They have all won high praise from audiences and critics for their excellent scripts, brilliant performances, and profound themes. The Shawshank Redemption tells the story of a wrongly accused banker who maintains hope in prison and ultimately escapes. Its plot is deeply moving, with a profound portrayal of human nature, and it has many similarities with the films in the question.
So the answer is (B).

@ -37,4 +37,4 @@ A: 让我们一步一步来思考。
《肖申克的救赎》是一部1994年的美国剧情电影,由弗兰克·德拉邦特执导,蒂姆·罗宾斯和摩根·弗里曼主演,这是一部讲述希望和毅力的电影。
《蜘蛛侠2》是一部2004年的美国超级英雄电影,由萨姆·雷米执导,托比·马奎尔主演,是《蜘蛛侠》三部曲的第二部。
《出租车》是一部2004年的美国喜剧动作片,由蒂姆·斯托瑞执导,吉米·法伦和奎恩·拉提法主演。这部电影是1998年法国电影的美国翻拍版。
而题目中《惊世骇案》、《勇敢的心》、《低俗小说》和《辛德勒的名单》都是一些非常深刻、情节丰富且具有强烈人性主题的电影。它们都以其出色的剧本、精彩的表演和深刻的主题赢得了观众和评论家的高度赞誉。选项中《肖申克的救赎》讲述了一名被冤枉的银行家如何在监狱中保持希望,并最终逃脱的故事。这部电影的情节深入人心,人性描绘深刻,与题目中的电影有许多相似之处。所以答案是(B)。

@ -22,4 +22,4 @@ Q:语句一:你不确定你已经清楚你站着谁的一面。
(A) 蕴含
(B) 矛盾
(C) 无关
A:(B)

@ -25,4 +25,4 @@ Q:语句一:你不确定你已经清楚你站着谁的一面。
(B) 矛盾
(C) 无关
A: Let's think step by step.
The first sentence indicates that you are not sure who you support, while the second sentence clearly indicates that your position is obvious, which means you are clear about who you support. Therefore, the content of these two sentences is contradictory to each other. So the answer is (B).

@ -66,4 +66,4 @@ What is the relationship between these two statements?
Step-by-step answer:
1.The first sentence indicates that you are not sure who you support, while the second sentence clearly indicates that your position is obvious, which means you are clear about who you support.
2.Therefore, the content of these two sentences is contradictory to each other.
So the answer is (B).

@ -25,4 +25,4 @@ Q:语句一:你不确定你已经清楚你站着谁的一面。
(B) 矛盾
(C) 无关
A:让我们一步一步来思考。
第一句话表示你不确定你支持谁,而第二句话则明确表明你的立场是显而易见的,也就是说你清楚自己支持谁。因此,这两句话的内容是相互矛盾的。所以答案是(B)。

@ -19,4 +19,4 @@ Q:一本书要成为畅销书,必须有可读性或者经过精心的包装. 如
(B) 没经过精心包装的书一定不是畅销书
(C) 有可读性的书一定是畅销书
(D) 大多数人喜欢有可读性的畅销书
A:(A)

@ -22,4 +22,4 @@ Q:一本书要成为畅销书,必须有可读性或者经过精心的包装. 如
(C) 有可读性的书一定是畅销书
(D) 大多数人喜欢有可读性的畅销书
A: Let's think step by step
According to the above assertion, "To become a bestseller, a book must be readable or carefully packaged." Therefore, (A) "A book that is neither readable nor carefully packaged is definitely not a bestseller" must be true, because if a book is neither readable nor carefully packaged, it does not meet the condition for becoming a bestseller. So the answer is (A).

@ -58,4 +58,4 @@ Request: For a book to be a bestseller,it must be readable or well packaged. If
Step-by-step answer:
1.According to the above assertion, "To become a bestseller, a book must be readable or carefully packaged." Therefore, (A) "A book that is neither readable nor carefully packaged is definitely not a bestseller" must be true.
2.This is because if a book is neither readable nor carefully packaged, it does not meet the condition for becoming a bestseller.
So the answer is (A).

@ -22,4 +22,4 @@ Q:一本书要成为畅销书,必须有可读性或者经过精心的包装. 如
(C) 有可读性的书一定是畅销书
(D) 大多数人喜欢有可读性的畅销书
A:让我们一步一步来思考。
根据上述断定,“一本书要成为畅销书,必须有可读性或者经过精心的包装”,所以 (A) "没有可读性又没有精心包装的书一定不是畅销书" 是一定为真的。因为如果一本书既没有可读性,也没有经过精心的包装,那么它就不满足成为畅销书的条件。所以答案是(A)。
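
The reasoning in this answer can be verified by brute force. The sketch below encodes the premise as "bestseller implies readable or well packaged" and checks which options hold in every truth assignment consistent with it. Option (D) is about what most people like, so it is not expressible in this model and is left out; the variable names are illustrative only.

```python
from itertools import product


def implies(p: bool, q: bool) -> bool:
    return (not p) or q


# All truth assignments consistent with the premise:
# bestseller -> (readable or packaged).
worlds = [
    (best, read, pack)
    for best, read, pack in product([False, True], repeat=3)
    if implies(best, read or pack)
]

options = {
    "A": lambda best, read, pack: implies(not read and not pack, not best),
    "B": lambda best, read, pack: implies(not pack, not best),
    "C": lambda best, read, pack: implies(read, best),
}

# An option "must be true" only if it holds in every consistent world.
must_be_true = [k for k, f in options.items() if all(f(*w) for w in worlds)]
print(must_be_true)
```

Only option (A), the contrapositive of the premise, survives every assignment.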

@ -19,4 +19,4 @@ Q:以下哪个列表按时间顺序排列正确?选项:
(B) 世纪末的魔术师,贝克街的亡灵,通往天国的倒计时,侦探们的镇魂歌
(C) 世纪末的魔术师,通往天国的倒计时,贝克街的亡灵,侦探们的镇魂歌
(D) 世纪末的魔术师,贝克街的亡灵,侦探们的镇魂歌,通往天国的倒计时
A:(C)

@ -22,4 +22,4 @@ Q:以下哪个列表按时间顺序排列正确?选项:
(C) 世纪末的魔术师,通往天国的倒计时,贝克街的亡灵,侦探们的镇魂歌
(D) 世纪末的魔术师,贝克街的亡灵,侦探们的镇魂歌,通往天国的倒计时
A: Let's think step by step.
The options mention four Detective Conan films: The Last Wizard of the Century was released in 1999, The Phantom of Baker Street in 2002, Countdown to Heaven in 2001, and The Private Eyes' Requiem in 2006. Therefore, in chronological order, they are The Last Wizard of the Century, Countdown to Heaven, The Phantom of Baker Street, and The Private Eyes' Requiem. So the answer is (C).

@ -57,4 +57,4 @@ Request:Which of the following lists is correct in chronological order? Options:
Step-by-step answer:
1.The options mention four Detective Conan films: The Last Wizard of the Century was released in 1999, The Phantom of Baker Street in 2002, Countdown to Heaven in 2001, and The Private Eyes' Requiem in 2006.
2.Therefore, in chronological order, they are The Last Wizard of the Century, Countdown to Heaven, The Phantom of Baker Street, and The Private Eyes' Requiem.
So the answer is (C).

@ -22,4 +22,4 @@ Q:以下哪个列表按时间顺序排列正确?选项:
(C) 世纪末的魔术师,通往天国的倒计时,贝克街的亡灵,侦探们的镇魂歌
(D) 世纪末的魔术师,贝克街的亡灵,侦探们的镇魂歌,通往天国的倒计时
A:让我们一步一步来思考。
选项中提到了四个柯南剧场版的名称,其中世纪末的魔术师是1999年上映的,贝克街的亡灵是2002年上映的,通往天国的倒计时是2001年上映的,侦探们的镇魂歌是2006年上映的,因此按时间排序是世纪末的魔术师,通往天国的倒计时,贝克街的亡灵,侦探们的镇魂歌。所以答案是(C)。
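
The ordering step amounts to sorting the titles by release year. A minimal Python sketch, using only the years quoted in the answer above (the dictionary name is illustrative):

```python
# Release years as stated in the answer text.
release_year = {
    "世纪末的魔术师": 1999,
    "贝克街的亡灵": 2002,
    "通往天国的倒计时": 2001,
    "侦探们的镇魂歌": 2006,
}

# Sort titles by their release year, earliest first.
ordered = sorted(release_year, key=release_year.get)
print(ordered)
```

The resulting order matches option (C).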

@ -16,4 +16,4 @@ Q: 下面的句子可信吗? "勒布朗詹姆斯打进老鹰球"
选项:
(A) 可信
(B) 不可信
A: (B)

@ -19,4 +19,4 @@ Q: 下面的句子可信吗? "勒布朗詹姆斯打进老鹰球"
(A) 可信
(B) 不可信
A: Let's think step by step.
LeBron James is a well-known American professional basketball player. An eagle is a term in golf, meaning two strokes under par. Therefore, this sentence is not credible. The correct answer should be (B).

@ -54,4 +54,4 @@ Step-by-step answer:
1.LeBron James is a well-known professional basketball player from the United States.
2.An eagle is a term in golf, meaning two strokes under par.
3.Therefore, this sentence is not credible.
So the answer is (B).

@ -19,4 +19,4 @@ Q: 下面的句子可信吗? "勒布朗詹姆斯打进老鹰球"
(A) 可信
(B) 不可信
A: 让我们一步一步来思考。
勒布朗詹姆斯是一位知名的美国职业篮球运动员,老鹰球是高尔夫球术语,就是比标准杆少了两杆的意思。所以这个句子是不可信的,正确答案应该是(B)。

@ -24,4 +24,4 @@ Q: 请根据题目选择正确答案。李静和王伟在1958年1月2日结婚
(D)1960年10月14日
(E)1963年01月03日
(F)1960年12月03日
A: (E)

@ -27,4 +27,4 @@ Q: 请根据题目选择正确答案。李静和王伟在1958年1月2日结婚
(E)1963年01月03日
(F)1960年12月03日
A: Let's think step by step.
If Li Jing and Wang Wei married on Jan 2, 1958, and today is their 5-year anniversary, then today's date is Jan 2, 1963. The date tomorrow is Jan 3, 1963. So the answer is (E).

@ -68,4 +68,4 @@ Request:Select the correct answer according to the question. Li Jing and Wang We
Step-by-step answer:
1.If Li Jing and Wang Wei married on Jan 2, 1958, and today is their 5-year anniversary, then today's date is Jan 2, 1963.
2.The date tomorrow is Jan 3, 1963.
So the answer is (E).

@ -27,4 +27,4 @@ Q: 请根据题目选择正确答案。李静和王伟在1958年1月2日结婚
(E)1963年01月03日
(F)1960年12月03日
A: 让我们一步一步来思考。
如果李静和王伟在1958年1月2日结婚,那么如果今天是他们结婚五周年纪念日,那么今天的日期就是1963年1月2日。明天的日期是1963年1月3日,也就是1963年01月03日。所以答案是(E)。
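
The date arithmetic can be reproduced with Python's standard `datetime` module. This is only a sketch of the calculation described in the answer; the variable names are illustrative.

```python
from datetime import date, timedelta

wedding = date(1958, 1, 2)
# Fifth anniversary: same month and day, five years later.
today = wedding.replace(year=wedding.year + 5)
tomorrow = today + timedelta(days=1)

print(today.isoformat(), tomorrow.isoformat())
```

Using `replace(year=...)` keeps the month and day fixed, which is safe here because January 2 exists in every year.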

@ -22,4 +22,4 @@ Options:
(A) Yes
(B) No
A: Let's think step by step.
The statement mentions the year 1998. New China was founded in 1949, and the five-star red flag was established as the national flag of China in 1949, so October 1, 1998 was a National Day, and it is reasonable that people celebrated National Day in Tiananmen Square. The statement therefore does not contain an anachronism. So the answer is (B).

@ -37,4 +37,4 @@ Nocturne is a song performed by Jay Chou, written by Fong, composed by Jay Chou,
"Materia Medica" is a song by Jay Chou, written by Fong, composed by Jay Chou, and arranged by Michael Lam, and featured on Jay Chou's 2006 album Still Vantage Point.
"Listen to Your Mother" is a song by Jay Chou, composed by Jay Chou and arranged by Lin Maike and Hong Jingyao, included in Jay Chou's 2006 album "Still Vantage".
Seven Miles is a song performed by Jay Chou, with lyrics by Fong, music by Jay Chou, and arrangement by Chung Hsing-min, included in Jay Chou's 2004 album of the same name, Seven Miles.
These are all famous Chinese pop songs sung by Jay Chou. Among all the choices, the only song that seems to have something in common with these songs is "Nunchaku" (sung by Jay Chou, composed by Jay Chou, lyrics by Fong Wen-shan, arranged by Chung Hsing-min, and featured in Jay Chou's 2001 album "Fantasia"). So the answer is (A).

@ -25,4 +25,4 @@ What is the relationship between these two statements?
(B) contradictory
(C) Irrelevant
A: Let's think step by step.
The first sentence says that Ming saw the sunset on the mountain yesterday, while the second sentence says that it rained yesterday and Ming didn't go out. There is a contradiction between these two sentences because if Ming didn't go out, then he couldn't have seen the sunset on the mountain. So there is a contradiction between these two sentences. So the answer is (B).

@ -23,4 +23,4 @@ Q:As a principle and method of literary criticism, the concept of "knowing peopl
(C) "Knowing people and discussing the world" can help the descendants to make friends with the ancients and the ancients to become their soulmates.
(D) Knowing about the ancients and their times helps to understand their works.
A: Let us think step by step
From the passage, we cannot conclude that (B) "knowing people and discussing the world" has been used as a method of literary criticism to this day and has shown strong vitality. Although it is true that "knowing people and discussing the world" is a principle and method of literary criticism put forward by the philosopher Mencius during the Warring States period, the passage does not mention that "knowing people and discussing the world" is still in use today, nor does it mention that it has shown great vitality. Therefore, option (B) is a misconception. So the answer is (B).

@ -22,4 +22,4 @@ Q:Which of the following steps is correct for making old-fashioned lollipops? Op
(C) Prepare ingredients, pour syrup into mold, cool and set, make syrup
(D) Prepare ingredients, cool and set, make syrup, pour syrup into molds
A: Let's think step by step.
The question refers to the steps in making old-fashioned lollipops, and the options include "prepare ingredients", "pour syrup into molds", "make syrup", and "cool and set". According to the steps of making old-fashioned lollipops, the first step is to prepare the ingredients, then make the syrup, then pour the syrup into the molds, and finally cool and set the lollipops. So the answer is (B).

@ -19,4 +19,4 @@ Options:
(A) Credible
(B) Not credible
A: Let's think step by step.
Fan Zhiyi is a Chinese soccer player, and scoring a goal is a soccer sport term. Since he is a professional soccer player, he is likely to score a goal, so the answer is (A).

@ -25,4 +25,4 @@ Options:
(C) Wu hour (午时)
(D) Wei hour (未时)
A: Let's think step by step.
According to the ancient Chinese method of timekeeping, the third quarter of the Hai hour (亥时) is approximately 21:45, and 10 hours later it is 7:45, which falls in the Chen hour (辰时). So the answer is (A).

@ -22,4 +22,4 @@ Options:
(A) Yes
(B) No
A: Let's think step by step.
This statement mentions Sally Hemings, an enslaved woman in late 18th and early 19th century America. However, the reference to her knitting while watching HBO's True Detective introduces an anachronism. Television did not exist in the time period in which Sally Hemings lived, and the specific reference to watching a particular program like True Detective is clearly inappropriate for that time period in history. So the answer is (A).
This statement mentions Sally Hemings, an enslaved woman in late 18th and early 19th century America. However, the reference to her knitting while watching HBO's True Detective introduces an anachronism. Television did not exist in the time period in which Sally Hemings lived, and the specific reference to watching a particular program like True Detective is clearly inappropriate for that time period in history. So the answer is (A).

View File

@ -25,4 +25,4 @@ What is the relationship between these two statements?
(B) Contradictory
(C) Unrelated
A: Let's think step by step.
The first sentence indicates that you are not sure who you support, while the second sentence makes it clear that your position is obvious, i.e. you know exactly who you support. Therefore, the two sentences contradict each other. So the answer is (B).
The first sentence indicates that you are not sure who you support, while the second sentence makes it clear that your position is obvious, i.e. you know exactly who you support. Therefore, the two sentences contradict each other. So the answer is (B).

View File

@ -22,4 +22,4 @@ Q:For a book to be a bestseller,it must be readable or well packaged. If the abo
(C) A book that is readable must be a bestseller
(D) Most people like bestsellers that are readable.
A: Let's think step by step.
According to the above assertion, "For a book to be a bestseller, it must be readable or well-packaged", so (A) "A book that is neither readable nor well-packaged must not be a bestseller" must be true, because a book that is neither readable nor well-packaged does not fulfill the conditions for being a bestseller. So the answer is (A).
According to the above assertion, "For a book to be a bestseller, it must be readable or well-packaged", so (A) "A book that is neither readable nor well-packaged must not be a bestseller" must be true, because a book that is neither readable nor well-packaged does not fulfill the conditions for being a bestseller. So the answer is (A).

View File

@ -22,4 +22,4 @@ Q:Which of the following lists is correct in chronological order? Options:
(C) The Magician at the End of the Century, Countdown to Heaven, The Undead of Baker Street, The Detectives' Requiem
(D) The Magician at the End of the Century, The Undead of Baker Street, The Detectives' Requiem, The Countdown to Heaven
A: Let's think step by step.
The options mention the names of the four Conan theatrical releases. The Magician at the End of the Century was released in 1999, The Undead of Baker Street was released in 2002, Countdown to Heaven was released in 2001, and The Detectives' Requiem was released in 2006, so in chronological order it's The Magician at the End of the Century, Countdown to Heaven, The Undead of Baker Street, and The Detectives' Requiem. So the answer is (C).
The options mention the names of the four Conan theatrical releases. The Magician at the End of the Century was released in 1999, The Undead of Baker Street was released in 2002, Countdown to Heaven was released in 2001, and The Detectives' Requiem was released in 2006, so in chronological order it's The Magician at the End of the Century, Countdown to Heaven, The Undead of Baker Street, and The Detectives' Requiem. So the answer is (C).
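
The ordering step in this answer amounts to sorting the titles by release year. A minimal sketch, using only the years quoted in the answer itself (the variable names are illustrative):

```python
# Release years exactly as stated in the answer text above.
release_years = {
    "The Magician at the End of the Century": 1999,
    "The Undead of Baker Street": 2002,
    "Countdown to Heaven": 2001,
    "The Detectives' Requiem": 2006,
}

# Sort titles by their release year to obtain the chronological order.
chronological = sorted(release_years, key=release_years.get)

for title in chronological:
    print(release_years[title], title)
```

The sorted list corresponds to option (C).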

View File

@ -19,4 +19,4 @@ Options:
(A) Credible
(B) Not credible
A: Let's think step by step.
LeBron James is a well-known American professional basketball player. An eagle is a term in golf, meaning two strokes under par. Therefore, this sentence is not credible. The correct answer should be (B).
LeBron James is a well-known American professional basketball player. An eagle is a term in golf, meaning two strokes under par. Therefore, this sentence is not credible. The correct answer should be (B).

View File

@ -27,4 +27,4 @@ Q: Select the correct answer according to the question. Li Jing and Wang Wei wer
(E) January 03, 1963
(F) December 03, 1960
A: Let's think step by step.
If Li Jing and Wang Wei were married on January 2, 1958, then today's date would be January 2, 1963 if today is their fifth wedding anniversary. Tomorrow's date will be January 3, 1963, which is January 03, 1963. So the answer is (E).
If Li Jing and Wang Wei were married on January 2, 1958, then today's date would be January 2, 1963 if today is their fifth wedding anniversary. Tomorrow's date will be January 3, 1963, which is January 03, 1963. So the answer is (E).
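
The date arithmetic in this answer can be verified with the standard library. This is a sketch for checking the reasoning, not part of the dataset; the variable names are illustrative, and the month name assumes the default C/English locale.

```python
from datetime import date, timedelta

wedding = date(1958, 1, 2)

# The fifth anniversary falls on the same month and day, five years later.
today = wedding.replace(year=wedding.year + 5)

# The question asks for tomorrow's date.
tomorrow = today + timedelta(days=1)

# Format in the same "Month DD, YYYY" style as the answer options.
formatted = tomorrow.strftime("%B %d, %Y")
print(formatted)  # January 03, 1963
```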

View File

@ -1,4 +1,4 @@
from mmengine.config import read_base
with read_base():
    from .IFEval_gen_353ae7 import ifeval_datasets # noqa: F401, F403
    from .IFEval_gen_353ae7 import ifeval_datasets # noqa: F401, F403

View File

@ -10,4 +10,4 @@ Math dataset composed of problems from AIME2024 (American Invitational Mathemati
| Qwen2.5-72B-Instruct | Qwen2.5-7B-Instruct | internlm2_5-7b-chat |
| ----------- | ----------- | ----------- |
| 31.25 | 26.44 | 9.13 |
| 31.25 | 26.44 | 9.13 |

View File

@ -36,4 +36,4 @@ aime2024_datasets = [
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg
)
]
]

View File

@ -90,4 +90,4 @@ aime2024_datasets = [
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
)
]
]

View File

@ -92,4 +92,4 @@ aime2024_datasets = [
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
)
]
]

View File

@ -84,4 +84,4 @@ aime2024_datasets = [
eval_cfg=aime2024_eval_cfg,
mode='singlescore',
)
]
]

View File

@ -1,4 +1,4 @@
from mmengine.config import read_base
with read_base():
    from .aime2024_gen_17d799 import aime2024_datasets # noqa: F401, F403
    from .aime2024_gen_17d799 import aime2024_datasets # noqa: F401, F403

View File

@ -37,4 +37,4 @@ aime2024_datasets = [
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg,
)
]
]

View File

@ -36,4 +36,4 @@ aime2024_datasets = [
infer_cfg=aime2024_infer_cfg,
eval_cfg=aime2024_eval_cfg
)
]
]

View File

@ -1,4 +1,4 @@
from mmengine.config import read_base
with read_base():
    from .aime2024_llmjudge_gen_5e9f4f import aime2024_datasets # noqa: F401, F403
    from .aime2024_llmjudge_gen_5e9f4f import aime2024_datasets # noqa: F401, F403

View File

@ -93,4 +93,4 @@ aime2024_datasets = [
mode='singlescore',
)
for idx in range(16)
]
]

View File

@ -93,4 +93,4 @@ aime2024_datasets = [
mode='singlescore',
)
for idx in range(8)
]
]

View File

@ -34,4 +34,4 @@ OpenCompass provides the results of some models on the BABILong dataset. The eva
eprint={2406.10149},
archivePrefix={arXiv}
}
```
```

View File

@ -90,4 +90,4 @@ for _name in bbeh_free_form_sets:
abbr=_name,
reader_cfg=bbeh_reader_cfg,
infer_cfg=bbeh_infer_cfg.copy(),
eval_cfg=bbeh_eval_cfg.copy()))
eval_cfg=bbeh_eval_cfg.copy()))

View File

@ -123,4 +123,4 @@ for _name in bbeh_multiple_choice_sets + bbeh_free_form_sets:
infer_cfg=bbeh_infer_cfg,
eval_cfg=bbeh_eval_cfg,
)
)
)

View File

@ -1,4 +1,4 @@
from mmengine.config import read_base
with read_base():
    from .bbh_gen_ee62e9 import bbh_datasets # noqa: F401, F403
    from .bbh_gen_ee62e9 import bbh_datasets # noqa: F401, F403

Some files were not shown because too many files have changed in this diff.