mirror of https://github.com/open-compass/opencompass.git, synced 2025-05-30 16:03:24 +08:00
commit 996f8ae43d (parent f7266e873c): update readme for needlebench
# Needlebench: A Benchmark for Needle-In-A-Haystack Evaluations

English | [简体中文](readme_zh-CN.md)
## Overview

Needlebench is a comprehensive benchmark suite designed to rigorously evaluate the information retrieval and reasoning capabilities of large language models (LLMs). Drawing inspiration from the NeedleInAHaystack experiment, Needlebench broadens the scope to a variety of tasks, each aimed at testing a different facet of an LLM's ability to process, recall, and reason over information embedded in lengthy texts.
### Directory Structure

```
opencompass/datasets/needlebench/
├── atc.py       # Ancestral Trace Challenge
├── multi.py     # Multi-Needle Reasoning
├── origin.py    # Single-Needle Retrieval
├── parallel.py  # Multi-Needle Retrieval
└── readme.md

configs/datasets/needlebench/
├── atc
├── needlebench_4k
├── needlebench_8k
├── needlebench_32k
├── needlebench_128k
├── needlebench_200k
├── needlebench.py
├── readme.md
└── readme_zh-CN.md
```
## Task Descriptions and Length Configurations

Needlebench offers tasks in several length configurations, such as 4k, 8k, 32k, 128k, and 200k, to accommodate different scales of language model evaluation. Within each configuration directory (e.g., `needlebench_4k`), there are scripts tailored for testing at that specific length:

```
needlebench_4k/
├── needlebench_multi_reasoning.py
├── needlebench_multi_retrieval.py
├── needlebench.py
└── needlebench_single.py
```

### Single-Needle Retrieval (`needlebench_single.py`)

The Single-Needle Retrieval task is foundational to the Needlebench suite, evaluating the LLM's ability to recall a single piece of crucial information from a haystack text of a specific length. This task mirrors the original NeedleInAHaystack test, assessing the model's precision in identifying and recalling specific information within large bodies of text.
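As an illustration only, a single-needle test case can be sketched as follows. This is not OpenCompass's actual implementation; the function, filler text, and needle below are hypothetical, but they show the core idea: one key fact is planted at a chosen depth in otherwise irrelevant text, and the model is asked to recall it.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` roughly `depth` of the way through `haystack`
    (0.0 = beginning, 1.0 = end), snapping forward to a sentence boundary."""
    pos = int(len(haystack) * depth)
    cut = haystack.find(". ", pos)          # end of the current sentence, if any
    cut = cut + 2 if cut != -1 else len(haystack)
    return haystack[:cut] + needle + " " + haystack[cut:]

# Hypothetical filler and needle, for illustration only.
filler = "The town archives contain many unrelated records. " * 200
needle = "The secret code for the vault is 7402."
prompt = insert_needle(filler, needle, depth=0.5)
question = "What is the secret code for the vault?"
```

A model is then given `prompt` followed by `question`, and scored on whether it recalls the planted fact.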
### Multi-Needle Retrieval (`needlebench_multi_retrieval.py`)

The Multi-Needle Retrieval task challenges the LLM's ability to identify and extract multiple key pieces of information from extensive texts. It simulates real-world scenarios in which several data points, facts, or figures must be retrieved from documents or reports, evaluating the model's efficiency at navigating dense texts and extracting the relevant information.
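For the multi-needle retrieval setting, a minimal sketch (again hypothetical, not the dataset's real code; a real implementation might also align insertions to sentence boundaries) spreads several needles at evenly spaced depths:

```python
def insert_needles(haystack: str, needles: list[str]) -> str:
    """Spread several needles at evenly spaced depths through the haystack."""
    result = haystack
    n = len(needles)
    # Insert deepest-first so earlier insertion points stay valid.
    for i, needle in sorted(enumerate(needles), reverse=True):
        pos = int(len(haystack) * (i + 1) / (n + 1))
        result = result[:pos] + needle + " " + result[pos:]
    return result

# Hypothetical needles, for illustration only.
filler = "Unrelated meeting minutes fill this section. " * 200
needles = [
    "The courier arrives on Tuesday.",
    "The ledger is kept in room 7.",
    "The vault code is 4410.",
]
context = insert_needles(filler, needles)
```

The model must then retrieve all of the planted facts from `context`, not just one.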
### Multi-Needle Reasoning (`needlebench_multi_reasoning.py`)

Building on the retrieval tasks, the Multi-Needle Reasoning task emphasizes the LLM's capacity for complex reasoning with the retrieved information. The model must not only recall multiple pieces of information but also reason over them logically, synthesizing answers that reflect an understanding of the intricate relationships between the various information points.
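To make the distinction from plain retrieval concrete, here is a hypothetical item of this kind (the facts, question, and crude scorer below are illustrative, not drawn from the actual dataset): neither needle alone answers the question, so the model must combine them.

```python
# Hypothetical multi-needle reasoning item: neither fact alone yields the answer.
needles = [
    "Bob's badge number is 17.",
    "Alice's badge number is four more than Bob's.",
]
question = "What is Alice's badge number?"
reference = "21"

def contains_answer(prediction: str, reference: str) -> bool:
    """Crude scorer: does the reference answer string appear in the prediction?"""
    return reference.strip() in prediction

# A model that chains both facts (17 + 4) should produce the answer 21.
print(contains_answer("Alice's badge number is 21.", reference))  # prints True
```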
### Ancestral Trace Challenge (ATC)

The Ancestral Trace Challenge is Needlebench's most complex task, requiring models to recall and analyze every detail in a long text to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis. It highlights the need for models to process and reason over information at a granular level, mirroring the demands of sophisticated real-world analytical tasks.
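A genealogical item of this flavor can be sketched as below. This is an illustrative toy generator, not the ATC dataset's actual code; the names and phrasing are hypothetical. The parent facts form a single chain but are shuffled, so answering requires reassembling every link.

```python
import random

def build_ancestry_chain(names: list[str], seed: int = 0):
    """Build shuffled 'X is Y's parent' facts forming one chain, plus a
    question whose answer requires reassembling the whole chain."""
    rng = random.Random(seed)
    # names[0] is the eldest ancestor; each name is the parent of the next.
    facts = [f"{names[i]} is {names[i + 1]}'s parent." for i in range(len(names) - 1)]
    rng.shuffle(facts)  # shuffled, so the chain must be reconstructed
    question = f"Who is the eldest known ancestor of {names[-1]}?"
    return facts, question, names[0]

facts, question, answer = build_ancestry_chain(["Avery", "Blake", "Casey", "Drew", "Emery"])
# answer == "Avery": the model must chain four parent relations to recover it.
```

Embedding such a chain in a long haystack forces the model to find and connect every relevant detail rather than any single needle.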