Mo Li 2025-05-29 14:22:59 +08:00 committed by GitHub
commit 7a44a80bb9
44 changed files with 3781 additions and 405 deletions


@ -1,52 +1,26 @@
# Needle In A Haystack Experimental Evaluation
# Needle In A Haystack Evaluation
## Introduction to the Needle In A Haystack Test
The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method that randomly inserts key information into long texts to form prompts for large language models (LLMs). The test aims to detect whether large models can extract such key information from extensive texts, thereby assessing the models' capabilities in processing and understanding long documents.
The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method where key information is randomly inserted into long texts to form the prompt for large language models (LLMs). This test aims to assess whether LLMs can extract critical information from long texts, thereby evaluating their fundamental ability to comprehend and process long-context documents.
## Task Overview
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning. For a complete introduction, refer to our [technical report](https://arxiv.org/abs/2407.11963):
Within the `OpenCompass` framework, under `NeedleBench`, we designed a series of progressively challenging evaluation tasks to comprehensively assess LLMs' long-text information extraction and reasoning capabilities. For a complete description, please refer to our [technical report](https://arxiv.org/abs/2407.11963).
- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Single-Needle Retrieval Task (S-RT)**: Evaluates the LLM's ability to retrieve a single piece of key information from a long text, testing precise recall of specific details within extensive narratives. This corresponds to the **original Needle In A Haystack test** setup.
- **Multi-Needle Retrieval Task (M-RT)**: Explores an LLM's capability to retrieve multiple related pieces of information from long texts, simulating real-world scenarios of complex queries on comprehensive documents.
- **Multi-Needle Retrieval Task (M-RT)**: Explores the LLM's ability to retrieve multiple relevant pieces of information from long texts, simulating complex queries over comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Evaluates an LLM's long-text abilities by extracting and utilizing multiple key pieces of information, requiring the model to have a comprehensive understanding of each key information fragment.
- **Multi-Needle Reasoning Task (M-RS)**: Assesses LLMs' abilities to integrate multiple key pieces of information extracted from long texts for reasoning, requiring a comprehensive understanding of content.
- **Ancestral Trace Challenge (ATC)**: Uses the "relational needle" to test an LLM's ability to handle multi-layer logical challenges in real long texts. In the ATC task, a series of logical reasoning questions are used to test the model's memory and analytical skills for every detail in the text. For this task, we remove the irrelevant text (Haystack) setting, designing all texts as critical information, requiring the LLM to use all the content and reasoning in the text accurately to answer the questions.
- **Ancestral Trace Challenge (ATC)**: Tests LLMs' capabilities in handling multi-layer logical challenges within realistic long-text contexts through "kinship trace needles." In the ATC task, no irrelevant (haystack) texts are added; every piece of text is critical, and models must reason through all details for accurate answers.
### Evaluation Steps
> **Note:** NeedleBench (v2) includes several optimizations and adjustments in dataset construction and task details. For a detailed comparison between the old and new versions, as well as a summary of updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).
> Note: In the latest code, OpenCompass is set to load the dataset automatically from the [Huggingface API](https://huggingface.co/datasets/opencompass/NeedleBench), so you can **skip** the manual download and placement steps below.
## Evaluation Steps
1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files in the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:
```
opencompass/
├── configs
├── docs
├── data
│   └── needlebench
│       ├── multi_needle_reasoning_en.json
│       ├── multi_needle_reasoning_zh.json
│       ├── names.json
│       ├── needles.jsonl
│       ├── PaulGrahamEssays.jsonl
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       ├── zh_tech.jsonl
│       ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
> Note: In the latest `OpenCompass` codebase, the NeedleBench dataset is automatically loaded from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), with no need for manual download or configuration.
### `OpenCompass` Environment Setup
@ -58,115 +32,85 @@ cd opencompass
pip install -e .
```
### Configuring the Dataset
### Dataset Configuration
We have pre-configured datasets for common text lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `configs/datasets/needlebench`, allowing you to flexibly create datasets that meet your needs by defining related parameters in the configuration files.
We have pre-configured various long-context settings (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`, and you can flexibly define your parameters by adjusting the configuration files.
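For instance, a minimal config that imports one of these pre-configured settings (here the 128k single-needle datasets; the import paths below are taken from the v2 configs in this repository) could look like the following sketch:
```python
from mmengine.config import read_base

with read_base():
    # pre-configured NeedleBench v2 128k single-needle datasets (original "needle in a haystack" task)
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import (
        needlebench_en_datasets, needlebench_zh_datasets)

# collect every imported list whose name contains 'datasets'
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
```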
### Evaluation Example
#### Evaluating `InternLM2-7B` Model Deployed Using `LMDeploy`
#### Evaluating the `Qwen2-5-7B` Model Deployed with `vLLM`
For example, to evaluate the `InternLM2-7B` model deployed using `LMDeploy` for all tasks in NeedleBench-4K, you can directly use the following command in the command line. This command calls the pre-defined model and dataset configuration files without needing to write additional configuration files:
To evaluate the `Qwen2-5-7B` model deployed with `VLLM` on all tasks under NeedleBench-128K, use the following command. This leverages pre-defined model and dataset configuration files without needing additional configuration:
##### Local Evaluation
If you are evaluating the model locally, the command below will utilize all available GPUs on your machine. You can limit the GPU access for `OpenCompass` by setting the `CUDA_VISIBLE_DEVICES` environment variable. For instance, using `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` will only expose the first four GPUs to OpenCompass, ensuring that it does not use more than these four GPUs.
If evaluating locally, the command will use all available GPUs. You can control GPU visibility using `CUDA_VISIBLE_DEVICES`:
```bash
# Local Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer
# Local evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```
##### Evaluation on a Slurm Cluster
##### Evaluation on Slurm Cluster
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, as shown below:
For Slurm environments, you can add options like `--slurm -p partition_name -q reserved --max-num-workers 16`:
```bash
# Slurm Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
# Slurm evaluation
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating a Subdataset Only
##### Evaluating Specific Subsets
If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single needle version of the NeedleInAHaystack test at 4k length:
If you only want to run the original Needle In A Haystack task, change the dataset parameter to `needlebench_v2_single_128k`, the single-needle version at 128k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
python run.py --dataset needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also choose to evaluate a specific subdataset, such as changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` for testing just the Chinese version of the single needle 4K length NeedleInAHaystack task. The parameter after `/` represents the subdataset, which can be found in the dataset variable of `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` :
To evaluate only the Chinese subsets, append the sub-dataset name after `/`, e.g. `needlebench_v2_single_128k/needlebench_zh_datasets`; the available sub-dataset variables are defined in `opencompass/configs/datasets/needlebench_v2/needlebench_v2_128k/needlebench_v2_single_128k.py`:
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
python run.py --dataset needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
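The name after `/` must match a dataset-list variable defined in that configuration file; for the 128k single-needle config, the selectable variables are, roughly:
```python
# opencompass/configs/datasets/needlebench_v2/needlebench_v2_128k/needlebench_v2_single_128k.py
needlebench_en_datasets = []  # English single-needle settings (PaulGrahamEssays haystack)
needlebench_zh_datasets = []  # Chinese single-needle settings (zh_finance haystack)
```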
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation:
Ensure `VLLM` is installed beforehand:
```bash
pip install lmdeploy
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see the official documentation: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```
This command starts the evaluation. `-p partition_name` specifies the Slurm partition, `-q` sets the quota type (e.g. `auto` or `reserved`), and `--max-num-workers 16` limits the number of parallel worker processes.
#### Evaluating Other `Huggingface` Models
For other models, we recommend writing an additional configuration file to modify the model's `max_seq_len` and `max_out_len` parameters so the model can receive the complete long text content, as we have prepared in the `configs/eval_needlebench.py` file. The complete content is as follows:
For other models, it is recommended to write your own config file (such as `examples/eval_needlebench_v2.py`) to adjust `max_seq_len` and `max_out_len`, so that the model can process the full context.
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    # from .datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    # from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    from .datasets.needlebench.needlebench_4k.needlebench_single_4k import needlebench_zh_datasets, needlebench_en_datasets
    from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from .datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from .summarizers.needlebench import atc_summarizer_50 as summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 30768  # Ensure InternLM2-7B model can receive the complete long text, other models need to adjust according to their maximum sequence length support.
    m['max_out_len'] = 2000  # Ensure that in the multi-needle recall task, the model can receive a complete response

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
```
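For NeedleBench v2 the corresponding config follows the same pattern; a condensed sketch of `examples/eval_needlebench_v2.py` (evaluating the 32k single-needle task with `InternLM2-7B`) is shown below:
```python
from mmengine.config import read_base

with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # only eval the original "needle in a haystack test" in needlebench_v2_32k
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import (
        needlebench_zh_datasets, needlebench_en_datasets)
    from opencompass.configs.summarizers.needlebench import needlebench_v2_32k_summarizer as summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # make sure the model receives the full long context
    m['max_out_len'] = 4096   # leave room for complete answers in multi-needle tasks

models = internlm2_chat_7b
work_dir = './outputs/needlebench'
```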
Once the test `config` file is written, we can pass the corresponding config file path through the `run.py` file in the command line, such as:
You can then run evaluation with:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
python run.py configs/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-num-workers` setting to adjust the number of parallel workers.
No need to manually specify `--dataset`, `--models`, or `--summarizer` again.
### Visualization
We have built-in result visualization into the `summarizer` implementation in the latest code version. You can find the corresponding visualizations in the plots directory of the respective output folder, eliminating the need for manual visualization of scores across various depths and lengths.
NeedleBench's latest version has built-in visualization integrated into the summarizer. You can find corresponding visualizations in the `plots` directory under the output folder without needing additional scripts.
If you use this method, please add a reference:
### Citation
If you use NeedleBench, please cite us:
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
year={2024},
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
url={https://arxiv.org/abs/2407.11963},
}
@misc{2023opencompass,
@ -174,8 +118,6 @@ If you use this method, please add a reference:
author={OpenCompass Contributors},
howpublished={\url{https://github.com/open-compass/opencompass}},
year={2023}
}
@misc{LLMTest_NeedleInAHaystack,
@ -187,11 +129,10 @@ If you use this method, please add a reference:
@misc{wei2023skywork,
title={Skywork: A More Open Bilingual Foundation Model},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei L\"u and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
year={2023},
eprint={2310.19341},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```


@ -16,37 +16,11 @@
- **Ancestral Trace Challenge (ATC)**: Uses "kinship-relation needles" to test an LLM's ability to handle multi-layer logical challenges in realistic long texts. The ATC task uses a series of logical reasoning questions to examine the model's memory and analysis of every detail in the text. In this task we remove the irrelevant-text (haystack) setting and instead design all text as key information; the LLM must combine all content in the long text and reason over it to answer the questions accurately.
### Evaluation Steps
> **Note:** NeedleBench (v2) includes some minor optimizations and adjustments in dataset construction and task details. For the specific differences between the old and new versions and a summary of the updates, please refer to [opencompass/configs/datasets/needlebench_v2/readme.md](https://github.com/open-compass/opencompass/blob/main/opencompass/configs/datasets/needlebench_v2/readme.md).
> Note: In the latest code, OpenCompass automatically loads the dataset from the [Huggingface API](https://huggingface.co/datasets/opencompass/NeedleBench), so you can skip the manual download and placement steps below.
## Evaluation Steps
1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files in the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:
```
opencompass/
├── configs
├── docs
├── data
│   └── needlebench
│       ├── multi_needle_reasoning_en.json
│       ├── multi_needle_reasoning_zh.json
│       ├── names.json
│       ├── needles.jsonl
│       ├── PaulGrahamEssays.jsonl
│       ├── zh_finance.jsonl
│       ├── zh_game.jsonl
│       ├── zh_government.jsonl
│       ├── zh_movie.jsonl
│       ├── zh_tech.jsonl
│       ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
> Note: In the latest `OpenCompass` codebase, the NeedleBench dataset is loaded automatically from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench); there is no need to download or configure the dataset manually, and you can run the evaluation commands directly.
### `OpenCompass` Environment Setup
@ -60,13 +34,13 @@ pip install -e .
### Dataset Configuration
We have pre-configured long-context test settings for common lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `configs/datasets/needlebench`; you can flexibly create datasets that meet your needs by defining the relevant parameters in the configuration files.
We have pre-configured long-context test settings for common lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench_v2`; you can flexibly create datasets that meet your needs by defining the relevant parameters in the configuration files.
### Evaluation Example
#### Evaluating the `InternLM2-7B` Model Deployed with `LMDeploy`
#### Evaluating the `Qwen2-5-7B` Model Deployed with `vLLM`
For example, to evaluate all tasks of NeedleBench-4K with the `InternLM2-7B` model deployed via `LMDeploy`, you can run the following command directly on the command line. It uses the pre-defined model and dataset configuration files, so no extra configuration file is needed:
For example, to evaluate all tasks of NeedleBench-128K with the `Qwen2-5-7B` model deployed via `vLLM`, you can run the following command directly on the command line. It uses the pre-defined model and dataset configuration files, so no extra configuration file is needed:
##### Local Evaluation
@ -74,7 +48,7 @@ pip install -e .
```bash
# Local evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer
```
##### Evaluation on a Slurm Cluster
@ -83,70 +57,42 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
```bash
# Slurm evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
python run.py --dataset needlebench_v2_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating Sub-datasets Only
If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, which corresponds to the single-needle version of the test at 4k length:
If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_v2_single_128k`, which corresponds to the single-needle version of the test at 128k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
python run.py --dataset needlebench_v2_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also select a sub-dataset further, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to run only the Chinese single-needle 4K Needle In A Haystack task. The part after `/` denotes the sub-dataset, and the available sub-dataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, for example:
You can also select a sub-dataset further, for example by changing the `--datasets` parameter to `needlebench_v2_single_128k/needlebench_zh_datasets` to run only the Chinese single-needle 128k Needle In A Haystack task. The part after `/` denotes the sub-dataset, and the available sub-dataset variables can be found in `opencompass/configs/datasets/needlebench_v2/needlebench_v2_128k/needlebench_v2_single_128k.py`, for example:
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
python run.py --dataset needlebench_v2_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_v2_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation:
Be sure to install the [VLLM](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html) tool before starting the evaluation:
```bash
pip install lmdeploy
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see the official documentation: https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```
This command starts the evaluation process; the parameters `-p partition_name -q auto` and `--max-num-workers 32` specify the Slurm partition name and the maximum number of worker processes.
This command starts the evaluation process, where `-p partition_name` specifies the Slurm partition name, `-q auto` specifies the quota type (the resource queue type, e.g. `auto` or `reserved`), and `--max-num-workers 32` sets the maximum number of worker processes.
#### Evaluating Other `Huggingface` Models
For other models, we recommend writing an additional run config file to modify the model's `max_seq_len` and `max_out_len` parameters so that the model can receive the complete long-text content, such as our pre-written `configs/eval_needlebench.py` file. Its full content is as follows:
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    # from .datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    # from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    from .datasets.needlebench.needlebench_4k.needlebench_single_4k import needlebench_zh_datasets, needlebench_en_datasets
    from .summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from .datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from .summarizers.needlebench import atc_summarizer_50 as summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text; other models should adjust this based on their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete answer can be received in the multi-needle recall task

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
```
For other models, we recommend writing an additional run config file to modify the model's `max_seq_len` and `max_out_len` parameters so that the model can receive the complete long-text content, such as the `examples/eval_needlebench_v2.py` file.
Once the test `config` file is written, we can pass the path of the corresponding config file to `run.py` on the command line, for example:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
python run.py configs/eval_needlebench_v2.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note that at this point we do not need to pass the `--dataset`, `--models`, or `--summarizer` parameters, since these configurations are already defined in the config file. You can manually adjust `--max-num-workers` to control the number of parallel workers.
@ -155,18 +101,20 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved
In the latest code we have built result visualization into the `summarizer` implementation; you can find the corresponding visualizations in the `plots` directory of the respective output folder, so there is no need to manually visualize the scores across depths and lengths yourself.
### Citation
If you use this method, please add a citation:
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
year={2024},
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
url={https://arxiv.org/abs/2407.11963},
}
@misc{2023opencompass,


@ -1,29 +0,0 @@
from mmengine.config import read_base

with read_base():
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench.needlebench_4k.needlebench_4k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_4k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_4k
    from opencompass.configs.datasets.needlebench.needlebench_4k.needlebench_single_4k import (
        needlebench_en_datasets, needlebench_zh_datasets)
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b
    from opencompass.configs.models.hf_internlm.lmdeploy_internlm2_chat_7b import \
        models as internlm2_chat_7b_200k
    from opencompass.configs.summarizers.needlebench import \
        needlebench_4k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from opencompass.configs.datasets.needlebench.atc.atc_choice_50 import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import atc_summarizer_50 as summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B model can receive the full length of long texts, adjust for other models based on their supported maximum sequence length.
    m['max_out_len'] = 2000  # Ensure complete responses from the model in multi-needle retrieval tasks.

models = internlm2_chat_7b

work_dir = './outputs/needlebench'


@ -0,0 +1,27 @@
from mmengine.config import read_base

# we use mmengine.config to import other config files
with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b

    # Evaluate needlebench_32k, adjust the configuration to use 4k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # only eval original "needle in a haystack test" in needlebench_32k
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import needlebench_v2_32k_summarizer as summarizer

    # eval Ancestral Tracing Challenge(ATC)
    # from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so there is no need to import a summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; adjust for other models based on their maximum supported sequence length.
    m['max_out_len'] = 4096

models = internlm2_chat_7b

work_dir = './outputs/needlebench'


@ -0,0 +1,55 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.atc import NeedleBenchATCDataset
from opencompass.datasets.needlebench_v2.atc import needlebench_atc_postprocess_v2
from opencompass.datasets.needlebench_v2.atc import NeedleBenchATCEvaluator
# ----------------------- Prompt Settings ----------------------- #
needle_num_list = [2, 4, 8, 16, 32, 64, 128, 256, 512]
path = 'opencompass/needlebench'
file_name = 'names.json'
repeats = 10
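# needle_num_list controls how many kinship "needles" each ATC question chains together;
# each setting is repeated `repeats` times.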
# ----------------------- Dataset Settings ----------------------- #
needlebench_datasets = []
needlebench_atc_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_atc_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
],
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(
type=GenInferencer,
),
)
needlebench_atc_eval_cfg = dict(
evaluator=dict(type=NeedleBenchATCEvaluator),
pred_postprocessor=dict(type=needlebench_atc_postprocess_v2),
)
for num_needles in needle_num_list:
abbr = f'NeedleBenchATCDataset-{num_needles}Needle-EN'
language = 'English'
dataset_dict = {
'abbr': abbr,
'type': NeedleBenchATCDataset,
'path': path,
'file_name': file_name,
'num_needles': num_needles,
'language': language,
'repeats': repeats,
'reader_cfg': needlebench_atc_reader_cfg,
'infer_cfg': needlebench_atc_infer_cfg,
'eval_cfg': needlebench_atc_eval_cfg,
}
needlebench_datasets.append(dataset_dict)


@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_reasoning_1000k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_single_1000k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_1000k.needlebench_v2_multi_retrieval_1000k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
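# Aggregate every imported variable whose name ends with '_datasets' into a single list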
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])


@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_1000k import depths_list, context_lengths
from .needlebench_v2_single_1000k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_1000k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_en_datasets'].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_1000k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_zh_datasets'].append(dataset_dict)


@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_1000k import depths_list as depths, context_lengths
from .needlebench_v2_single_1000k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_1000k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)


@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = list([1000, 125000, 250000, 375000, 500000, 625000, 750000, 875000, 1000000])
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
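# The two lists above sweep the total haystack length (in tokens) and the needle insertion depth (as a percentage of the context).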
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_1000k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)


@ -0,0 +1,32 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_reasoning_128k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_single_128k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_128k.needlebench_v2_multi_retrieval_128k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])
if __name__ == '__main__':
print(len(needlebench_datasets))
# sum num_repeats_per_file of all datasets
num_repeats_per_file = sum(dataset['num_repeats_per_file'] for dataset in needlebench_datasets) * 8
print(num_repeats_per_file)
# every repeat is 5 seconds
print(num_repeats_per_file * 5 / 60, 'minutes')
# print number of hours
print(num_repeats_per_file * 5 / 3600, 'hours')
# if every repeat is 2 minutes, how many days
print(num_repeats_per_file * 2 / 60 / 24, 'days')


@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_128k import depths_list, context_lengths
from .needlebench_v2_single_128k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_128k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_en_datasets'].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_128k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_zh_datasets'].append(dataset_dict)


@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_128k import depths_list as depths, context_lengths
from .needlebench_v2_single_128k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_128k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)


@ -0,0 +1,82 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = list([1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000])
# context_lengths = [128000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_128k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)


@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_reasoning_200k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_single_200k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_200k.needlebench_v2_multi_retrieval_200k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])


@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_200k import depths_list, context_lengths
from .needlebench_v2_single_200k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_200k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_en_datasets'].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_200k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[f'needlebench_{num_needles}needle_zh_datasets'].append(dataset_dict)


@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_200k import depths_list as depths, context_lengths
from .needlebench_v2_single_200k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_200k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)


@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = list([1000, 25000, 50000, 75000, 100000, 125000, 150000, 175000, 200000])
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_200k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_reasoning_256k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_single_256k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_256k.needlebench_v2_multi_retrieval_256k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
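# Collect every *_datasets list imported above into a single flat list for this length setting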
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

View File

@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_256k import depths_list, context_lengths
from .needlebench_v2_single_256k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_256k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_256k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)

View File

@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_256k import depths_list as depths, context_lengths
from .needlebench_v2_single_256k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
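# Reuse the single-needle eval config imported above, but swap its evaluator for the parallel (multi-needle) one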
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_256k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = [32000, 128000, 256000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_256k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_reasoning_32k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_single_32k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_32k.needlebench_v2_multi_retrieval_32k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
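# Collect every *_datasets list imported above into a single flat list for this length setting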
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

View File

@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_32k import depths_list, context_lengths
from .needlebench_v2_single_32k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 3000
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_32k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_32k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)

View File

@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_32k import depths_list as depths, context_lengths
from .needlebench_v2_single_32k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
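# Reuse the single-needle eval config imported above, but swap its evaluator for the parallel (multi-needle) one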
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_32k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = [1000, 4000, 8000, 12000, 16000, 20000, 24000, 28000, 32000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 3000,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_32k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_reasoning_4k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_single_4k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_multi_retrieval_4k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
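# Collect every *_datasets list imported above into a single flat list for this length setting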
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

View File

@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_4k import depths_list, context_lengths
from .needlebench_v2_single_4k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 500
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_4k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_4k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)

View File

@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_4k import depths_list as depths, context_lengths
from .needlebench_v2_single_4k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
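# Reuse the single-needle eval config imported above, but swap its evaluator for the parallel (multi-needle) one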
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 500,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_4k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = [1000, 2000, 3000, 4000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 500,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_4k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,18 @@
from mmengine.config import read_base
with read_base():
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_en_datasets as needlebench_multi_2needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_3needle_en_datasets as needlebench_multi_3needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_4needle_en_datasets as needlebench_multi_4needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_5needle_en_datasets as needlebench_multi_5needle_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_2needle_zh_datasets as needlebench_multi_2needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_3needle_zh_datasets as needlebench_multi_3needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_4needle_zh_datasets as needlebench_multi_4needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_reasoning_8k import needlebench_5needle_zh_datasets as needlebench_multi_5needle_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_en_datasets as needlebench_origin_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_single_8k import needlebench_zh_datasets as needlebench_origin_zh_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_en_datasets as needlebench_parallel_en_datasets
from opencompass.configs.datasets.needlebench_v2.needlebench_v2_8k.needlebench_v2_multi_retrieval_8k import needlebench_zh_datasets as needlebench_parallel_zh_datasets
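# Collect every *_datasets list imported above into a single flat list for this length setting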
needlebench_datasets = sum((v for k, v in locals().items() if k.endswith('_datasets')), [])

View File

@@ -0,0 +1,93 @@
from opencompass.datasets.needlebench_v2.multi import NeedleBenchMultiDataset
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_8k import depths_list, context_lengths
from .needlebench_v2_single_8k import needlebench_reader_cfg, needlebench_infer_cfg
from opencompass.configs.datasets.needlebench_v2.atc.atc_0shot_nocot_2_power_en import needlebench_atc_eval_cfg as needlebench_eval_cfg
# ----------English Version----------
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'English'
length_buffer = 500
# Initialize dataset lists
needlebench_2needle_en_datasets = []
needlebench_3needle_en_datasets = []
needlebench_4needle_en_datasets = []
needlebench_5needle_en_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_en_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_en_8k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)
# ----------Chinese Version----------
base_path = 'opencompass/needlebench'
file_list = ['zh_finance.jsonl']
needle_file_name = 'names.json'
diff = 10
language = 'Chinese'
length_buffer = 200
# Initialize dataset lists
needlebench_2needle_zh_datasets = []
needlebench_3needle_zh_datasets = []
needlebench_4needle_zh_datasets = []
needlebench_5needle_zh_datasets = []
# Create datasets for different numbers of needles
for num_needles in range(2, 6):
dataset_list_name = f'needlebench_{num_needles}needle_zh_datasets'
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_{num_needles}needle_zh_8k',
'type': NeedleBenchMultiDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 10,
'length_buffer': length_buffer,
'language': language,
'needle_file_name': needle_file_name,
'num_needles': num_needles,
'diff': diff,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
# Add to the appropriate list using globals()
globals()[dataset_list_name].append(dataset_dict)

View File

@@ -0,0 +1,55 @@
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from mmengine.config import read_base
with read_base():
from .needlebench_v2_single_8k import depths_list as depths, context_lengths
from .needlebench_v2_single_8k import needlebench_reader_cfg, needlebench_infer_cfg, needlebench_eval_cfg
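# Reuse the single-needle eval config imported above, but swap its evaluator for the parallel (multi-needle) one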
needlebench_eval_cfg['evaluator']['type'] = NeedleBenchParallelEvaluator
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 500,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}_parallel_{config["suffix"]}_8k',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 25,
'length_buffer': config['length_buffer'],
'language': config['language'],
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,122 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelDataset
from opencompass.datasets.needlebench_v2.parallel import NeedleBenchParallelEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
import math
def logistic(x, L=100, x0=50, k=0.1):
return round(L / (1 + math.exp(-k * (x - x0))), 3)
def generate_linear_space(start, end, num):
if num == 1:
return [start]
elif num < 1:
raise ValueError('num must be at least 1.')
step = (end - start) / (num - 1)
return [start + step * i for i in range(num)]
def generate_depth_percents(intervals, interval_type):
if interval_type == 'linear':
return generate_linear_space(0, 100, intervals)
elif interval_type == 'sigmoid':
linear_space = generate_linear_space(0, 100, intervals)
return [logistic(x) for x in linear_space]
else:
raise ValueError('Unsupported interval type')
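# Example: generate_depth_percents(5, 'linear') -> [0.0, 25.0, 50.0, 75.0, 100.0];
# with 'sigmoid', the same points map through the logistic curve to roughly
# [0.67, 7.59, 50.0, 92.41, 99.33], clustering sampled depths near the document's start and end.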
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchParallelEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = list(range(5000, 9000, 1000))
document_depth_percent_intervals_list = [1, 5, 10, 15, 20]
document_depth_percent_interval_type = 'linear'
base_path = 'opencompass/needlebench'
file_list = ['PaulGrahamEssays.jsonl']
needlebench_en_datasets = []
needle_file_name = 'needles.jsonl'
for document_depth_percent_intervals in document_depth_percent_intervals_list:
depths_float = generate_depth_percents(
document_depth_percent_intervals, document_depth_percent_interval_type
)
depths = [int(depth) for depth in depths_float]
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'_parallel_en_8k_batch{document_depth_percent_intervals}',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 50,
'length_buffer': 1300,
'guide': True,
'language': 'English',
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
needlebench_en_datasets.append(dataset_dict)
file_list = ['zh_finance.jsonl']
needlebench_zh_datasets = []
needle_file_name = 'needles.jsonl'
for document_depth_percent_intervals in document_depth_percent_intervals_list:
depths_float = generate_depth_percents(
document_depth_percent_intervals, document_depth_percent_interval_type
)
depths = [int(depth) for depth in depths_float]
for original_context_length in context_lengths:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'_parallel_zh_8k_batch{document_depth_percent_intervals}',
'type': NeedleBenchParallelDataset,
'path': base_path,
'needle_file_name': needle_file_name,
'length': original_context_length,
'depths': depths,
'tokenizer_model': 'gpt-4',
'file_list': file_list,
'num_repeats_per_file': 50,
'length_buffer': 200,
'guide': True,
'language': 'Chinese',
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
needlebench_zh_datasets.append(dataset_dict)

View File

@@ -0,0 +1,81 @@
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginDataset
from opencompass.datasets.needlebench_v2.origin import NeedleBenchOriginEvaluator
from opencompass.datasets.needlebench_v2.origin import needlebench_postprocess
from opencompass.datasets.needlebench_v2.origin import needlebench_dataset_postprocess
needlebench_reader_cfg = dict(input_columns=['prompt'], output_column='answer')
needlebench_infer_cfg = dict(
prompt_template=dict(
type=PromptTemplate,
template=dict(
round=[
dict(role='HUMAN', prompt='{prompt}'),
dict(role='BOT', prompt='{answer}\n'),
]
),
),
retriever=dict(type=ZeroRetriever),
inferencer=dict(type=GenInferencer),
)
needlebench_eval_cfg = dict(
evaluator=dict(type=NeedleBenchOriginEvaluator),
pred_postprocessor=dict(type=needlebench_postprocess),
dataset_postprocessor=dict(type=needlebench_dataset_postprocess),
pred_role='BOT',
)
context_lengths = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
depths_list = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
base_path = 'opencompass/needlebench'
needle_file_name = 'needles.jsonl'
# Define configurations for both English and Chinese datasets
language_configs = [
{
'file_list': ['PaulGrahamEssays.jsonl'],
'dataset_var': 'needlebench_en_datasets',
'language': 'English',
'length_buffer': 500,
'suffix': 'en'
},
{
'file_list': ['zh_finance.jsonl'],
'dataset_var': 'needlebench_zh_datasets',
'language': 'Chinese',
'length_buffer': 200,
'suffix': 'zh'
}
]
# Initialize empty dataset lists
needlebench_en_datasets = []
needlebench_zh_datasets = []
# Single loop to handle both languages
for config in language_configs:
for original_context_length in context_lengths:
for depth_percent in depths_list:
dataset_dict = {
'abbr': f'Length{original_context_length}'
f'Depth{int(depth_percent)}_origin_{config["suffix"]}_8k',
'type': NeedleBenchOriginDataset,
'path': base_path,
'length': original_context_length,
'depth': int(depth_percent),
'tokenizer_model': 'gpt-4',
'file_list': config['file_list'],
'num_repeats_per_file': 10,
'length_buffer': config['length_buffer'],
'language': config['language'],
'needle_file_name': needle_file_name,
'reader_cfg': needlebench_reader_cfg,
'infer_cfg': needlebench_infer_cfg,
'eval_cfg': needlebench_eval_cfg,
}
globals()[config['dataset_var']].append(dataset_dict)

View File

@@ -0,0 +1,69 @@
# NeedleBench V2: An Enhanced Benchmark for Needle-In-A-Haystack Evaluations
English | [简体中文](readme_zh-CN.md)
## Overview
NeedleBench V2 is an improved benchmark that rigorously assesses the information retrieval and reasoning capabilities of large language models (LLMs) in long-context scenarios. Building upon the original NeedleBench, this version introduces significant enhancements to provide more accurate and unbiased evaluations of LLMs' abilities to locate and reason with critical information in extensive texts.
### Directory Structure
```
configs/datasets/needlebench_v2/
├── atc
├── needlebench_v2_4k
├── needlebench_v2_8k
├── needlebench_v2_32k
├── needlebench_v2_128k
├── needlebench_v2_200k
├── needlebench_v2_256k
├── needlebench_v2_1000k
├── readme.md
└── readme_zh-CN.md
```
Within each configuration directory (e.g., `needlebench_v2_4k`), there are configuration files tailored for testing within that specific length setting.
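For example, a single length setting can be pulled into an OpenCompass run configuration via `mmengine`'s `read_base`. This is only a sketch: the aggregate module name (`needlebench_v2_4k`) is assumed to mirror the directory name, so adjust it to the actual file in this directory if it differs.

```python
from mmengine.config import read_base

with read_base():
    # Assumed aggregate config module; it exposes the combined `needlebench_datasets` list.
    from opencompass.configs.datasets.needlebench_v2.needlebench_v2_4k.needlebench_v2_4k import \
        needlebench_datasets

datasets = needlebench_datasets
```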
## Task Descriptions and Length Configurations
NeedleBench V2 offers tasks in various length configurations (4k, 8k, 32k, 128k, 200k, 256k, 1000k) to accommodate different scales of language model evaluation needs. Each length configuration provides specialized test scripts for the following tasks:
### Single-Needle Retrieval
The Single-Needle Retrieval task evaluates LLMs' ability to recall a single piece of crucial information from a haystack text of a specific length. This task assesses the model's precision in identifying and recalling specific information from extended texts.
### Multi-Needle Retrieval
The Multi-Needle Retrieval task challenges LLMs' ability to identify and extract multiple key information points from extensive texts. It simulates real-world scenarios where multiple data points, facts, or figures need to be retrieved from documents or reports, evaluating the model's efficiency in navigating and extracting relevant information from dense texts.
### Multi-Needle Reasoning
In NeedleBench V2, the Multi-Needle Reasoning task has been significantly improved. The original needles based on the R4C/MultiHop dataset have been replaced with fictional information similar to that used in the Ancestral Trace Challenge. This change addresses potential bias from innate knowledge, as the original dataset may have been included in some models' training data. The task continues to evaluate LLMs' capacity for complex reasoning with retrieved information, requiring models not only to recall multiple pieces of information but also to reason over them logically.
### Ancestral Trace Challenge (ATC)
The Ancestral Trace Challenge has been refined in NeedleBench V2. The needle distribution pattern has changed from a dense form (1, 2, 3, 4, 5 needles) to a sparse form based on powers of 2 (2¹, 2², 2³, etc.). This task remains NeedleBench's most complex, requiring models to recall and analyze every detail in long texts for problems demanding an understanding of complex relationships, such as genealogical inquiries or detailed case analysis.
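As an illustration, under the powers-of-two scheme the needle counts grow, for example, as follows (the exact range used depends on the ATC configuration):

```python
# Illustrative only: the first few needle counts under the powers-of-two scheme.
needle_counts = [2 ** k for k in range(1, 6)]  # [2, 4, 8, 16, 32]
```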
## Scoring Methodology
NeedleBench V2 introduces a more balanced scoring system. The overall score is now calculated as a simple average of the three main tasks (Single-Needle Retrieval, Multi-Needle Retrieval, and Multi-Needle Reasoning), with each task receiving equal weight. This change from the previous weighted average approach provides a more straightforward and equitable assessment of model capabilities across different retrieval and reasoning tasks.
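In other words, the overall score reduces to an unweighted mean of the three task scores, as in the sketch below.

```python
# Sketch of the NeedleBench V2 overall score: an unweighted mean of the three task scores.
def overall_score(single_retrieval: float, multi_reasoning: float, multi_retrieval: float) -> float:
    return (single_retrieval + multi_reasoning + multi_retrieval) / 3
```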
## Prompt Enhancements
All prompts in NeedleBench V2 have been refined for greater clarity and effectiveness, with particular attention to the ATC experiment prompts. The configuration structure has also been streamlined for easier use and interpretation.
## Citation
If you use NeedleBench V2 in your research, please cite:
```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
}
```

View File

@@ -0,0 +1,69 @@
# NeedleBench V2: An Improved Needle-In-A-Haystack Evaluation Benchmark
[English](readme.md) | Simplified Chinese
## Overview
NeedleBench V2 is an improved benchmark designed to rigorously evaluate the information retrieval and reasoning capabilities of large language models (LLMs) in long-context scenarios. Building on the original NeedleBench, this version introduces important enhancements that provide a more accurate and fairer assessment of LLMs' ability to locate and reason over key information in massive texts.
### Directory Structure
```
configs/datasets/needlebench_v2/
├── atc
├── needlebench_v2_4k
├── needlebench_v2_8k
├── needlebench_v2_32k
├── needlebench_v2_128k
├── needlebench_v2_200k
├── needlebench_v2_256k
├── needlebench_v2_1000k
├── readme.md
└── readme_zh-CN.md
```
Each length-specific directory (such as `needlebench_v2_4k`) contains the task configuration files tailored to that length setting.
## Task Descriptions and Length Configurations
NeedleBench V2 provides tasks in several length configurations (4k, 8k, 32k, 128k, 200k, 256k, 1000k) to match the evaluation needs of language models at different scales. Each length configuration ships dedicated test scripts for the following tasks:
### Single-Needle Retrieval
The Single-Needle Retrieval task evaluates an LLM's ability to recall a single piece of key information from irrelevant text of a given length, measuring how precisely the model identifies and recalls specific information within long texts.
### Multi-Needle Retrieval
The Multi-Needle Retrieval task challenges an LLM's ability to identify and extract multiple key pieces of information from an extensive text. It simulates real-world scenarios in which several data points, facts, or figures must be retrieved from documents or reports, assessing how efficiently the model navigates dense text and extracts the relevant information.
### Multi-Needle Reasoning
In NeedleBench V2, the Multi-Needle Reasoning task has been significantly improved. The original "needles" based on the R4C/MultiHop dataset have been replaced with fictional information similar to that used in the Ancestral Trace Challenge. This change addresses potential bias from innate knowledge, since the original dataset may have been included in some models' training data. The task continues to evaluate an LLM's ability to perform complex reasoning with retrieved information, requiring the model not only to recall multiple pieces of information but also to reason over them logically.
### Ancestral Trace Challenge (ATC)
The Ancestral Trace Challenge has been refined in NeedleBench V2. The needle distribution has changed from a dense pattern (1, 2, 3, 4, 5 needles) to a sparse pattern based on powers of two (2¹, 2², 2³, and so on). It remains the most complex task in NeedleBench, requiring the model to recall and analyze every detail in a long text in order to solve problems that demand an understanding of complex relationships, such as genealogical inquiries or detailed case analysis.
## Scoring Methodology
NeedleBench V2 introduces a more balanced scoring system. The overall score is now computed as a simple average of the three main tasks (Single-Needle Retrieval, Multi-Needle Retrieval, and Multi-Needle Reasoning), with each task receiving equal weight. Compared with the previous weighted-average approach, this provides a more direct and fairer way to assess model capability across the different retrieval and reasoning tasks.
## Prompt Enhancements
All prompts in NeedleBench V2 have been refined for clarity and effectiveness, with particular attention to the prompts of the ATC experiments. The configuration structure has also been streamlined to make it easier to use and understand.
## Citation
If you use NeedleBench V2 in your research, please cite:
```bibtex
@misc{li2025needlebenchllmsretrievalreasoning,
title={NeedleBench: Can LLMs Do Retrieval and Reasoning in Information-Dense Context?},
author={Mo Li and Songyang Zhang and Taolin Zhang and Haodong Duan and Yunxin Liu and Kai Chen},
year={2025},
eprint={2407.11963},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2407.11963},
}
```

View File

@@ -1,4 +1,5 @@
from opencompass.summarizers.needlebench import NeedleBenchSummarizer
from opencompass.summarizers.needlebench import NeedleBenchSummarizer, NeedleBenchSummarizerV2
def create_m_rs_names_list(context_lengths, depths, needle_counts,
@@ -30,7 +31,7 @@ def create_m_rs_names_list(context_lengths, depths, needle_counts,
return names_dict
def create_summarizer(context_lengths, depths, dataset_size,
sparse_depths=None):
sparse_depths=None, mean=False):
needle_counts = ['2', '3', '4', '5']
languages = ['en', 'zh']
if sparse_depths:
@@ -81,17 +82,26 @@ def create_summarizer(context_lengths, depths, dataset_size,
summary_groups = [
{'name': key, 'subsets': value} for key, value in names_dict.items()
]
summary_groups.append({
'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 0.4,
f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 0.3,
f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 0.3}})
if mean:
summary_groups.append({
'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 1/3,
f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 1/3,
f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 1/3}})
else:
summary_groups.append({
'name': f'NeedleBench-Overall-Score-{dataset_size.upper()}',
'subsets': [[f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}', 'naive_average'],
[f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}', 'average_score']],
'weights': {f'Single-Needle-Retrieval(S-RT)-{dataset_size.upper()}': 0.4,
f'Multi-Needle-Reasoning(M-RS)-{dataset_size.upper()}': 0.3,
f'Multi-Needle-Retrieval(M-RT)-{dataset_size.upper()}': 0.3}})
summarizer_config = {
'type': NeedleBenchSummarizer,
'type': NeedleBenchSummarizerV2 if mean else NeedleBenchSummarizer,
'summary_groups': summary_groups,
'dataset_abbrs': [
f'NeedleBench-Overall-Score-{dataset_size.upper()}',
@@ -143,177 +153,20 @@ needlebench_internal_32k_summarizer = create_summarizer([32000], depths_list_int
needlebench_internal_100k_summarizer = create_summarizer([100000], depths_list_internal, '100000')
needlebench_internal_200k_summarizer = create_summarizer([200000], depths_list_internal, '200000')
_needlebench_8k_parallel_en_batch1 = []
_needlebench_8k_parallel_en_batch5 = []
_needlebench_8k_parallel_en_batch10 = []
_needlebench_8k_parallel_en_batch15 = []
_needlebench_8k_parallel_en_batch20 = []
_needlebench_8k_parallel_zh_batch1 = []
_needlebench_8k_parallel_zh_batch5 = []
_needlebench_8k_parallel_zh_batch10 = []
_needlebench_8k_parallel_zh_batch15 = []
_needlebench_8k_parallel_zh_batch20 = []
for original_context_length in context_lengths_8k:
_needlebench_8k_parallel_en_batch1.append(f'Length{original_context_length}_parallel_en_8k_batch1')
_needlebench_8k_parallel_en_batch5.append(f'Length{original_context_length}_parallel_en_8k_batch5')
_needlebench_8k_parallel_en_batch10.append(f'Length{original_context_length}_parallel_en_8k_batch10')
_needlebench_8k_parallel_en_batch15.append(f'Length{original_context_length}_parallel_en_8k_batch15')
_needlebench_8k_parallel_en_batch20.append(f'Length{original_context_length}_parallel_en_8k_batch20')
_needlebench_8k_parallel_zh_batch1.append(f'Length{original_context_length}_parallel_zh_8k_batch1')
_needlebench_8k_parallel_zh_batch5.append(f'Length{original_context_length}_parallel_zh_8k_batch5')
_needlebench_8k_parallel_zh_batch10.append(f'Length{original_context_length}_parallel_zh_8k_batch10')
_needlebench_8k_parallel_zh_batch15.append(f'Length{original_context_length}_parallel_zh_8k_batch15')
_needlebench_8k_parallel_zh_batch20.append(f'Length{original_context_length}_parallel_zh_8k_batch20')
depths_list_20 = [i for i in range(0, 101, 5)] # [0, 5, 10, ..., 100]
depths_list_10 = [i for i in range(0, 101, 10)] # [0, 10, 20, ..., 100]
_needlebench_8k_parallel_batch1 = _needlebench_8k_parallel_en_batch1 + _needlebench_8k_parallel_zh_batch1
_needlebench_8k_parallel_batch5 = _needlebench_8k_parallel_en_batch5 + _needlebench_8k_parallel_zh_batch5
_needlebench_8k_parallel_batch10 = _needlebench_8k_parallel_en_batch10 + _needlebench_8k_parallel_zh_batch10
_needlebench_8k_parallel_batch15 = _needlebench_8k_parallel_en_batch15 + _needlebench_8k_parallel_zh_batch15
_needlebench_8k_parallel_batch20 = _needlebench_8k_parallel_en_batch20 + _needlebench_8k_parallel_zh_batch20
needlebench_summary_groups = [
{'name': 'parallel_version_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch1]},
{'name': 'parallel_version_zh_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch1]},
{'name': 'parallel_version_en_batch1', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch1]},
{'name': 'parallel_version_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch5]},
{'name': 'parallel_version_zh_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch5]},
{'name': 'parallel_version_en_batch5', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch5]},
{'name': 'parallel_version_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch10]},
{'name': 'parallel_version_zh_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch10]},
{'name': 'parallel_version_en_batch10', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch10]},
{'name': 'parallel_version_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch15]},
{'name': 'parallel_version_zh_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch15]},
{'name': 'parallel_version_en_batch15', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch15]},
{'name': 'parallel_version_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_batch20]},
{'name': 'parallel_version_zh_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_zh_batch20]},
{'name': 'parallel_version_en_batch20', 'subsets': [[_dataset, 'average_score'] for _dataset in _needlebench_8k_parallel_en_batch20]},
]
needlebench_8k_batch_overall_summarizer = dict(
dataset_abbrs=[
'--------- NeedleBench-8k Parallel-Needles ---------', # category
'parallel_version_batch1',
'parallel_version_batch5',
'parallel_version_batch10',
'parallel_version_batch15',
'parallel_version_batch20',
'parallel_version_zh_batch1',
'parallel_version_en_batch1',
'parallel_version_zh_batch5',
'parallel_version_en_batch5',
'parallel_version_zh_batch10',
'parallel_version_en_batch10',
'parallel_version_zh_batch15',
'parallel_version_en_batch15',
'parallel_version_zh_batch20',
'parallel_version_en_batch20',
],
summary_groups=needlebench_summary_groups,
)
needlebench_summary_groups = [
{'name': 'parallel_version_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch1]},
{'name': 'parallel_version_zh_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch1]},
{'name': 'parallel_version_en_batch1', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch1]},
{'name': 'parallel_version_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch5]},
{'name': 'parallel_version_zh_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch5]},
{'name': 'parallel_version_en_batch5', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch5]},
{'name': 'parallel_version_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch10]},
{'name': 'parallel_version_zh_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch10]},
{'name': 'parallel_version_en_batch10', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch10]},
{'name': 'parallel_version_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch15]},
{'name': 'parallel_version_zh_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch15]},
{'name': 'parallel_version_en_batch15', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch15]},
{'name': 'parallel_version_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_batch20]},
{'name': 'parallel_version_zh_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_zh_batch20]},
{'name': 'parallel_version_en_batch20', 'subsets': [[_dataset, 'Depth0'] for _dataset in _needlebench_8k_parallel_en_batch20]},
]
needlebench_8k_batch_depth0_summarizer = dict(
dataset_abbrs=[
'--------- NeedleBench-8k Parallel-Needles ---------', # category
'parallel_version_batch1',
'parallel_version_batch5',
'parallel_version_batch10',
'parallel_version_batch15',
'parallel_version_batch20',
'parallel_version_zh_batch1',
'parallel_version_en_batch1',
'parallel_version_zh_batch5',
'parallel_version_en_batch5',
'parallel_version_zh_batch10',
'parallel_version_en_batch10',
'parallel_version_zh_batch15',
'parallel_version_en_batch15',
'parallel_version_zh_batch20',
'parallel_version_en_batch20',
],
summary_groups=needlebench_summary_groups,
)
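# Note (added for readability, not present in the original config): the two summarizer
# dicts above group the same parallel-needle datasets and differ only in the metric read
# from each dataset ('average_score' for the overall view, 'Depth0' for the depth-0 slice).
# A run config would typically pick one of them, e.g.
#     summarizer = needlebench_8k_batch_overall_summarizer
# though the exact wiring depends on the surrounding OpenCompass config and is assumed here.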
def gen_atc_summarizer(needle_num_list):
categories = [
'ZH-Direct-CE', 'EN-Direct-CE',
'ZH-Reasoning-CE', 'EN-Reasoning-CE'
]
needlebench_atc_summary_groups = []
# Generate summary groups for each category
for category in categories:
# For CircularEval (CE) categories use the perf_4 metric; otherwise use acc_1
metric = 'perf_4' if 'CE' in category else 'acc_1'
# The dataset names used for subsets do not carry the CircularEval suffix
cleaned_category = category.replace('-CE', '').replace('-Direct', '')
needlebench_atc_summary_groups.append({
'name': category,
'subsets': [
[f'NeedleBenchATCDataset-{num_needles}Needle-{cleaned_category}', metric]
for num_needles in needle_num_list
],
'weights': {f'NeedleBenchATCDataset-{num_needles}Needle-{cleaned_category}': num_needles for num_needles in needle_num_list},
})
needlebench_atc_summary_groups.append({
'name': 'ATC-CE-Overall',
'subsets': [
[f'{category}', 'weighted_average'] for category in categories
],
})
atc_dataset_abbrs = []
atc_dataset_abbrs.append(['ATC-CE-Overall', 'naive_average'])
for category in categories:
weighted_average_score_entry = [f'{category}', 'weighted_average']
atc_dataset_abbrs.append(weighted_average_score_entry)
needlebench_atc_summarizer = dict(
dataset_abbrs=[
*atc_dataset_abbrs,
'######## Needlebench-ATC Accuracy ########', # category
*[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH', 'acc_1'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-EN', 'acc_1'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH-Reasoning', 'acc_1'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-EN-Reasoning', 'acc_1'] for num_needles in needle_num_list],
'------------------------------------------',
'######## Needlebench-ATC CircularEval ########', # category
*[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH', 'perf_4'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-EN', 'perf_4'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-ZH-Reasoning', 'perf_4'] for num_needles in needle_num_list],
'------------------------------------------',
*[[f'NeedleBenchATCDataset-{num_needles}Needle-EN-Reasoning', 'perf_4'] for num_needles in needle_num_list],
'------------------------------------------',
],
summary_groups=needlebench_atc_summary_groups
)
return needlebench_atc_summarizer
atc_summarizer_20 = gen_atc_summarizer(list(range(2, 20, 1)))
atc_summarizer_50 = gen_atc_summarizer(list(range(2, 50, 1)))
atc_summarizer_80 = gen_atc_summarizer(list(range(2, 80, 1)))
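# Usage sketch (assumption: these dicts are consumed like any other OpenCompass summarizer).
# An ATC run covering up to 50 needles could simply select
#     summarizer = atc_summarizer_50
# The 'ATC-CE-Overall' row then averages the four CircularEval categories, and each category
# weights its per-needle-count subsets by the needle count, as set up in gen_atc_summarizer above.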
context_lengths_4k = [1000, 2000, 3000, 4000]
needlebench_v2_4k_summarizer = create_summarizer(context_lengths_4k, depths_list_10, '4k', mean=True)
context_lengths_8k = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
needlebench_v2_8k_summarizer = create_summarizer(context_lengths_8k, depths_list_10, '8k', mean=True)
context_lengths_32k = [1000, 4000, 8000, 12000, 16000, 20000, 24000, 28000, 32000]
needlebench_v2_32k_summarizer = create_summarizer(context_lengths_32k, depths_list_10, '32k', mean=True)
context_lengths_128k = [1000, 2000, 4000, 8000, 16000, 32000, 64000, 128000]
needlebench_v2_128k_summarizer = create_summarizer(context_lengths_128k, depths_list_10, '128k', mean=True)
context_lengths_200k = [16000, 48000, 80000, 112000, 128000, 144000, 176000, 200000]
needlebench_v2_200k_summarizer = create_summarizer(context_lengths_200k, depths_list_10, '200k', mean=True)
context_lengths_256k = [32000, 128000, 256000]
needlebench_v2_256k_summarizer = create_summarizer(context_lengths_256k, depths_list_10, '256k', mean=True)
context_lengths_1000k = [20000, 160000, 300000, 440000, 580000, 720000, 860000, 1000000]
needlebench_v2_1000k_summarizer = create_summarizer(context_lengths_1000k, depths_list_10, '1000k', mean=True)

View File

@ -0,0 +1,440 @@
# flake8: noqa
import json
import os
import random
import re
from enum import Enum
from datasets import Dataset
from opencompass.datasets.base import BaseDataset
from opencompass.datasets.needlebench_v2.atc_elder_only import (
NeedleBenchATCEvaluator, clean_atc_answer, needlebench_atc_postprocess_v2)
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path
# Question type enumeration
class QuestionType(Enum):
ELDEST_ANCESTOR = 0 # Eldest ancestor
NTH_ANCESTOR = 1 # Nth-generation ancestor
NTH_DESCENDANT = 2 # Nth-generation descendant
RELATIONSHIP_DISTANCE = 3 # Relationship distance
# Map each relationship term to its generation count (one or two generations)
relationship_generation_map_zh = {
'父亲': 1,
'母亲': 1,
'爸爸': 1,
'妈妈': 1,
'爷爷': 2,
'奶奶': 2,
'姥姥': 2,
'姥爷': 2,
'外公': 2,
'外婆': 2,
}
relationship_generation_map_en = {
'father': 1,
'mother': 1,
'dad': 1,
'mom': 1,
'grandfather': 2,
'grandmother': 2,
'maternal grandmother': 2,
'maternal grandfather': 2,
'paternal grandfather': 2,
'paternal grandmother': 2,
}
relationship_templates_zh_CN = [
'{A}{B}{relationship}',
'{B}{relationship}{A}',
'{A}作为{B}{relationship},对{B}的成长有重要影响。',
'{A}不仅是{B}{relationship},还是{B}的榜样。',
'{A}{B}的成长过程中,不仅仅是{B}{relationship},还是{B}的监护人。',
'{A}{B}来说,不只是一个{relationship},还是一个朋友。',
]
relationship_terms_zh_CN = [
'父亲',
'母亲',
'爸爸',
'妈妈',
'爷爷',
'奶奶',
'姥姥',
'姥爷',
'外公',
'外婆',
]
relationship_terms_en = [
'father',
'mother',
'dad',
'mom',
'grandfather',
'grandmother',
'maternal grandmother',
'maternal grandfather',
'paternal grandfather',
'paternal grandmother',
]
relationship_templates_en = [
"{A} is {B}'s {relationship}.",
"{B}'s {relationship} is {A}.",
("{A}, as {B}'s {relationship}, "
"has a significant impact on {B}'s upbringing."),
("{A} is not only {B}'s {relationship} "
"but also {B}'s role model."),
("During {B}'s upbringing, {A} was not only {B}'s {relationship}, "
"but also {B}'s guardian."),
('For {B}, {A} is not just a {relationship}, '
'but also a friend.'),
'For {B}, {A} is more than just a {relationship}; {A} is a lifelong mentor of {B}.',
]
# Eldest ancestor problem template
shuffled_story_with_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,直到找到最年长的祖先。
例如
例子1.如果张强的父亲是马克除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克
例子2.如果李明的姥姥是张红而张红的父亲是张强除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强
例子3.如果小明是张红的曾孙女张红的祖母是王华王华的父亲是王刚除此以外提供的文本中没有更多关于亲属关系的信息那么小明能够向上追溯到的最年长的亲人就是王刚
注意
1. 你不必纠结这个测试中的人名的性别关系例如一个通常被视为女性化的名字仍然可以是其他人的父亲我们的重点是谁更年长
2. 忽略这个测试中的姓氏遗传问题例如李明仍然可能是王鹏的亲生父亲我们只关注谁更年长不必纠结孩子是否应该继承父亲或母亲的性别
3. 在回答的最后将你的答案放在\\boxed{{}}例如"所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{某人(你的答案)}}"
现在打乱的家族关系文本如下
{shuffled_story}
在上面提供的打乱的家族关系文本中'{last_person}'的能够向上追溯到的最年长的亲人是谁
"""
shuffled_story_with_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to continuously reason through them until you identify the eldest ancestor.
For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.
Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{somebody (your answer here)}}."
Now, the scrambled family relationships are provided below:
{shuffled_story}
Given the scrambled family relationships described above, who is the eldest relative that '{last_person}' can trace back to in the context?
"""
# Nth ancestor problem template
nth_ancestor_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,找到指定人物的特定代祖先。
例如
例子1.如果张强的父亲是马克我们说马克是张强的1代祖先
例子2.如果李明的姥姥是张红姥姥算两代关系而张红的父亲是张强那么张红是李明的2代祖先张强是李明的3代祖先
例子3.如果小明的奶奶是王华奶奶算两代关系王华的妈妈是刘芳那么王华是小明的2代祖先刘芳是小明的3代祖先
注意
1. 你不必纠结这个测试中的人名的性别关系我们只关注辈分关系
2. 忽略这个测试中的姓氏遗传问题我们只关注亲属关系
3. 父亲/母亲/爸爸/妈妈算1代关系爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系
4. 在回答的最后将你的答案放在\\boxed{{}}例如"所以{person}{n}代祖先就是\\boxed{{某人(你的答案)}}"
现在打乱的家族关系文本如下
{shuffled_story}
在上面提供的打乱的家族关系文本中'{person}'{n}代祖先是谁
"""
nth_ancestor_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to identify a specific ancestor of a given person.
For example:
Example 1: If James Hill's father is Jasmine Lane, then Jasmine Lane is James Hill's 1st generation ancestor.
Example 2: If Andrew Williams's grandmother is Dan Newton (grandmother counts as 2 generations), and Dan Newton's father is James Hill, then Dan Newton is Andrew Williams's 2nd generation ancestor, and James Hill is Andrew Williams's 3rd generation ancestor.
Example 3: If Shelley Mills's grandfather is Jeff White (grandfather counts as 2 generations), and Jeff White's mother is Mary Johnson, then Jeff White is Shelley Mills's 2nd generation ancestor, and Mary Johnson is Shelley Mills's 3rd generation ancestor.
Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about generational relationships.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the {n}th generation ancestor of '{person}' is \\boxed{{somebody (your answer here)}}."
Now, the scrambled family relationships are provided below:
{shuffled_story}
Given the scrambled family relationships described above, who is the {n}th generation ancestor of '{person}'?
"""
# Nth descendant problem template
nth_descendant_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做家族关系追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,找到指定人物的特定代子孙。
例如
例子1.如果马克是张强的父亲我们说张强是马克的1代子孙
例子2.如果张红是李明的姥姥姥姥算两代关系而张强是张红的父亲那么李明是张红的2代子孙李明是张强的3代子孙
例子3.如果王华是小明的爷爷爷爷算两代关系刘芳是王华的妈妈那么小明是王华的2代子孙小明是刘芳的3代子孙
注意
1. 你不必纠结这个测试中的人名的性别关系我们只关注辈分关系
2. 忽略这个测试中的姓氏遗传问题我们只关注亲属关系
3. 父亲/母亲/爸爸/妈妈算1代关系爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系
4. 在回答的最后将你的答案放在\\boxed{{}}例如"所以{person}{n}代子孙就是\\boxed{{某人(你的答案)}}"
现在打乱的家族关系文本如下
{shuffled_story}
在上面提供的打乱的家族关系文本中'{person}'{n}代子孙是谁
"""
nth_descendant_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to identify a specific descendant of a given person.
For example:
Example 1: If Jasmine Lane is James Hill's father, then James Hill is Jasmine Lane's 1st generation descendant.
Example 2: If Dan Newton is Andrew Williams's grandmother (grandmother counts as 2 generations), and James Hill is Dan Newton's father, then Andrew Williams is Dan Newton's 2nd generation descendant, and Andrew Williams is James Hill's 3rd generation descendant.
Example 3: If Jeff White is Shelley Mills's grandfather (grandfather counts as 2 generations), and Mary Johnson is Jeff White's mother, then Shelley Mills is Jeff White's 2nd generation descendant, and Shelley Mills is Mary Johnson's 3rd generation descendant.
Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about generational relationships.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the {n}th generation descendant of '{person}' is \\boxed{{somebody (your answer here)}}."
Now, the scrambled family relationships are provided below:
{shuffled_story}
Given the scrambled family relationships described above, who is the {n}th generation descendant of '{person}'?
"""
# Relationship distance problem template
relationship_distance_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做家族关系追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,计算两个人之间的关系距离。
关系距离定义为家族图中从一个人到另一个人所需的最少代数差距注意不同关系有不同的代数差距例如
例子1.如果马克是张强的父亲父亲算1代关系那么张强和马克之间的关系距离是1
例子2.如果张红是李明的姥姥姥姥算2代关系而张强是张红的父亲父亲算1代关系那么李明和张红之间的关系距离是2李明和张强之间的关系距离是3
例子3.如果小明的爷爷是王华爷爷算2代关系王华的妈妈是刘芳妈妈算1代关系那么小明和王华之间的关系距离是2小明和刘芳之间的关系距离是3
注意
1. 你不必纠结这个测试中的人名的性别关系我们只关注辈分关系
2. 忽略这个测试中的姓氏遗传问题我们只关注亲属关系
3. 父亲/母亲/爸爸/妈妈算1代关系爷爷/奶奶/姥姥/姥爷/外公/外婆算2代关系
4. 在回答的最后将你的答案放在\\boxed{{}}例如"所以{person_a}{person_b}之间的关系距离是\\boxed{{5}}"
现在打乱的家族关系文本如下
{shuffled_story}
在上面提供的打乱的家族关系文本中'{person_a}''{person_b}'之间的关系距离是多少
"""
relationship_distance_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to calculate the relationship distance between two individuals.
The relationship distance is defined as the minimum number of generational gaps needed to go from one person to another in the family graph. Note that different relationships have different generational gaps. For example:
Example 1: If Jasmine Lane is James Hill's father (father counts as 1 generation), then the relationship distance between James Hill and Jasmine Lane is 1.
Example 2: If Dan Newton is Andrew Williams's grandmother (grandmother counts as 2 generations), and James Hill is Dan Newton's father (father counts as 1 generation), then the relationship distance between Andrew Williams and Dan Newton is 2, and the relationship distance between Andrew Williams and James Hill is 3.
Example 3: If Jeff White is Shelley Mills's grandfather (grandfather counts as 2 generations), and Mary Johnson is Jeff White's mother (mother counts as 1 generation), then the relationship distance between Shelley Mills and Jeff White is 2, and the relationship distance between Shelley Mills and Mary Johnson is 3.
Notes:
1. You do not need to worry about the gender consistency of names in this test. We only care about relationship connections.
2. Ignore surname inheritance issues. We only care about familial relationships.
3. Father/mother/dad/mom count as 1 generation, while grandfather/grandmother/maternal grandmother/maternal grandfather/paternal grandfather/paternal grandmother count as 2 generations.
4. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the relationship distance between '{person_a}' and '{person_b}' is \\boxed{{5}}."
Now, the scrambled family relationships are provided below:
{shuffled_story}
Given the scrambled family relationships described above, what is the relationship distance between '{person_a}' and '{person_b}'?
"""
@LOAD_DATASET.register_module()
class NeedleBenchATCDataset(BaseDataset):
@staticmethod
def load(
path,
file_name: str,
num_needles: int,
language: str,
repeats: int,
# This parameter cannot be passed through mmengine because it is blocked as lazy
question_types: list[QuestionType] = [
QuestionType.ELDEST_ANCESTOR,
QuestionType.NTH_ANCESTOR,
QuestionType.NTH_DESCENDANT,
QuestionType.RELATIONSHIP_DISTANCE,
], # Support specifying a list of question types
):
data = {'prompt': [], 'answer': [], 'question_type': []}
path = get_data_path(path)
if os.environ.get('DATASET_SOURCE') == 'HF':
from huggingface_hub import snapshot_download
path = snapshot_download(repo_id=path, repo_type='dataset')
file_path = os.path.join(path, file_name)
with open(file_path, 'r', encoding='utf-8') as file:
names_data = json.load(file)
all_names = names_data[language].split(',')
# Ensure question_types is not empty
if not question_types:
raise ValueError('question_types cannot be empty')
for question_type in question_types:
# Generate the specified number of examples for each question type
for i in range(repeats):
# Set a different seed for each question type and repeat
# Use the enum value of the question type multiplied by 10000 as the base to ensure non-overlapping seed ranges
seed = (i + 1) + (10000 * question_type.value)
random.seed(seed)
# Randomly select the specified number of names from all names
# The number of names is num_needles + 1
names = random.sample(all_names, num_needles + 1)
# Select the corresponding relationship terms and templates according to the language
if language == 'Chinese':
relationship_terms = relationship_terms_zh_CN
relationship_templates = relationship_templates_zh_CN
relationship_map = relationship_generation_map_zh
elif language == 'English':
relationship_terms = relationship_terms_en
relationship_templates = relationship_templates_en
relationship_map = relationship_generation_map_en
else:
raise ValueError(
'Unsupported language specified. '
'Please choose either "Chinese" or "English".')
def generate_chain_family_story(names, templates,
relationship_terms,
relationship_map):
story = ''
relationships = []
total_generations = 0 # Track the total generational difference
for i in range(len(names) - 1):
template = random.choice(templates)
relation_term = random.choice(relationship_terms)
relation = template.format(A=names[i],
B=names[i + 1],
relationship=relation_term)
story += f'{relation}*'
# Get the generation difference for this relationship
gen_diff = relationship_map.get(
relation_term, 1) # Default to 1 generation
total_generations += gen_diff
# Record relationship information for later use
relationships.append(
(names[i], names[i + 1], relation_term, gen_diff))
return story, relationships, total_generations
chain_story, relationships, total_generations = generate_chain_family_story(
names, relationship_templates, relationship_terms,
relationship_map)
# Split the chain_story into a list of fragments
family_story_fragments = chain_story.split('*')
family_story_fragments = [
f for f in family_story_fragments if f
]
# Shuffle the list of fragments
random.shuffle(family_story_fragments)
# Join the shuffled fragments back into a string
shuffled_story = ''.join(family_story_fragments)
if question_type == QuestionType.ELDEST_ANCESTOR:
# Eldest ancestor question
last_person = names[-1]
if language == 'Chinese':
prompt = shuffled_story_with_prompt_zh_CN.format(
shuffled_story=shuffled_story,
last_person=last_person)
else:
prompt = shuffled_story_with_prompt_en.format(
shuffled_story=shuffled_story,
last_person=last_person)
answer = names[
0] # The first person is the eldest ancestor
elif question_type == QuestionType.NTH_ANCESTOR:
# Nth ancestor question - trace from the youngest person to the oldest
person = names[
-1] # The youngest person (end of the chain)
n = total_generations # Use the calculated total generational difference
if language == 'Chinese':
prompt = nth_ancestor_prompt_zh_CN.format(
shuffled_story=shuffled_story, person=person, n=n)
else:
prompt = nth_ancestor_prompt_en.format(
shuffled_story=shuffled_story, person=person, n=n)
answer = names[
0] # The oldest person (start of the chain) is the nth ancestor
elif question_type == QuestionType.NTH_DESCENDANT:
# Nth descendant question - trace from the oldest person to the youngest
person = names[0] # The oldest person (start of the chain)
n = total_generations # Use the calculated total generational difference
if language == 'Chinese':
prompt = nth_descendant_prompt_zh_CN.format(
shuffled_story=shuffled_story, person=person, n=n)
else:
prompt = nth_descendant_prompt_en.format(
shuffled_story=shuffled_story, person=person, n=n)
answer = names[
-1] # The youngest person (end of the chain) is the nth descendant
elif question_type == QuestionType.RELATIONSHIP_DISTANCE:
# Relationship distance question - calculate the relationship distance between the two ends of the chain
person_a = names[0] # The oldest person
person_b = names[-1] # The youngest person
if language == 'Chinese':
prompt = relationship_distance_prompt_zh_CN.format(
shuffled_story=shuffled_story,
person_a=person_a,
person_b=person_b)
else:
prompt = relationship_distance_prompt_en.format(
shuffled_story=shuffled_story,
person_a=person_a,
person_b=person_b)
# Use the calculated total generations as the relationship distance
answer = str(total_generations)
else:
# Default fallback to eldest ancestor question
last_person = names[-1]
if language == 'Chinese':
prompt = shuffled_story_with_prompt_zh_CN.format(
shuffled_story=shuffled_story,
last_person=last_person)
else:
prompt = shuffled_story_with_prompt_en.format(
shuffled_story=shuffled_story,
last_person=last_person)
answer = names[
0] # The first person is the eldest ancestor
data['prompt'].append(prompt)
data['answer'].append(answer)
data['question_type'].append(question_type.name)
dataset = Dataset.from_dict({
'prompt': data['prompt'],
'answer': data['answer'],
'question_type': data['question_type'],
})
return dataset
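# Loading sketch (illustrative; the path and file name mirror values used elsewhere in this
# commit rather than anything mandated by this file):
#
#     dataset = NeedleBenchATCDataset.load(
#         path='opencompass/NeedleBench',   # resolved via get_data_path / the HF hub
#         file_name='names.json',
#         num_needles=5,
#         language='English',
#         repeats=10,
#         question_types=[QuestionType.NTH_ANCESTOR, QuestionType.RELATIONSHIP_DISTANCE],
#     )
#
# Every row carries a formatted 'prompt', the gold 'answer' (a person's name, or a
# stringified generation count for RELATIONSHIP_DISTANCE), and the 'question_type' name.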

View File

@ -0,0 +1,253 @@
# flake8: noqa
import json
import os
import random
import re
from datasets import Dataset
from opencompass.datasets.base import BaseDataset
from opencompass.datasets.math import extract_boxed_answer
from opencompass.openicl.icl_evaluator import BaseEvaluator
from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
TEXT_POSTPROCESSORS)
from opencompass.utils import get_data_path
relationship_templates_zh_CN = [
'{A}{B}{relationship}',
'{B}{relationship}{A}',
'{A}作为{B}{relationship},对{B}的成长有重要影响。',
'{A}不仅是{B}{relationship},还是{B}的榜样。',
'{B}{A}所生的孩子。',
'{A}{B}来说,不只是一个{relationship},还是一个朋友。',
]
relationship_terms_zh_CN = [
'父亲',
'母亲',
'爸爸',
'妈妈',
'爷爷',
'奶奶',
'姥姥',
'姥爷',
'外公',
'外婆',
]
relationship_terms_en = [
'father',
'mother',
'dad',
'mom',
'grandfather',
'grandmother',
'maternal grandmother',
'maternal grandfather',
'paternal grandfather',
'paternal grandmother',
]
relationship_templates_en = [
"{A} is {B}'s {relationship}.",
"{B}'s {relationship} is {A}.",
("{A}, as {B}'s {relationship}, "
"has a significant impact on {B}'s upbringing."),
("{A} is not only {B}'s {relationship} "
"but also {B}'s role model."),
'{B} is the child of {A}.',
('For {B}, {A} is not just a {relationship}, '
'but also a friend.'),
'For {B}, {A} is more than just a {relationship}; {A} is a lifelong mentor of {B}.',
]
shuffled_story_with_prompt_zh_CN = """下面是对你的多步推理能力的测试,这个测试叫做祖先追溯测试,我们会模拟不同人的家庭亲属关系,你的任务是在其中不断推理,直到找到最年长的祖先。
例如
例子1.如果张强的父亲是马克除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克
例子2.如果李明的姥姥是张红而张红的父亲是张强除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强
例子3.如果小明是张红的曾孙女张红的祖母是王华王华的父亲是王刚除此以外提供的文本中没有更多关于亲属关系的信息那么小明能够向上追溯到的最年长的亲人就是王刚
注意
1. 你不必纠结这个测试中的人名的性别关系例如一个通常被视为女性化的名字仍然可以是其他人的父亲我们的重点是谁更年长
2. 忽略这个测试中的姓氏遗传问题例如李明仍然可能是王鹏的亲生父亲我们只关注谁更年长不必纠结孩子是否应该继承父亲或母亲的性别
3. 在回答的最后将你的答案放在\\boxed{{}}例如"所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{某人(你的答案)}}"
现在打乱的家族关系文本如下
{shuffled_story}
在上面提供的打乱的家族关系文本中'{last_person}'的能够向上追溯到的最年长的亲人是谁
"""
shuffled_story_with_prompt_en = """Here is a test for multi-step reasoning ability called the Ancestral Trace Challenge. In this test, we will simulate different people's familial relationships, and your task is to continuously reason through them until you identify the eldest ancestor.
For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.
Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{somebody (your answer here)}}."
Now, the scrambled family relationships are provided below:
{shuffled_story}
Given the scrambled family relationships described above, who is the eldest relative that '{last_person}' can trace back to in the context?
"""
@LOAD_DATASET.register_module()
class NeedleBenchATCDataset(BaseDataset):
@staticmethod
def load(
path,
file_name: str,
num_needles: int,
language: str,
repeats: int,
):
data = {'prompt': [], 'answer': []}
path = get_data_path(path)
if os.environ.get('DATASET_SOURCE') == 'HF':
from huggingface_hub import snapshot_download
path = snapshot_download(repo_id=path, repo_type='dataset')
file_path = os.path.join(path, file_name)
with open(file_path, 'r', encoding='utf-8') as file:
names_data = json.load(file)
all_names = names_data[language].split(',')
for i in range(repeats):
# Use a fixed seed so the generated samples stay stable across runs
seed = i
random.seed(seed)
names = random.sample(all_names, num_needles)
if language == 'Chinese':
relationship_terms = relationship_terms_zh_CN
relationship_templates = relationship_templates_zh_CN
elif language == 'English':
relationship_terms = relationship_terms_en
relationship_templates = relationship_templates_en
def generate_chain_family_story(names, templates,
relationship_terms):
story = ''
for i in range(len(names) - 1):
template = random.choice(templates)
relation_term = random.choice(relationship_terms)
relation = template.format(A=names[i],
B=names[i + 1],
relationship=relation_term)
story += f'{relation}*'
return story
chain_story = generate_chain_family_story(names,
relationship_templates,
relationship_terms)
# Splitting the chain_story into a list of fragments
family_story_fragments = chain_story.split('*')
# Shuffling the list of fragments
random.shuffle(family_story_fragments)
# Joining the shuffled fragments back into a string
shuffled_story = ''.join(family_story_fragments)
last_person = names[-1]
# Generating the prompt based on the language
if language == 'Chinese':
shuffled_story_with_prompt = shuffled_story_with_prompt_zh_CN.format(
shuffled_story=shuffled_story, last_person=last_person)
elif language == 'English':
shuffled_story_with_prompt = shuffled_story_with_prompt_en.format(
shuffled_story=shuffled_story, last_person=last_person)
else:
raise ValueError('Unsupported language specified. '
"Please choose either 'Chinese' or 'English'.")
data['prompt'].append(shuffled_story_with_prompt)
data['answer'].append(names[0])
dataset = Dataset.from_dict({
'prompt': data['prompt'],
'answer': data['answer'],
})
return dataset
def clean_atc_answer(text: str) -> str:
"""Clean answer format specifically for QwQ-32B-Preview model.
Args:
text: Raw prediction text
Returns:
Standardized name format after cleaning
"""
if not text or text == 'None':
return 'None'
# Remove LaTeX commands but keep content
text = re.sub(r'\\text\{([^}]+)\}', r'\1', text)
text = re.sub(r'\\boxed\{([^}]+)\}', r'\1', text)
text = re.sub(r'\\[\[\]]', '', text)
# Remove extra backslashes
text = text.replace('\\\\', '').replace('\\', '')
# Handle extra spaces
text = re.sub(r'\s+', ' ', text).strip()
# Remove quotes
text = text.replace('"', '').replace("'", '')
# Remove tildes
text = text.replace('~', ' ')
return text
@TEXT_POSTPROCESSORS.register_module('needlebench_atc_postprocess_v2')
def needlebench_atc_postprocess_v2(text: str) -> str:
cand_ans = extract_boxed_answer(text, strip_double_curly_brace=True)
if cand_ans:
return clean_atc_answer(cand_ans)
return 'None'
@ICL_EVALUATORS.register_module('needlebench_atc_evaluator')
class NeedleBenchATCEvaluator(BaseEvaluator):
def score(self, predictions, gold):
if len(predictions) != len(gold):
return {'error': 'predictions and gold have different lengths'}
correct_count = 0
details = []
for prediction, reference in zip(predictions, gold):
reference_name = reference
if prediction.strip() == reference_name.strip():
correct_count += 1
detail = {
'pred': prediction,
'answer': reference_name,
'correct': prediction.strip() == reference_name.strip()
}
details.append(detail)
accuracy = (correct_count /
len(predictions)) * 100 if predictions else 0
result = {'score': accuracy, 'details': details}
return result
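# End-to-end sketch of the scoring path (illustrative): a response ending in
# "... is \boxed{James Hill}." is reduced by needlebench_atc_postprocess_v2 to the bare
# name 'James Hill' (extract_boxed_answer followed by clean_atc_answer), and
# NeedleBenchATCEvaluator then scores it by exact, whitespace-stripped string match
# against the gold name, reporting accuracy as a percentage.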

View File

@ -0,0 +1,300 @@
# flake8: noqa: E501
import json
import os
import random
import tiktoken
from datasets import Dataset
from huggingface_hub import hf_hub_download
from opencompass.datasets.base import BaseDataset
from opencompass.datasets.needlebench_v2.atc import (
relationship_templates_en, relationship_templates_zh_CN,
relationship_terms_en, relationship_terms_zh_CN)
from opencompass.registry import LOAD_DATASET
def get_random_needles(counter, file_path, num_needles, language):
with open(file_path, 'r', encoding='utf-8') as file:
names_data = json.load(file)
all_names = names_data[language].split(',')
random.seed(counter)
names = random.sample(all_names, num_needles)
if language == 'Chinese':
relationship_terms = relationship_terms_zh_CN
relationship_templates = relationship_templates_zh_CN
elif language == 'English':
relationship_terms = relationship_terms_en
relationship_templates = relationship_templates_en
else:
raise ValueError(f"Unsupported language '{language}' specified.")
def generate_chain_family_story(names, templates, relationship_terms):
story = ''
for i in range(len(names) - 1):
template = random.choice(templates)
relation_term = random.choice(relationship_terms)
relation = template.format(A=names[i],
B=names[i + 1],
relationship=relation_term)
story += f'{relation}*'
return story
chain_story = generate_chain_family_story(names, relationship_templates,
relationship_terms)
# Splitting the chain_story into a list of fragments
family_story_fragments = chain_story.split('*')
# Removing the empty string from the list
family_story_fragments = [
fragment for fragment in family_story_fragments if fragment
]
# Shuffling the list of fragments
random.shuffle(family_story_fragments)
last_person = names[-1]
# Generating the retrieval question based on the language
if language == 'Chinese':
retrieval_question = f"在上面提供的文本中,'{last_person}'的能够向上追溯到的最年长的亲人是谁?"
elif language == 'English':
retrieval_question = f"Given the context described above, who is the eldest relative that '{last_person}' can trace back to in the context?"
# Returning the story, answer, and retrieval question
return {
'needles': family_story_fragments,
'answer': names[0],
'retrieval_question': retrieval_question,
'last_person': last_person
}
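# Reading aid (comment only): get_random_needles returns the shuffled relationship
# fragments to be hidden in the haystack, the gold 'answer' (the first sampled name,
# i.e. the eldest ancestor of the chain), the retrieval question, and the person the
# question asks about ('last_person').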
@LOAD_DATASET.register_module()
class NeedleBenchMultiDataset(BaseDataset):
@staticmethod
def load(
path: str,  # deprecated
length: int,
depth: int,
tokenizer_model: str,
file_list: 'list[str]',
num_repeats_per_file: int,
length_buffer: int,
language: str,
needle_file_name: str,
num_needles: int,
diff: int,
quesiton_position: str = 'End',
):
data = {'prompt': [], 'answer': []}
tokenizer = tiktoken.encoding_for_model(tokenizer_model)
def _generate_context(tokens_context, depth_percent, needles):
tokens_needle = [
_get_tokens_from_context(needle) for needle in needles
]
insertion_points = []
total_length = len(tokens_context)
for i, needle_tokens in enumerate(tokens_needle):
if i == 0:
insertion_point = int(total_length * (depth_percent / 100))
else:
insertion_point = int(insertion_points[i - 1] +
len(tokens_needle[i - 1]) +
total_length * (diff / 100))
insertion_point = min(
insertion_point,
total_length + sum(len(tn) for tn in tokens_needle[:i]))
insertion_points.append(insertion_point)
for i, needle_tokens in enumerate(tokens_needle):
tokens_context = tokens_context[:insertion_points[i]] \
+ needle_tokens + tokens_context[insertion_points[i]:]
for j in range(i + 1, len(insertion_points)):
insertion_points[j] += len(needle_tokens)
new_context = _decode_tokens(tokens_context)
return new_context
def _get_tokens_from_context(context):
if isinstance(context, list):
return [tokenizer.encode(item) for item in context]
else:
return tokenizer.encode(context)
def _decode_tokens(tokens):
return tokenizer.decode(tokens)
def _generate_prompt(context, retrieval_question, last_person):
if language == 'Chinese':
if quesiton_position == 'End':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息回答最后的问题。
长文档的内容如下
<文档>
{context}
</文档>
根据文档中的信息现在请问{retrieval_question}
例如
例子1.如果张强的父亲是马克除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克
例子2.如果李明的姥姥是张红而张红的父亲是张强除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强
例子3.如果小明是张红的曾孙女张红的祖母是王华王华的父亲是王刚除此以外提供的文本中没有更多关于亲属关系的信息那么小明能够向上追溯到的最年长的亲人就是王刚
注意
1. 你不必纠结这个测试中的人名的性别关系例如一个通常被视为女性化的名字仍然可以是其他人的父亲我们的重点是谁更年长
2. 忽略这个测试中的姓氏遗传问题例如李明仍然可能是王鹏的亲生父亲我们只关注谁更年长不必纠结孩子是否应该继承父亲或母亲的性别
3. 在回答的最后将你的答案放在\\boxed{{}}例如所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{你的答案}}
'''
elif quesiton_position == 'Start':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息回答下面的问题。
现在请问{retrieval_question}
例如
例子1.如果张强的父亲是马克除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中张强能够向上追溯到的最年长的亲人就是马克
例子2.如果李明的姥姥是张红而张红的父亲是张强除此以外提供的文本中没有更多关于亲属关系的信息那么在提供的文本中李明能够向上追溯到的最年长的亲人就是张强
例子3.如果小明是张红的曾孙女张红的祖母是王华王华的父亲是王刚除此以外提供的文本中没有更多关于亲属关系的信息那么小明能够向上追溯到的最年长的亲人就是王刚
注意
1. 你不必纠结这个测试中的人名的性别关系例如一个通常被视为女性化的名字仍然可以是其他人的父亲我们的重点是谁更年长
2. 忽略这个测试中的姓氏遗传问题例如李明仍然可能是王鹏的亲生父亲我们只关注谁更年长不必纠结孩子是否应该继承父亲或母亲的性别
3. 在回答的最后将你的答案放在\\boxed{{}}例如所以{last_person}能向上追溯到的最年长的亲人就是\\boxed{{你的答案}}
长文档内容的如下
<文档>
{context}
</文档>
'''
else:
raise ValueError('Unsupported quesiton_position. '
'Position must be "End" or "Start".')
elif language == 'English':
if quesiton_position == 'End':
prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final question based on the information in the document.
The content of the long document is as follows
<Document>
{context}
</Document>
Based on the information in the document, now please answer: {retrieval_question}
For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.
Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{(your answer here)}}."
'''
elif quesiton_position == 'Start':
prompt = f'''This is a test of long-text capability. You need to first read the question below, and then answer it based on the information in the long document that follows.
Now please answer: {retrieval_question}
For example:
Example 1: If James Hill's father is Jasmine Lane, and no further information about familial relationships is provided in the text, then the oldest relative James Hill can trace back to in the provided text is \\boxed{{Jasmine Lane}}.
Example 2: If Andrew Williams's grandmother is Dan Newton, and Dan Newton's father is James Hill, and no further information about familial relationships is provided in the text, then the oldest relative Andrew Williams can trace back to in the provided text is \\boxed{{James Hill}}.
Example 3: If Jeff White's father is Kevin Le, Dan Newton's grandmother is Jeff White, and Jeff White's father is Kevin Le, and Shelley Mills is Dan Newton's great-granddaughter, and no further information about familial relationships is provided in the text, then the oldest relative Shelley Mills can trace back to in the provided text is \\boxed{{Kevin Le}}.
Notes:
1. You do not need to worry about the gender consistency of names in this test. For example, a name that is typically considered feminine can still be the father of another person. Our primary focus is on who is older.
2. Ignore surname inheritance issues. For instance, Andrew Williams could still be the biological father of Christopher Baker. We only care about who is older and do not need to consider whether a child should inherit the father's or mother's surname.
3. At the end of your response, remember to put your final answer within \\boxed{{}}. For example: "So the oldest relative '{last_person}' can trace back to in the provided text is \\boxed{{(your answer here)}}."
The content of the long document is as follows
<Document>
{context}
</Document>
'''
else:
raise ValueError(
f'Unsupported quesiton_position {quesiton_position}. '
'Position must be "End" or "Start".')
else:
raise ValueError(f"Language '{language}' is not supported.")
return prompt
repo_id = 'opencompass/NeedleBench'
file_names = [
'PaulGrahamEssays.jsonl', 'names.json', 'zh_finance.jsonl',
'zh_game.jsonl', 'zh_general.jsonl', 'zh_government.jsonl',
'zh_movie.jsonl', 'zh_tech.jsonl'
]
downloaded_files = []
base_file_path = ''
for file_name in file_names:
file_path = hf_hub_download(repo_id=repo_id,
filename=file_name,
repo_type='dataset')
downloaded_files.append(file_path)
base_file_path = '/'.join(file_path.split('/')[:-1])
needle_file_path = os.path.join(base_file_path, needle_file_name)
for file_path in downloaded_files:
if file_path.split('/')[-1] not in file_list:
continue
with open(file_path, 'r', encoding='utf-8') as f:
lines_bak = [json.loads(line.strip()) for line in f]
lines = lines_bak.copy()
for counter in range(num_repeats_per_file):
random.seed(counter)
random.shuffle(lines)
random_needle_data = get_random_needles(
counter, needle_file_path, num_needles + 1, language)
last_person = random_needle_data['last_person']
needles = [
'\n' + needle + '\n'
for needle in random_needle_data['needles']
]
answer = random_needle_data['answer']
keyword = answer
retrieval_question = random_needle_data['retrieval_question']
context_length = length - length_buffer
target_length_per_record = context_length - \
sum(len(tokens) for tokens
in _get_tokens_from_context(needles))
target_length_per_record = max(target_length_per_record, 0)
accumulated_tokens = []
for line in lines:
tokens_current_line = _get_tokens_from_context(
line['text'])
accumulated_tokens.extend(tokens_current_line)
if len(accumulated_tokens) >= target_length_per_record:
break
processed_text = _generate_context(
accumulated_tokens[:target_length_per_record], depth,
needles)
processed_prompt = _generate_prompt(processed_text,
retrieval_question,
last_person)
data['prompt'].append(processed_prompt)
data['answer'].append(keyword)
dataset = Dataset.from_dict({
'prompt': data['prompt'],
'answer': data['answer'],
})
return dataset
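# Placement note (derived from _generate_context above, stated here as a reading aid):
# the first needle is inserted at `depth` percent of the haystack and each subsequent
# needle lands roughly `diff` percent of the context further along, so one sample spreads
# the whole relationship chain through the document before asking the eldest-ancestor question.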

View File

@ -0,0 +1,222 @@
# flake8: noqa: E501
import json
import os
import random
import re
import tiktoken
from datasets import Dataset
from opencompass.datasets.base import BaseDataset
from opencompass.openicl import BaseEvaluator
from opencompass.registry import LOAD_DATASET, TEXT_POSTPROCESSORS
from opencompass.utils import get_data_path
def get_random_line_by_language(counter, file_path, language):
with open(file_path, 'r', encoding='utf-8') as file:
lines = [
json.loads(line.strip()) for line in file
if json.loads(line.strip())['language'] == language
]
if lines:
random.seed(counter)
random_line = random.choice(lines)
return {
'needle': random_line['needle'],
'retrieval_question': random_line['retrieval_question'],
'keyword': random_line['arg2']
}
else:
return None
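# Reading aid (comment only): this helper picks one random needle entry of the requested
# language from the needle file and exposes its inserted sentence ('needle'), the question
# used to query it, and the keyword ('arg2') that the evaluator later searches for in the
# model output.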
@LOAD_DATASET.register_module()
class NeedleBenchOriginDataset(BaseDataset):
@staticmethod
def load(
path: str,
length: int,
depth: int,
tokenizer_model: str,
file_list: list[str],
num_repeats_per_file: int,
length_buffer: int,
language: str,
needle_file_name: str,
quesiton_position: str = 'End',
):
data = {'prompt': [], 'answer': []}
tokenizer = tiktoken.encoding_for_model(tokenizer_model)
def _generate_context(tokens_context, depth_percent, needle):
tokens_needle = _get_tokens_from_context(needle)
insertion_point = int(len(tokens_context) * (depth_percent / 100))
tokens_context = (tokens_context[:insertion_point] +
tokens_needle + tokens_context[insertion_point:])
new_context = _decode_tokens(tokens_context)
return new_context
def _get_tokens_from_context(context):
return tokenizer.encode(context)
def _decode_tokens(tokens):
return tokenizer.decode(tokens)
def _generate_prompt(context, retrieval_question):
if language == 'Chinese':
if quesiton_position == 'End':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息回答最后的问题。
长文档的内容如下
<文档>
{context}
</文档>
根据文档中的信息现在请问{retrieval_question}
'''
elif quesiton_position == 'Start':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息回答下面的问题。
现在请问{retrieval_question}
长文档内容的如下
<文档>
{context}
</文档>
'''
else:
raise ValueError('Unsupported quesiton_position. '
'Position must be "End" or "Start".')
elif language == 'English':
if quesiton_position == 'End':
prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final question based on the information in the document.
The content of the long document is as follows
<Document>
{context}
</Document>
Based on the information in the document, now please answer: {retrieval_question}
'''
elif quesiton_position == 'Start':
prompt = f'''This is a test of long-text capability. You need to first read the question below, and then answer it based on the information in the long document that follows.
Now please answer: {retrieval_question}
The content of the long document is as follows
<Document>
{context}
</Document>
'''
else:
raise ValueError(
f'Unsupported quesiton_position {quesiton_position}. '
'Position must be "End" or "Start".')
else:
raise ValueError(f"Language '{language}' is not supported.")
return prompt
file_names = [
'en_un_asr.jsonl', 'zh_all.jsonl', 'PaulGrahamEssays.jsonl',
'multi_needle_reasoning_en.json', 'multi_needle_reasoning_zh.json',
'zh_finance.jsonl', 'zh_game.jsonl', 'zh_general.jsonl',
'zh_government.jsonl', 'zh_movie.jsonl', 'zh_tech.jsonl'
]
path = get_data_path(path)
if os.environ.get('DATASET_SOURCE') == 'HF':
from huggingface_hub import snapshot_download
path = snapshot_download(repo_id=path, repo_type='dataset')
needle_file_path = os.path.join(path, needle_file_name)
for file_name in file_names:
file_path = os.path.join(path, file_name)
if file_name not in file_list:
continue
with open(file_path, 'r', encoding='utf-8') as f:
lines_bak = [json.loads(line.strip()) for line in f]
lines = lines_bak.copy()
for counter in range(num_repeats_per_file):
random.seed(counter)
random.shuffle(lines)
random_needle = get_random_line_by_language(
counter, needle_file_path, language)
needle = '\n' + random_needle['needle'] + '\n'
retrieval_question = random_needle['retrieval_question']
keyword = random_needle['keyword']
context_length = length - length_buffer
target_length_per_record = context_length - len(
_get_tokens_from_context(needle))
target_length_per_record = max(target_length_per_record, 0)
accumulated_tokens = []
for line in lines:
tokens_current_line = _get_tokens_from_context(
line['text'])
accumulated_tokens.extend(tokens_current_line)
if len(accumulated_tokens) >= target_length_per_record:
break
processed_text = _generate_context(
accumulated_tokens[:target_length_per_record], depth,
needle)
processed_prompt = _generate_prompt(processed_text,
retrieval_question)
data['prompt'].append(processed_prompt)
data['answer'].append(needle + '*' + keyword)
dataset = Dataset.from_dict({
'prompt': data['prompt'],
'answer': data['answer'],
})
return dataset
class NeedleBenchOriginEvaluator(BaseEvaluator):
def score(self, predictions, gold):
if len(predictions) != len(gold):
return {'error': 'predictions and gold have different lengths'}
total_score = 0
details = []
for prediction, reference in zip(predictions, gold):
keyword = reference.split('*')[1]
reference = reference.split('*')[0]
raw_prediction = prediction
prediction = re.sub(r'\s+', '', prediction)
reference = re.sub(r'\s+', '', reference)
if keyword in raw_prediction:
score = 100
else:
score = 0
detail = {'pred': prediction, 'answer': reference, 'score': score}
total_score += score
details.append(detail)
average_score = total_score / len(predictions) if predictions else 0
result = {'score': average_score, 'details': details}
return result
@TEXT_POSTPROCESSORS.register_module('needlebench')
def needlebench_postprocess(text: str) -> str:
return text
@TEXT_POSTPROCESSORS.register_module('needlebench_dataset_postprocess')
def needlebench_dataset_postprocess(text: str) -> str:
return text
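# Scoring sketch (illustrative): the gold string packs the inserted needle and its keyword
# as '<needle>*<keyword>', and NeedleBenchOriginEvaluator checks whether the keyword appears
# anywhere in the raw prediction, scoring 100 or 0 per sample. For example (needle text and
# keyword made up for illustration):
#     NeedleBenchOriginEvaluator().score(
#         predictions=['... the hidden keyword is 42 ...'],
#         gold=['\nThe hidden keyword is 42.\n*42'],
#     )
# would report a score of 100 for this single sample.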

View File

@ -0,0 +1,308 @@
# flake8: noqa: E501
import json
import os
import random
import tiktoken
from datasets import Dataset
from opencompass.datasets.base import BaseDataset
from opencompass.openicl import BaseEvaluator
from opencompass.registry import LOAD_DATASET
from opencompass.utils import get_data_path
def get_unique_entries(
file_path,
n,
language,
unique_arg1=False,
unique_arg2=False,
unique_combination=False,
):
seen_arg1 = set()
seen_arg2 = set()
seen_combinations = set()
results = []
with open(file_path, 'r', encoding='utf-8') as file:
lines = file.readlines()
random.shuffle(lines)
for line in lines:
try:
entry = json.loads(line.strip())
except json.JSONDecodeError:
continue
if entry.get('language') != language:
continue
key1 = entry.get('arg1', '') if unique_arg1 else ''
key2 = entry.get('arg2', '') if unique_arg2 else ''
combination = (key1, key2) if unique_combination else ''
if ((key1 not in seen_arg1 or not unique_arg1) # noqa: E501
and (key2 not in seen_arg2 or not unique_arg2)
and # noqa: E501
(combination not in seen_combinations
or not unique_combination)): # noqa: E501
seen_arg1.add(key1)
seen_arg2.add(key2)
seen_combinations.add(combination)
results.append(entry)
if len(results) == n:
break
return results
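# Reading aid (comment only): get_unique_entries shuffles the needle file and keeps at most
# n entries of the requested language, optionally forcing arg1, arg2, or the (arg1, arg2)
# pair to be unique. NeedleBenchParallelDataset below requests exactly len(depths) such
# entries, i.e. one needle per insertion depth.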
@LOAD_DATASET.register_module()
class NeedleBenchParallelDataset(BaseDataset):
@staticmethod
def load(
path: str,
needle_file_name: str,
length: int,
depths: list[int],
tokenizer_model: str,
file_list: list[str],
num_repeats_per_file: int,
length_buffer: int,
language: str,
quesiton_position: str = 'End',
):
data = {'prompt': [], 'answer': []}
tokenizer = tiktoken.encoding_for_model(tokenizer_model)
file_names = [
'PaulGrahamEssays.jsonl',
'multi_needle_reasoning_en.json',
'multi_needle_reasoning_zh.json',
'zh_finance.jsonl',
'zh_game.jsonl',
'zh_general.jsonl',
'zh_government.jsonl',
'zh_movie.jsonl',
'zh_tech.jsonl',
]
path = get_data_path(path)
if os.environ.get('DATASET_SOURCE') == 'HF':
from huggingface_hub import snapshot_download
path = snapshot_download(repo_id=path, repo_type='dataset')
needle_file_path = os.path.join(path, needle_file_name)
predefined_needles_bak = get_unique_entries(
needle_file_path,
len(depths),
language,
unique_arg1=True,
unique_arg2=True,
unique_combination=True,
)
def _generate_context(tokens_context, depths, needles):
insertion_points = [
int(len(tokens_context) * (depth / 100)) for depth in depths
]
cumulative_inserted_length = 0
for i, needle in enumerate(needles):
needle_tokens = _get_tokens_from_context(needle)
current_insertion_point = min(
insertion_points[i] + cumulative_inserted_length,
len(tokens_context),
)
tokens_context = (tokens_context[:current_insertion_point] +
needle_tokens +
tokens_context[current_insertion_point:])
cumulative_inserted_length += len(needle_tokens)
new_context = _decode_tokens(tokens_context)
return new_context
def _get_tokens_from_context(context):
if isinstance(context, list):
return [tokenizer.encode(item) for item in context]
else:
return tokenizer.encode(context)
def _decode_tokens(tokens):
return tokenizer.decode(tokens)
def _generate_prompt(context, retrieval_question):
if language == 'Chinese':
if quesiton_position == 'End':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的长文档,然后根据文档中的信息,依次回答最后的问题。
长文档的内容如下
<文档>
{context}
</文档>
根据文档中的信息现在请问{retrieval_question}
'''
elif quesiton_position == 'Start':
prompt = f'''这是一个长文本能力的测试,你需要首先阅读下面的问题,然后根据最后长文档中的信息,依次回答下面的问题。
现在请问{retrieval_question}
长文档内容的如下
<文档>
{context}
</文档>
'''
else:
raise ValueError(
f'Unsupported quesiton_position {quesiton_position}. '
'Position must be "End" or "Start".')
elif language == 'English':
if quesiton_position == 'End':
prompt = f'''This is a test of long-text capability. You need to first read the long document below, and then answer the final questions one by one based on the information in the document.
The content of the long document is as follows
<Document>
{context}
</Document>
Based on the information in the document, now please answer: {retrieval_question}
'''
elif quesiton_position == 'Start':
prompt = f'''This is a test of long-text capability. You need to first read the questions below, and then answer them one by one based on the information in the long document that follows.
Now please answer: {retrieval_question}
The content of the long document is as follows
<Document>
{context}
</Document>
'''
else:
raise ValueError(
f'Unsupported quesiton_position {quesiton_position}. '
'Position must be "End" or "Start".')
else:
raise ValueError(f"Language '{language}' is not supported.")
return prompt
for file_name in file_names:
file_path = os.path.join(path, file_name)
if file_name not in file_list:
continue
with open(file_path, 'r', encoding='utf-8') as f:
lines_bak = [json.loads(line.strip()) for line in f]
lines = lines_bak.copy()
for counter in range(num_repeats_per_file):
random.seed(counter)
random.shuffle(lines)
predefined_needles = predefined_needles_bak.copy()
random.seed(counter)
random.shuffle(predefined_needles)
needles = [
'\n' + item['needle'] + '\n' for item in predefined_needles
]
keywords = [item['arg2'] for item in predefined_needles]
if language == 'Chinese':
questions = ''.join([
item['retrieval_question'].split('？')[0] + '？'
for item in predefined_needles
])
answers_format = ''.join([
item['retrieval_question'].split("'")[1].split('。')[0]
for item in predefined_needles
])
retrieval_question = (questions + "请按照'" + answers_format +
"'的格式回答。")
elif language == 'English':
questions = ''.join([
item['retrieval_question'].split('?')[0] + '?'
for item in predefined_needles
])
answers_format = ''.join([
item['retrieval_question'].split("'")[1].split('.')[0]
for item in predefined_needles
])
retrieval_question = (questions +
"Please answer in the format of '" +
answers_format + "'")
context_length = length - length_buffer
target_length_per_record = context_length - sum(
len(tokens)
for tokens in _get_tokens_from_context(needles))
target_length_per_record = max(target_length_per_record, 0)
accumulated_tokens = []
for line in lines:
tokens_current_line = _get_tokens_from_context(
line['text'])
accumulated_tokens.extend(tokens_current_line)
if len(accumulated_tokens) >= target_length_per_record:
break
processed_text = _generate_context(
accumulated_tokens[:target_length_per_record], depths,
needles)
processed_prompt = _generate_prompt(processed_text,
retrieval_question)
data['prompt'].append(processed_prompt)
data['answer'].append('*'.join(keywords) + '#' +
'*'.join(map(str, depths)))
dataset = Dataset.from_dict({
'prompt': data['prompt'],
'answer': data['answer'],
})
return dataset
class NeedleBenchParallelEvaluator(BaseEvaluator):
def score(self, predictions, gold):
if len(predictions) != len(gold):
return {'error': 'predictions and gold have different lengths'}
print('predictions:', predictions)
print('gold:', gold)
details = []
depths = [int(i) for i in gold[0].split('#')[1].split('*')]
scores_by_depth = {depth: 0 for depth in depths}
for prediction, reference in zip(predictions, gold):
print(reference)
keywords = reference.split('#')[0].split('*')
print(keywords)
for keyword, depth in zip(keywords, depths):
print('iterating:', keyword, depth)
if keyword in prediction:
print(f'{keyword} at depth {depth} is in {prediction}')
scores_by_depth[depth] += 100 / (len(predictions))
average_score = sum(scores_by_depth.values()) / len(scores_by_depth)
flattened_scores = {
'Depth' + str(depth): score
for depth, score in scores_by_depth.items()
}
result = {
**flattened_scores,
'details': details,
'average_score': average_score,
}
return result
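For reference, a minimal sketch of how `NeedleBenchParallelEvaluator.score` consumes the answer encoding produced above (keywords joined by '*' before the '#', insertion depths joined by '*' after it); the prediction text and keywords are hypothetical:

```python
# One prediction and its gold string: keywords 'apple' and 'Paris' at depths 10 and 50.
predictions = ['... the secret fruit is an apple ... the secret city is Paris ...']
gold = ['apple*Paris#10*50']

evaluator = NeedleBenchParallelEvaluator()
result = evaluator.score(predictions, gold)
# Each keyword found in its prediction adds 100 / len(predictions) to that depth's score,
# so here result['Depth10'] == result['Depth50'] == 100.0 and result['average_score'] == 100.0.
```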

View File

@@ -61,15 +61,28 @@ model_name_mapping = {
'qwen1.5-4b-chat-hf': 'Qwen-1.5-4B',
'qwen1.5-14b-chat-hf': 'Qwen-1.5-14B',
'qwen1.5-72b-chat-hf': 'Qwen-1.5-72B',
'qwen1.5-1.8b-chat-vllm': 'Qwen-1.5-1.8B',
'qwen1.5-14b-chat-vllm': 'Qwen-1.5-14B-vLLM',
'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM',
'glm4_notools': 'GLM-4',
'claude-3-opus': 'Claude-3-Opus',
'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
'internlm3-8b-instruct-turbomind': 'InternLM3-8B-Instruct',
'llama-3.1-8b-instruct-vllm': 'LLaMA-3.1-8B',
'qwen2.5-1.5b-instruct-vllm': 'Qwen-2.5-1.5B',
'qwen2.5-7b-instruct-vllm': 'Qwen-2.5-7B',
'qwen2.5-14b-instruct-vllm': 'Qwen-2.5-14B',
'qwen2.5-32b-instruct-vllm': 'Qwen-2.5-32B',
'qwen2_5-72b-instruct-vllm': 'Qwen-2.5-72B',
'gemma-3-4b-it-vllm': 'Gemma-3-4B',
'gemma-3-12b-it-vllm': 'Gemma-3-12B',
'gemma-3-27b-it-vllm': 'Gemma-3-27B',
'glm-4-9b-chat-vllm': 'GLM4-9B-Chat',
'llama-3.1-8b-instruct-vllm': 'LLaMA-3.1-8B',
'llama-3.1-70b-instruct-vllm': 'LLaMA-3.1-70B',
# Add more mappings as necessary
}
dataset_mapping_dict = {}
needle_counts = ['2', '3', '4', '5']
@@ -95,14 +108,19 @@ for t in types:
dataset_mapping_dict[key] = value
def calculate_elementwise_average(model_name, merged_df):
def calculate_elementwise_average(model_name, merged_df, mean=False):
score_columns = [col for col in merged_df.columns if col != 'dataset']
origin_columns = [col for col in score_columns if 'origin' in col]
parallel_columns = [col for col in score_columns if 'parallel' in col]
multi_columns = [col for col in score_columns if 'needle' in col]
if origin_columns and parallel_columns and multi_columns:
if origin_columns and parallel_columns and multi_columns and mean:
origin_avg = merged_df[origin_columns].mean(axis=1)
parallel_avg = merged_df[parallel_columns].mean(axis=1)
multi_avg = merged_df[multi_columns].mean(axis=1)
merged_df[model_name] = (origin_avg + parallel_avg + multi_avg) / 3
elif origin_columns and parallel_columns and multi_columns and not mean:
origin_avg = merged_df[origin_columns].mean(axis=1) * 0.4
parallel_avg = merged_df[parallel_columns].mean(axis=1) * 0.3
multi_avg = merged_df[multi_columns].mean(axis=1) * 0.3
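When `mean=False`, the three task families are combined with fixed weights rather than a plain mean; a minimal worked example with hypothetical per-family averages:

```python
origin_avg, parallel_avg, multi_avg = 80.0, 70.0, 60.0  # hypothetical averages
# Single-Retrieval weighted 0.4, Multi-Retrieval 0.3, Multi-Reasoning 0.3
overall = origin_avg * 0.4 + parallel_avg * 0.3 + multi_avg * 0.3  # -> 71.0
```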
@@ -185,7 +203,7 @@ def remove_empty_subfolders(plot_path):
if not os.listdir(folder_path):
shutil.rmtree(folder_path)
def save_results_to_plots(txt_results_save_path):
def save_results_to_plots(txt_results_save_path, mean=False):
content = read_after_specific_line_except_last(txt_results_save_path, 'raw format', 2)
parsed_data = parse_model_scores(content)
model_names = get_dict_model_names(parsed_data)
@@ -228,25 +246,25 @@ def save_results_to_plots(txt_results_save_path):
overall_dataset_abbrs = multi_dataset_abbrs + origin_dataset_abbrs + parallel_dataset_abbrs
overall_score_pic_path = os.path.join(plot_path, f'{model_name}_overall.png')
merged_df = merge_dataframes(model_name, overall_dataset_abbrs, parsed_data)
averaged_df = calculate_elementwise_average(model_name, merged_df)
averaged_df = calculate_elementwise_average(model_name, merged_df, mean=mean)
overall_score = visualize(averaged_df, overall_score_pic_path, model_name, 'Overall Score')
# Single-Retrieval
single_retrieval_score_pic_path = os.path.join(plot_path, f'{model_name}_single_retrieval_overall.png')
single_retrieval_merged_df = merge_dataframes(model_name, origin_dataset_abbrs, parsed_data)
single_retrieval_averaged_df = calculate_elementwise_average(model_name, single_retrieval_merged_df)
single_retrieval_averaged_df = calculate_elementwise_average(model_name, single_retrieval_merged_df, mean=mean)
single_retrieval_overall_score = visualize(single_retrieval_averaged_df, single_retrieval_score_pic_path, model_name, 'Single-Retrieval Overall Score')
# Multi-Retrieval
multi_retrieval_score_pic_path = os.path.join(plot_path, f'{model_name}_multi_retrieval_overall.png')
multi_retrieval_merged_df = merge_dataframes(model_name, parallel_dataset_abbrs, parsed_data)
multi_retrieval_averaged_df = calculate_elementwise_average(model_name, multi_retrieval_merged_df)
multi_retrieval_averaged_df = calculate_elementwise_average(model_name, multi_retrieval_merged_df, mean=mean)
multi_retrieval_overall_score = visualize(multi_retrieval_averaged_df, multi_retrieval_score_pic_path, model_name, 'Multi-Retrieval Overall Score')
# Multi-Reasoning
multi_reasoning_score_pic_path = os.path.join(plot_path, f'{model_name}_multi_reasoning_overall.png')
multi_reasoning_merged_df = merge_dataframes(model_name, multi_dataset_abbrs, parsed_data)
multi_reasoning_averaged_df = calculate_elementwise_average(model_name, multi_reasoning_merged_df)
multi_reasoning_averaged_df = calculate_elementwise_average(model_name, multi_reasoning_merged_df, mean=mean)
multi_reasoning_overall_score = visualize(multi_reasoning_averaged_df, multi_reasoning_score_pic_path, model_name, 'Multi-Reasoning Overall Score')
model_scores[model_name] = averaged_df
@@ -279,7 +297,7 @@ def visualize(df_raw, save_path: str,model_name: str ,dataset_type:str):
mean_scores = pivot_table.mean().values
overall_score = mean_scores.mean()
plt.figure(figsize=(10, 6))
plt.figure(figsize=(7.5, 4.5))
ax = plt.gca()
cmap = LinearSegmentedColormap.from_list(
'custom_cmap', ['#F0496E', '#EBB839', '#0CD79F'])
@@ -541,6 +559,42 @@ class NeedleBenchSummarizer(DefaultSummarizer):
# plot to show visualize results
save_results_to_plots(output_path)
class NeedleBenchSummarizerV2(NeedleBenchSummarizer):
"""NeedleBench summarizer V2 in OpenCompass.
This version calls save_results_to_plots with mean=True.
Args:
config (ConfigDict): The configuration object of the evaluation task. It's expected to be filled out at runtime.
dataset_abbrs (list[str], optional): Dataset abbreviations to be listed in the summary.
summary_groups (list): The dataset groups whose results need to be averaged out. For example, mmlu. Each item is a dict with
'name' (str) and 'subsets' (list of dataset abbrs), and optionally
'weights' if weighted average is needed.
prompt_db: A deprecated field.
"""
def summarize(
self,
output_path: str = None,
time_str: str = datetime.now().strftime('%Y%m%d_%H%M%S')): # noqa
raw_results, parsed_results, dataset_metrics, dataset_eval_mode = self._pick_up_results()
raw_results, parsed_results, dataset_metrics, dataset_eval_mode = \
self._calculate_group_metrics(raw_results, parsed_results, dataset_metrics, dataset_eval_mode)
table = self._format_table(parsed_results, dataset_metrics, dataset_eval_mode)
raw_txts = self._format_raw_txt(raw_results)
print(tabulate.tabulate(table, headers='firstrow'))
self._output_to_file(output_path, time_str, table, raw_txts)
if self.lark_reporter:
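# Post a Lark notification (in Chinese): "<user>'s detailed evaluation summary has been written to <abs path>"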
content = f'{getpass.getuser()}'
content += f'详细评测汇总已输出至 {osp.abspath(output_path)}'
self.lark_reporter.post(content)
if output_path is None:
output_path = osp.join(self.work_dir, 'summary', f'summary_{time_str}.txt')
# plot to show visualize results
save_results_to_plots(output_path, mean=True)
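As a usage note, a minimal sketch of selecting the new summarizer in an evaluation config, assuming the usual OpenCompass convention of a module-level `summarizer` dict (the import path here is an assumption):

```python
# Hypothetical config snippet; the exact module path may differ.
from opencompass.summarizers.needlebench import NeedleBenchSummarizerV2

summarizer = dict(type=NeedleBenchSummarizerV2)
```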
class NeedleBenchATCSummarizer(DefaultSummarizer):
"""NeedleBench-ATC summarizer in OpenCompass.