Merge branch 'update_needlebench_docs' into needlebench_v2_pr

This commit is contained in:
Mor-Li 2025-05-13 14:19:48 +08:00
commit f7242fdea8
3 changed files with 96 additions and 153 deletions


@@ -1,52 +1,24 @@
# Needle In A Haystack Evaluation

## Introduction to the Needle In A Haystack Test

The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/blob/main/LLMNeedleHaystackTester.py)) is an evaluation method in which key information is randomly inserted into long texts to form prompts for large language models (LLMs). The test assesses whether LLMs can extract such key information from long texts, thereby evaluating their fundamental ability to comprehend and process long-context documents.

## Task Overview

Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively challenging evaluation tasks to comprehensively assess LLMs' long-text information extraction and reasoning capabilities. For a complete description, please refer to our [technical report](https://arxiv.org/abs/2407.11963).

- **Single-Needle Retrieval Task (S-RT)**: Evaluates an LLM's ability to retrieve a single piece of key information from a long text, testing precise recall of specific details within extensive narratives. This corresponds to the **original Needle In A Haystack test** setup (see the sketch after this list).
- **Multi-Needle Retrieval Task (M-RT)**: Explores an LLM's ability to retrieve multiple related pieces of information from long texts, simulating complex queries over comprehensive documents.
- **Multi-Needle Reasoning Task (M-RS)**: Assesses an LLM's ability to integrate multiple key pieces of information extracted from long texts for reasoning, requiring a comprehensive understanding of each key fragment.
- **Ancestral Trace Challenge (ATC)**: Uses "kinship needles" to test an LLM's ability to handle multi-layer logical challenges in realistic long texts. In the ATC task, a series of logical reasoning questions probes the model's memory and analysis of every detail; the irrelevant-text (haystack) setting is removed, all text is key information, and the model must combine and reason over all of it to answer correctly.
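To make the single-needle setup concrete, here is a minimal sketch of how such a prompt can be assembled. It is illustrative only: the function name, the toy haystack, and the question are assumptions for this example and are not part of the OpenCompass API.

```python
# Illustrative sketch of single-needle prompt construction (not OpenCompass code).
def build_needle_prompt(haystack: str, needle: str, depth_percent: float, question: str) -> str:
    """Insert `needle` at `depth_percent` (0-100) of `haystack`, then append the question."""
    pos = int(len(haystack) * depth_percent / 100)
    context = haystack[:pos] + needle + haystack[pos:]
    return f"{context}\n\nBased only on the document above, answer: {question}"


# Toy example: a short repeated filler text stands in for a real long document.
haystack = "This is background filler text that carries no key information. " * 200
needle = "The secret passcode mentioned in the briefing is 7342. "
prompt = build_needle_prompt(haystack, needle, depth_percent=50,
                             question="What is the secret passcode mentioned in the briefing?")
print(prompt[-200:])  # the question sits at the end of the prompt
```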
## Evaluation Steps

> Note: In the latest `OpenCompass` code, the NeedleBench dataset is loaded automatically from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), so you can run the evaluation commands directly and skip the manual download and placement steps below (a Huggingface-based download sketch follows the directory layout).
1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files in the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:
```
opencompass/
├── configs
├── docs
├── data
│ └── needlebench
│ ├── multi_needle_reasoning_en.json
│ ├── multi_needle_reasoning_zh.json
│ ├── names.json
│ ├── needles.jsonl
│ ├── PaulGrahamEssays.jsonl
│ ├── zh_finance.jsonl
│ ├── zh_game.jsonl
│ ├── zh_government.jsonl
│ ├── zh_movie.jsonl
│ ├── zh_tech.jsonl
│ ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
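If you still want a local copy of the raw data files (for example for offline runs), one alternative to the zip archive above is pulling the Huggingface dataset repository directly. This is only a sketch: it assumes the `huggingface_hub` package is installed, and the target directory is a placeholder.

```python
# Sketch: fetch the NeedleBench data files from the Huggingface dataset repo.
from huggingface_hub import snapshot_download  # pip install huggingface_hub

local_path = snapshot_download(
    repo_id="opencompass/NeedleBench",   # dataset repo referenced in the note above
    repo_type="dataset",
    local_dir="data/needlebench",        # placeholder target directory; adjust to your layout
)
print(f"NeedleBench files downloaded to: {local_path}")
```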
### `OpenCompass` Environment Setup
@@ -58,107 +30,107 @@ cd opencompass
pip install -e .
```
### Dataset Configuration

We have pre-configured settings for common context lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench`, so you can flexibly create datasets that fit your needs by adjusting the related parameters in the configuration files.

### Evaluation Example

#### Evaluating the `Qwen2.5-7B-Instruct` Model Deployed with `vLLM`

To evaluate the `Qwen2.5-7B-Instruct` model deployed with `vLLM` on all tasks of NeedleBench-128K, use the following command. It calls pre-defined model and dataset configuration files, so no additional configuration file is needed.

##### Local Evaluation

When evaluating locally, the command below uses all available GPUs on the machine. You can limit GPU visibility with the `CUDA_VISIBLE_DEVICES` environment variable; for instance, `CUDA_VISIBLE_DEVICES=0,1,2,3 python run.py ...` exposes only the first four GPUs to OpenCompass:
```bash
# Local evaluation
python run.py --dataset needlebench_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer
```
##### Evaluation on a Slurm Cluster

If you are using Slurm, add options such as `--slurm -p partition_name -q reserved --max-num-workers 16`:
```bash
# Slurm evaluation
python run.py --dataset needlebench_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating Specific Subsets

If you only want to run the original Needle In A Haystack setup, change the dataset parameter to `needlebench_single_128k`, the single-needle version of the test at 128k length:
```bash
python run.py --dataset needlebench_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also evaluate a specific sub-dataset, for example by setting the dataset argument to `needlebench_single_128k/needlebench_zh_datasets` to run only the Chinese single-needle 128k test. The part after `/` names the sub-dataset; the available sub-dataset variables are listed in `opencompass/configs/datasets/needlebench/needlebench_128k/needlebench_single_128k.py`:
```bash
python run.py --dataset needlebench_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Make sure [vLLM](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html) is installed before starting the evaluation:

```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```
In the Slurm commands above, `-p partition_name` specifies the Slurm partition, `-q` specifies the quota type (e.g. `auto` or `reserved`), and `--max-num-workers` caps the number of parallel worker processes.
#### Evaluating Other `Huggingface` Models

For other models, we recommend writing a separate config file that adjusts the model's `max_seq_len` and `max_out_len` so it can receive the complete long-text input, as in the prepared `examples/eval_needlebench.py` file shown below (a template for adding a custom model follows the example):
```python
from mmengine.config import read_base

# We use mmengine.config to import variables from other configuration files
with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b

    # Evaluate needlebench_32k; adjust the configuration to use 4k, 8k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # Only eval the original "needle in a haystack test" in needlebench_32k
    from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_single_32k import \
        needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import \
        needlebench_32k_summarizer as summarizer

    # Eval the Ancestral Tracing Challenge (ATC)
    # from opencompass.configs.datasets.needlebench.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so no summarizer import is needed

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B receives the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 4096

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
```
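If your model has no pre-defined config, you can declare it inline in the same file. The following is a hedged template based on the common pattern of OpenCompass HuggingFace model configs; the wrapper class name, field set, and all values are assumptions to verify against the model configs shipped with your OpenCompass version:

```python
# Hedged template for a custom HuggingFace chat model entry (all values are placeholders).
from opencompass.models import HuggingFacewithChatTemplate

models = [
    dict(
        type=HuggingFacewithChatTemplate,
        abbr='my-chat-model-hf',        # short name shown in result tables
        path='my-org/my-chat-model',    # local path or Huggingface model ID
        max_seq_len=32768,              # must cover the NeedleBench context length being evaluated
        max_out_len=4096,
        batch_size=1,
        run_cfg=dict(num_gpus=1),
    )
]
```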
Once the config file is written, pass its path to `run.py` on the command line:
```bash
python run.py examples/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note that you no longer need to pass `--dataset`, `--models`, or `--summarizer`, since these are defined in the config file; you can still adjust `--max-num-workers` to control the number of parallel workers.
### Visualization

The latest version has result visualization built into the `summarizer` implementation. You can find the corresponding plots in the `plots` directory of the output folder, with no need to manually visualize the scores across the various depths and lengths.
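The built-in plots are essentially depth-versus-length score heatmaps. For readers who want to re-plot scores themselves, a minimal hand-rolled equivalent could look like the sketch below; the score matrix here is random placeholder data, and this is not the summarizer's actual plotting code:

```python
# Illustrative depth-by-length heatmap of recall scores (placeholder data, not summarizer code).
import matplotlib.pyplot as plt
import numpy as np

depths = [0, 25, 50, 75, 100]                  # needle depth as % of the document
lengths = ["8k", "32k", "64k", "96k", "128k"]  # context lengths
scores = np.random.uniform(60, 100, size=(len(depths), len(lengths)))  # placeholder scores

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(scores, vmin=0, vmax=100, cmap="RdYlGn", aspect="auto")
ax.set_xticks(range(len(lengths)), labels=lengths)
ax.set_yticks(range(len(depths)), labels=[f"{d}%" for d in depths])
ax.set_xlabel("Context length")
ax.set_ylabel("Needle depth")
fig.colorbar(im, ax=ax, label="Score")
fig.tight_layout()
fig.savefig("needlebench_heatmap_example.png")
```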
### Citation

If you use NeedleBench, please cite us:
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
  title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
  author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
@@ -174,8 +146,6 @@ If you use this method, please add a reference:
  author={OpenCompass Contributors},
  howpublished={\url{https://github.com/open-compass/opencompass}},
  year={2023}
}

@misc{LLMTest_NeedleInAHaystack,
@@ -187,11 +157,10 @@ If you use this method, please add a reference:
@misc{wei2023skywork,
  title={Skywork: A More Open Bilingual Foundation Model},
  author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei L\"u and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
  year={2023},
  eprint={2310.19341},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```


@@ -16,37 +16,9 @@
- **Ancestral Trace Challenge (ATC)**: Uses "kinship needles" to test an LLM's ability to handle multi-layer logical challenges in realistic long texts. In the ATC task, a series of logical reasoning questions checks the model's memory and analysis of every detail in the long text; the irrelevant-text (haystack) setting is removed, all text is designed as key information, and the LLM must combine and reason over everything in the text to answer correctly.

## Evaluation Steps

> Note: In the latest OpenCompass code, the NeedleBench dataset is loaded automatically from the [Huggingface interface](https://huggingface.co/datasets/opencompass/NeedleBench), so you can run the evaluation commands directly and skip the manual download and placement steps below.

1. Download the dataset from [here](https://github.com/open-compass/opencompass/files/14741330/needlebench.zip).
2. Place the downloaded files under the `opencompass/data/needlebench/` directory. The expected file structure in the `needlebench` directory is shown below:
```
opencompass/
├── configs
├── docs
├── data
│ └── needlebench
│ ├── multi_needle_reasoning_en.json
│ ├── multi_needle_reasoning_zh.json
│ ├── names.json
│ ├── needles.jsonl
│ ├── PaulGrahamEssays.jsonl
│ ├── zh_finance.jsonl
│ ├── zh_game.jsonl
│ ├── zh_government.jsonl
│ ├── zh_movie.jsonl
│ ├── zh_tech.jsonl
│ ├── zh_general.jsonl
├── LICENSE
├── opencompass
├── outputs
├── run.py
├── more...
```
### `OpenCompass` Environment Setup
@@ -60,13 +32,13 @@ pip install -e .
### Dataset Configuration

We have pre-configured long-context settings for common lengths (4k, 8k, 32k, 128k, 200k, 1000k) in `opencompass/configs/datasets/needlebench`; you can flexibly create datasets that fit your needs by defining the related parameters in the configuration files.

### Evaluation Example

#### Evaluating the `Qwen2.5-7B-Instruct` Model Deployed with `vLLM`

For example, to evaluate the `Qwen2.5-7B-Instruct` model deployed with `vLLM` on all tasks of NeedleBench-128K, you can use the following command directly. It calls pre-defined model and dataset configuration files, so no extra configuration file needs to be written:

##### Local Evaluation
@@ -74,7 +46,7 @@ pip install -e .
```bash
# Local evaluation
python run.py --dataset needlebench_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer
```
##### Evaluation on a Slurm Cluster
@@ -83,60 +55,62 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
```bash
# Slurm evaluation
python run.py --dataset needlebench_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating Specific Subsets

If you only want to run the original Needle In A Haystack setup, change the dataset parameter to `needlebench_single_128k`, which corresponds to the single-needle version of the test at 128k length:
```bash
python run.py --dataset needlebench_single_128k --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also select a specific sub-dataset, for example by setting the dataset argument to `needlebench_single_128k/needlebench_zh_datasets` to run only the Chinese single-needle 128k test. The part after `/` names the sub-dataset; the available sub-dataset variables can be found in `opencompass/configs/datasets/needlebench/needlebench_128k/needlebench_single_128k.py`, for example:
```bash
python run.py --dataset needlebench_single_128k/needlebench_zh_datasets --models vllm_qwen2_5_7b_instruct_128k --summarizer needlebench/needlebench_128k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Make sure the [vLLM](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html) tool is installed before starting the evaluation:

```bash
# Install vLLM with CUDA 12.4.
# For other CUDA versions, see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html
pip install vllm
```

The Slurm command above starts the evaluation; the `-p partition_name` option specifies the Slurm partition name, `-q` specifies the quota type (the resource queue type, e.g. `auto` or `reserved`), and `--max-num-workers` sets the maximum number of worker processes.
#### Evaluating Other `Huggingface` Models

For other models, we recommend writing a separate run config that modifies the model's `max_seq_len` and `max_out_len` parameters so the model can receive the complete long-text input, as in the `examples/eval_needlebench.py` file here. The full content is as follows:
```python
from mmengine.config import read_base

# We use mmengine.config to import variables from other configuration files
with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b

    # Evaluate needlebench_32k; adjust the configuration to use 4k, 8k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # Only eval the original "needle in a haystack test" in needlebench_32k
    from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_single_32k import \
        needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import \
        needlebench_32k_summarizer as summarizer

    # Eval the Ancestral Tracing Challenge (ATC)
    # from opencompass.configs.datasets.needlebench.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so no summarizer import is needed

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure InternLM2-7B receives the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 4096

models = internlm2_chat_7b
```
@@ -155,6 +129,8 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved
In the latest code, result visualization is built into the `summarizer` implementation; you can find the corresponding plots in the `plots` directory of the output folder, so there is no need to manually visualize the scores across the various depths and lengths.

### Citation

If you use this method, please add a citation:
```bibtex


@@ -1,29 +1,27 @@
from mmengine.config import read_base

# We use mmengine.config to import variables from other config files
with read_base():
    from opencompass.configs.models.hf_internlm.hf_internlm2_chat_7b import \
        models as internlm2_chat_7b

    # Evaluate needlebench_32k; adjust the configuration to use 4k, 8k, 128k, 200k, or 1000k if necessary.
    # from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_32k import needlebench_datasets
    # from opencompass.configs.summarizers.needlebench import needlebench_32k_summarizer as summarizer

    # Only eval the original "needle in a haystack test" in needlebench_32k
    from opencompass.configs.datasets.needlebench.needlebench_32k.needlebench_single_32k import \
        needlebench_zh_datasets, needlebench_en_datasets
    from opencompass.configs.summarizers.needlebench import \
        needlebench_32k_summarizer as summarizer

    # Eval the Ancestral Tracing Challenge (ATC)
    # from opencompass.configs.datasets.needlebench.atc.atc_0shot_nocot_2_power_en import needlebench_datasets
    # ATC uses the default summarizer, so there is no need to import a summarizer

datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])

for m in internlm2_chat_7b:
    # Ensure the InternLM2-7B model can receive the full long text; adjust for other
    # models according to their maximum supported sequence length.
    m['max_seq_len'] = 32768
    m['max_out_len'] = 4096

models = internlm2_chat_7b

work_dir = './outputs/needlebench'
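
# Usage (per the documentation above, which references this file as examples/eval_needlebench.py):
#   python run.py examples/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
# Drop the Slurm-related flags to run locally on all visible GPUs.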