[Doc] Update NeedleBench Docs (#1330)
* update needlebench docs
* update model_name_mapping dict
* update README
* Update README_zh-CN.md

Co-authored-by: Songyang Zhang <tonysy@users.noreply.github.com>
parent 0a1c89e618
commit 104bddf647
@@ -70,6 +70,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.07.17\]** We are excited to announce the release of NeedleBench's [technical report](http://arxiv.org/abs/2407.11963). We invite you to visit our [support documentation](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html) for detailed evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which has **outstanding reasoning capability**, a **1M context window**, and **stronger tool use**; you can try the models in [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching between inference acceleration backends, enhancing the efficiency of the evaluation process. In addition to the default HuggingFace inference backend, it now also supports the popular backends [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm). This feature is available via a simple command-line switch and through deployment APIs. For detailed usage, see the [documentation](docs/en/advanced_guides/accelerator_intro.md). 🔥🔥🔥
- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
@@ -69,6 +69,7 @@
## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
- **\[2024.07.17\]** We have officially released the NeedleBench [technical report](http://arxiv.org/abs/2407.11963). You are welcome to visit our [documentation](https://opencompass.readthedocs.io/zh-cn/latest/advanced_guides/needleinahaystack_eval.html) for evaluation guidelines. 🔥🔥🔥
- **\[2024.07.04\]** OpenCompass now supports InternLM2.5, which offers outstanding reasoning performance, effective support for million-character ultra-long context, and an overall upgrade in tool-calling capability. You are welcome to visit [OpenCompass Config](https://github.com/open-compass/opencompass/tree/main/configs/models/hf_internlm) and [InternLM](https://github.com/InternLM/InternLM). 🔥🔥🔥
- **\[2024.06.20\]** OpenCompass now supports one-click switching of inference acceleration backends, making the evaluation process more efficient. In addition to the default HuggingFace inference backend, it also supports the popular [LMDeploy](https://github.com/InternLM/lmdeploy) and [vLLM](https://github.com/vllm-project/vllm), available both via a one-click command-line switch and via deployed API acceleration services. See the [documentation](docs/zh_cn/advanced_guides/accelerator_intro.md) for detailed usage. Feel free to try it out! 🔥🔥🔥
@@ -6,7 +6,7 @@ The Needle In A Haystack test (inspired by [NeedleInAHaystack](https://github.co
## Task Overview
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning:
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of increasingly challenging test scenarios to comprehensively evaluate the models' abilities in long text information extraction and reasoning. For a complete introduction, refer to our [technical report](https://arxiv.org/abs/2407.11963):
- **Single-Needle Retrieval Task (S-RT)**: Assesses an LLM's ability to extract a single key piece of information from a long text, testing its precision in recalling specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
##### Evaluation on a Slurm Cluster
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, as shown below:
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, as shown below:
```bash
# Slurm Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
If you only want to test the original NeedleInAHaystack task setup, you could change the dataset parameter to `needlebench_single_4k`, which corresponds to the single needle version of the NeedleInAHaystack test at 4k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also choose to evaluate a specific subdataset, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle 4k NeedleInAHaystack task. The parameter after `/` represents the subdataset, which can be found among the dataset variables of `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py` (an illustrative sketch follows the command below):
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
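For orientation, below is a minimal sketch of how such subdataset variables are typically laid out; it is illustrative only and not the actual contents of `needlebench_single_4k.py` (the `needlebench_en_datasets` name and the empty lists are placeholders).

```python
# Illustrative sketch only -- the real file at
# configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py defines its own
# dataset configs. The point is that the name after "/" on the command line refers to a
# dataset-list variable defined at module level in that file.
needlebench_zh_datasets = []  # Chinese single-needle 4k dataset configs (placeholder)
needlebench_en_datasets = []  # English single-needle 4k dataset configs (placeholder, assumed name)

# Evaluation configs elsewhere in this guide gather every such list into one, e.g.:
datasets = needlebench_zh_datasets + needlebench_en_datasets
```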
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before starting the evaluation:
@@ -104,7 +104,7 @@ Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool bef
```bash
pip install lmdeploy
```
This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 32` used to specify the Slurm partition name and the maximum number of worker processes.
This command initiates the evaluation process, with parameters `-p partition_name -q auto` and `--max-num-workers 16` used to specify the Slurm partition name and the maximum number of worker processes.
#### Evaluating Other `Huggingface` Models
@@ -112,8 +112,10 @@ For other models, we recommend writing an additional configuration file to modif
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text while leaving room for the 2000-token max_out_len below; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete response can be received in the multi-needle recall task
models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'
Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, for example:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task slicing strategy to improve evaluation efficiency.
Note, at this point, we do not need to pass in the `--dataset, --models, --summarizer` parameters, as we have already defined these configurations in the config file. You can manually adjust the `--max-num-workers` setting to control the number of parallel workers.
### Visualization
@@ -155,6 +157,16 @@ If you use this method, please add a reference:
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
    author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
    year={2024},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -6,7 +6,7 @@
## Task Overview
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test scenarios to comprehensively evaluate models' abilities in long-text information extraction and reasoning.
Within the `NeedleBench` framework of `OpenCompass`, we have designed a series of progressively more difficult test scenarios to comprehensively evaluate models' abilities in long-text information extraction and reasoning. For a complete introduction, see our [technical report](https://arxiv.org/abs/2407.11963).
- **Single-Needle Retrieval Task (S-RT)**: Evaluates an LLM's ability to extract a single key piece of information from a long text, testing its precise recall of specific details within broad narratives. This corresponds to the **original Needle In A Haystack test** setup.
@@ -77,11 +77,11 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
##### Evaluation on a Slurm Cluster
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000`, for example:
If using `Slurm`, you can add parameters such as `--slurm -p partition_name -q reserved --max-num-workers 16`, for example:
```bash
# Slurm Evaluation
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
##### Evaluating a Subdataset Only
@@ -89,13 +89,13 @@ python run.py --dataset needlebench_4k --models lmdeploy_internlm2_chat_7b --su
If you only want to test the original Needle In A Haystack task setup, you can change the dataset parameter to `needlebench_single_4k`, which corresponds to the single-needle version of the Needle In A Haystack test at 4k length:
```bash
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
You can also narrow it down to a specific subdataset, for example by changing the `--datasets` parameter to `needlebench_single_4k/needlebench_zh_datasets` to test only the Chinese version of the single-needle 4k Needle In A Haystack task. The parameter after `/` denotes the subdataset; the available subdataset variables can be found in `configs/datasets/needlebench/needlebench_4k/needlebench_single_4k.py`, for example:
```bash
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 32 --max-partition-size 8000
python run.py --dataset needlebench_single_4k/needlebench_zh_datasets --models lmdeploy_internlm2_chat_7b --summarizer needlebench/needlebench_4k_summarizer --slurm -p partition_name -q reserved --max-num-workers 16
```
Be sure to install the [LMDeploy](https://github.com/InternLM/lmdeploy) tool before running the evaluation.
@@ -112,8 +112,10 @@ pip install lmdeploy
```python
from mmengine.config import read_base
# We use mmengine.config to import variables from other configuration files
with read_base():
    from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    # from .models.hf_internlm.lmdeploy_internlm2_chat_7b import models as internlm2_chat_7b_200k
    from .models.hf_internlm.hf_internlm2_chat_7b import models as internlm2_chat_7b
    # Evaluate needlebench_4k, adjust the configuration to use 8k, 32k, 128k, 200k, or 1000k if necessary.
@@ -131,7 +133,7 @@ with read_base():
datasets = sum([v for k, v in locals().items() if ('datasets' in k)], [])
for m in internlm2_chat_7b:
    m['max_seq_len'] = 32768  # Ensure the InternLM2-7B model can receive the complete long text; adjust for other models according to their maximum supported sequence length.
    m['max_seq_len'] = 30768  # Ensure the InternLM2-7B model can receive the complete long text while leaving room for the 2000-token max_out_len below; adjust for other models according to their maximum supported sequence length.
    m['max_out_len'] = 2000  # Ensure the model's complete response can be received in the multi-needle recall task
models = internlm2_chat_7b
@@ -142,10 +144,10 @@ work_dir = './outputs/needlebench'
Once the test `config` file is written, we can pass the corresponding config file path to `run.py` on the command line, for example:
```bash
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 128 --max-partition-size 8000
python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved --max-num-workers 16
```
Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually adjust the `--max-partition-size` setting to achieve the best task-slicing strategy and improve evaluation efficiency.
Note that at this point we do not need to pass in the `--dataset, --models, --summarizer` parameters, since these configurations are already defined in the config file. You can manually adjust the `--max-num-workers` setting to control the number of parallel workers.
### Visualization
@@ -155,6 +157,16 @@ python run.py configs/eval_needlebench.py --slurm -p partition_name -q reserved
```bibtex
@misc{li2024needlebenchllmsretrievalreasoning,
    title={NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?},
    author={Mo Li and Songyang Zhang and Yunxin Liu and Kai Chen},
    year={2024},
    eprint={2407.11963},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2407.11963},
}

@misc{2023opencompass,
    title={OpenCompass: A Universal Evaluation Platform for Foundation Models},
    author={OpenCompass Contributors},
@@ -65,6 +65,8 @@ model_name_mapping = {
    'qwen1.5-72b-chat-vllm': 'Qwen-1.5-72B-vLLM',
    'glm4_notools': 'GLM-4',
    'claude-3-opus': 'Claude-3-Opus',
    'glm-4-9b-chat-1m-vllm': 'GLM4-9B-Chat-1M',
    'internlm2_5-7b-chat-1m-turbomind': 'InternLM2.5-7B-Chat-1M',
    # Add more mappings as necessary
}
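For context, here is a minimal sketch of how a display-name mapping like this is typically consumed when rendering summary tables; the `display_name` helper below is hypothetical (not an OpenCompass API), and the lookup-with-fallback pattern is the only point being illustrated.

```python
# Hypothetical helper (not part of OpenCompass): map an internal model
# abbreviation to its display name, falling back to the raw abbreviation
# when no entry exists in model_name_mapping above.
def display_name(model_abbr: str) -> str:
    return model_name_mapping.get(model_abbr, model_abbr)

print(display_name('glm-4-9b-chat-1m-vllm'))             # GLM4-9B-Chat-1M
print(display_name('internlm2_5-7b-chat-1m-turbomind'))  # InternLM2.5-7B-Chat-1M
print(display_name('some-unmapped-model'))                # some-unmapped-model
```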